*Note* Sorry for all the "undisclosed" placeholders. I'm an intern with a government contracting agency, so I don't know what I'm allowed to make public.
This first method (-S http) doesn't go into the directories at all. It only fetches robots.txt and the root page.
- Here is my configuration file, preceded by the command I run it with...
# swish-e -c undisclosed.conf -v 3 -S http
IndexFile undisclosed.index
IndexName "Undisclosed"
IndexPointer http://undisclosed/
IndexAdmin webmaster
IndexDir http://undisclosed/
IndexContents HTML* .htm .html
IndexContents TXT* .txt .pdf
StoreDescription HTML* <body> 20000
StoreDescription TXT* <body> 20
IgnoreWords www http a an the of and or
MetaNames swishdocpath swishtitle
FileFilter .pdf pdftotext "'%p' -"
FileFilter .html "/bin/cat" "'%p'"
- My results are as follows...
Parsing config file 'undisclosed.conf'
Indexing Data Source: "HTTP-Crawler"
Indexing "http://undisclosed/"
Now fetching [http://undisclosed/robots.txt]...Status: 404.
retrieving http://undisclosed/ (0)...
sleeping 5 seconds before fetching http://undisclosed/
Now fetching [http://undisclosed/]...Status: 200. text/html
- Using DEFAULT (HTML2) parser - (3 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 6 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
6 unique words indexed.
5 properties sorted.
1 file indexed. 1,133 total bytes. 9 total words.
Elapsed time: 00:00:08 CPU time: 00:00:00
Indexing done!
- I don't know why it stops after the root page. Anyway, I switched to the spider.pl method (-S prog). I didn't edit spider.pl at all, and here is my config file...
IndexFile undisclosed.index
IndexName "Undisclosed"
IndexPointer http://undisclosed/
IndexAdmin webmaster
IndexDir spider.pl
SwishProgParameters default http://undisclosed/
IndexContents HTML* .htm .html
IndexContents TXT* .txt .pdf
StoreDescription HTML* <body> 20000
StoreDescription TXT* <body> 20
IgnoreWords www http a an the of and or
MetaNames swishdocpath swishtitle
FileFilter .pdf pdftotext "'%p' -"
FileFilter .html "/bin/cat" "'%p'"
- But this way, it doesn't get any PDFs! See the results...
# swish-e -c undisclosed.conf -v 3 -S prog
Parsing config file 'undisclosed.conf'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
http://undisclosed/no_flash.html - Using HTML2 parser - (130 words)
(about 15 other html pages...)
http://undisclosed/working/index.html - Using HTML2 parser - (315 words)
--- *** THEN THE SYSTEM HANGS HERE FOR ABOUT 2 OR 3 MINUTES! *** ---
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser - (no words indexed)
- Why does it use the HTML2 parser for a PDF? I have specified FileFilter .pdf pdftotext and IndexContents TXT* .txt .pdf! So let's tweak the config file then!
IndexContents HTML* .htm .html pdf
IndexContents TXT* .txt
FileFilter .pdf pdf2html "'%p' -"
(everything else the same)
- Then I'll run swish-e again and get the following...
# swish-e -c undisclosed.conf -v 3 -S prog
Parsing config file 'undisclosed.conf'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
http://undisclosed/no_flash.html - Using HTML2 parser - (130 words)
(about 15 other html pages...)
http://undisclosed/working/index.html - Using HTML2 parser - (315 words)
--- *** THEN THE SYSTEM HANGS HERE FOR ABOUT 5 OR 6 MINUTES! *** ---
sh: line 1: pdf2html: command not found
http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser - (no words indexed)
- Pdf2HTML is in SWISH-E's filters directory!
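Two checks that might narrow this down, as a minimal sketch assuming a POSIX shell with `head -c` available. Both helper names (check_filter, looks_like_pdf) are hypothetical and not part of SWISH-E. The first reports whether the shell can resolve a filter program by name, which is what "command not found" is about. The second tests whether a saved file starts with the "%PDF-" signature. As far as I know, the xpdf tools print "May not be a PDF file" when they cannot find that header, though they search a bit further into the file than this simplified check does.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither is part of SWISH-E.

# Report whether the shell can resolve a program name through PATH.
check_filter() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

# Succeed only if the file begins with the "%PDF-" signature.
# (Simplification: only the first five bytes are examined.)
looks_like_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# The two filter programs named in the configs above.
for prog in pdftotext pdf2html; do
    check_filter "$prog"
done
```

looks_like_pdf could be run against a copy of the PDF downloaded by hand, to see whether the file itself is intact before it ever reaches the filter.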
- I'm completely lost. I wish there were some sample configurations... I've been reading the docs all day and don't know what I'm doing wrong. It can't be permissions because I'm running as root. Please help.
Thanks
Received on Fri May 28 09:47:50 2004