Re:

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri May 28 2004 - 17:17:49 GMT

On Fri, May 28, 2004 at 09:46:11AM -0700, adivey1@cox.net wrote:
> *Note* sorry for all the "undisclosed"s. I'm an intern with a gov't
> contracting agency so I don't know what all is allowed to be public.

We understand.  Hey, we are getting used to it.

> # swish-e -c undisclosed.conf -v 3 -S http
> 
> FileFilter		.html "/bin/cat"   "'%p'"

Doesn't that qualify for a "useless use of cat" award?
Why are you using that?

> 1 file indexed.  1,133 total bytes.  9 total words.
> Elapsed time: 00:00:08 CPU time: 00:00:00
> Indexing done!
> 

> - I don't know why that isn't working. Anyway, I switched to the
> spider.pl method. I didn't edit spider.pl at all, and here is my
> config file...

What's not working?  You mean it's not following any links?


> FileFilter 		.pdf pdftotext "'%p' -"
> FileFilter		.html "/bin/cat"   "'%p'"

You shouldn't need either of those.  Current versions of spider.pl
automatically filter .pdf files if you have pdftotext in your path.
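
As a rough sketch of what a config without those FileFilter lines might
look like, the helper below writes a minimal one for the spider.pl
(-S prog) method. The function name write_spider_conf and the file name
spider.conf are made up for illustration, and http://undisclosed/ is just
the placeholder host from the original message. It assumes swish-e can
find spider.pl by name; otherwise give IndexDir the full path.

```shell
# Write a minimal swish-e config for the spider.pl (-S prog) method.
# The function name and start URL are placeholders, not from a real setup.
write_spider_conf() {
    cat > "$1" <<'EOF'
# Run spider.pl as the document source (use a full path if it is not found)
IndexDir spider.pl
# Arguments handed to spider.pl: its built-in "default" config, then the start URL
SwishProgParameters default http://undisclosed/
# No FileFilter lines: PDF conversion is left to spider.pl and pdftotext
EOF
}
```

It would then be used something like:

   write_spider_conf spider.conf
   swish-e -c spider.conf -S prog -v 3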


> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser -  (no words indexed)

It may be that you are trying to filter (with FileFilter) something that
has already been filtered.

> - I'm completely lost. I wish there were some sample configurations... I've been reading Docs all day and don't know what I'm doing wrong. It can't be permissions because I'm running as root. Please help.

Here's the secret:

Look at spider.pl docs and see how to enable some of the debugging
features -- that will tell you what files are skipped and why.

Then run the spider outside of swish, something like:


   SPIDER_DEBUG=skipped /usr/local/swish-e/spider.pl default http://area_51.gov/ > out.txt

Then you can see what's skipped and why, and you can look at
out.txt to see what your content looks like.
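
One quick way to check out.txt is to list which URLs actually made it
into it. This is only a sketch: it assumes each document in out.txt
starts with a "Path-Name:" header line, as in the usual -S prog output
layout, and the helper names list_fetched and count_fetched are invented
here. A document whose body happens to contain a line starting with
"Path-Name:" would be miscounted.

```shell
# Print the URL of every document found in a spider output file,
# assuming each document begins with a "Path-Name: <url>" header line.
list_fetched() {
    grep '^Path-Name:' "$1" | sed 's/^Path-Name: *//'
}

# Print how many documents the output file contains (same assumption).
count_fetched() {
    grep -c '^Path-Name:' "$1"
}
```

For example:

   list_fetched out.txt
   count_fetched out.txt

If a URL you expected is missing from the list, the debug output should
say why it was skipped.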



-- 
Bill Moseley
moseley@hank.org
Received on Fri May 28 10:17:50 2004