Re:

From: <adivey1(at)not-real.cox.net>
Date: Wed Jun 02 2004 - 19:05:48 GMT
Thanks for responding. Yeah, I didn't realize until later that spider.pl has built-in filtering for PDFs and such. The documentation I found for spider.pl (http://www.swish-e.org/current/docs/spider.html) isn't very explicit about which file types it can read.

Where can I find better documentation for spider.pl? Or am I just blind?

By the way, about the ridiculous settings I've been using: when you're new to this, you copy the examples you see in the documentation. I found the FileFilter /bin/cat line somewhere and thought I needed it. Now I realize that it's just an example of the syntax :)
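
For reference, here is a minimal sketch of what the config might look like with both FileFilter lines dropped, so that spider.pl does the filtering itself. It assumes swish-e can locate spider.pl on its own (otherwise give the full path) and that pdftotext is in the path; the start URL is a placeholder and the file name spider.conf is made up for this example.

```
# spider.conf (hypothetical file name)
# Run the bundled spider as an external program instead of using -S http
IndexDir spider.pl

# Arguments handed to spider.pl: the built-in "default" config and a start URL
SwishProgParameters default http://undisclosed/

# No FileFilter lines here: current spider.pl filters .pdf on its own
```

This would be indexed with something like "swish-e -c spider.conf -S prog" rather than "-S http".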
> 
> From: Bill Moseley <moseley@hank.org>
> Date: 2004/05/28 Fri PM 01:00:18 EDT
> To: adivey1@cox.net
> CC: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
> Subject: Re: [SWISH-E]
> 
> On Fri, May 28, 2004 at 09:46:11AM -0700, adivey1@cox.net wrote:
> > *Note* sorry for all the "undisclosed"s. I'm an intern with a gov't
> > contracting agency so I don't know what all is allowed to be public.
> 
> We understand.  Hey, we are getting used to it.
> 
> > # swish-e -c undisclosed.conf -v 3 -S http
> > 
> > FileFilter		.html "/bin/cat"   "'%p'"
> 
> Doesn't that qualify for a "useless use of cat" award?
> Why are you using that?
> 
> > 1 file indexed.  1,133 total bytes.  9 total words.
> > Elapsed time: 00:00:08 CPU time: 00:00:00
> > Indexing done!
> > 
> 
> > - I don't know why that isn't working. Anyway, I switched to the
> > spider.pl method. I didn't edit spider.pl at all, and here is my
> > config file...
> 
> What's not working?  You mean it's not following any links?
> 
> 
> > FileFilter 		.pdf pdftotext "'%p' -"
> > FileFilter		.html "/bin/cat"   "'%p'"
> 
> You shouldn't need either of those.  In current versions of spider.pl it
> will automatically filter .pdf if you have pdftotext in your path.
> 
> 
> > Error: May not be a PDF file (continuing anyway)
> > Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > Error: Couldn't find trailer dictionary
> > Error: Couldn't read xref table
> > http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser -  (no words indexed)
> 
> It may be that you are trying to filter (with FileFilter) something that
> has already been filtered.
> 
> > - I'm completely lost. I wish there were some sample configurations... I've been reading Docs all day and don't know what I'm doing wrong. It can't be permissions because I'm running as root. Please help.
> 
> Here's the secret:
> 
> Look at spider.pl docs and see how to enable some of the debugging
> features -- that will tell you what files are skipped and why.
> 
> Then run the spider outside of swish, something like:
> 
> 
>    SPIDER_DEBUG=skipped /usr/local/swish-e/spider.pl default http://area_51.gov/ > out.txt
> 
> and then you can see what's skipped and why, and then you can look at
> out.txt and see what your content looks like.
> 
> 
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 
> 
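
Once out.txt exists from a debugging run like the one quoted above, a quick way to see which URLs the spider actually fetched is to pull out the per-document headers. This is only a sketch: it assumes each document in the spider's output starts with a "Path-Name:" header line (the format swish-e reads with -S prog), and list_fetched is a made-up helper name, not part of swish-e.

```shell
# list_fetched FILE
# Print the URL from every "Path-Name:" header line in FILE.
# Assumption: each fetched document is introduced by such a header.
# A body line that happens to start with "Path-Name:" would also match.
list_fetched() {
    sed -n 's/^Path-Name: *//p' "$1"
}
```

Running "list_fetched out.txt" and comparing the result with the pages expected in the index shows which ones never made it through. The skip messages themselves should appear on the terminal rather than in out.txt, assuming the spider writes its debug output to stderr.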
Received on Wed Jun 2 12:05:48 2004