Thanks for responding. Yeah, I didn't realize until later that spider.pl has the ability to filter PDFs and such built in. The documentation I found for spider.pl (http://www.swish-e.org/current/docs/spider.html) didn't seem very explicit about which types of files it can read and whatnot.
Where can I find better documentation for spider.pl? Or am I just blind?
By the way, about the ridiculous settings I've been using: when you're new to this, you use the examples you see in the documentation. I found the FileFilter /bin/cat line somewhere and thought I needed it. Now I realize that it's just an example of the syntax :)
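
For anyone else who ends up here: per the reply quoted below, spider.pl filters .pdf on its own as long as pdftotext is in the path. A minimal sketch of how to check that, assuming a POSIX shell (have_cmd is just a made-up helper name, not something that ships with swish-e):

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): succeeds if the named
# command can be found on the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether pdftotext is visible to programs started from this shell.
if have_cmd pdftotext; then
    echo "pdftotext found at: $(command -v pdftotext)"
else
    echo "pdftotext not found on PATH"
fi
```

Run it as the same user, and from the same environment, that runs the spider, since the PATH can differ between shells.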
>
> From: Bill Moseley <moseley@hank.org>
> Date: 2004/05/28 Fri PM 01:00:18 EDT
> To: adivey1@cox.net
> CC: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
> Subject: Re: [SWISH-E]
>
> On Fri, May 28, 2004 at 09:46:11AM -0700, adivey1@cox.net wrote:
> > *Note* sorry for all the "undisclosed"s. I'm an intern with a gov't
> > contracting agency so I don't know what all is allowed to be public.
>
> We understand. Hey, we are getting used to it.
>
> > # swish-e -c undisclosed.conf -v 3 -S http
> >
> > FileFilter .html "/bin/cat" "'%p'"
>
> Doesn't that qualify for a "useless use of cat" award?
> Why are you using that?
>
> > 1 file indexed. 1,133 total bytes. 9 total words.
> > Elapsed time: 00:00:08 CPU time: 00:00:00
> > Indexing done!
> >
>
> > - I don't know why that isn't working. Anyway, I switched to the
> > spider.pl method. I didn't edit spider.pl at all, and here is my
> > config file...
>
> What's not working? You mean it's not following any links?
>
>
> > FileFilter .pdf pdftotext "'%p' -"
> > FileFilter .html "/bin/cat" "'%p'"
>
> You shouldn't need either of those. Current versions of spider.pl
> will automatically filter .pdf if you have pdftotext in your path.
>
>
> > Error: May not be a PDF file (continuing anyway)
> > Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > Error: Couldn't find trailer dictionary
> > Error: Couldn't read xref table
> > http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser - (no words indexed)
>
> It may be that you are trying to filter (with FileFilter) something that
> has already been filtered.
>
> > - I'm completely lost. I wish there were some sample configurations... I've been reading docs all day and don't know what I'm doing wrong. It can't be permissions because I'm running as root. Please help.
>
> Here's the secret:
>
> Look at spider.pl docs and see how to enable some of the debugging
> features -- that will tell you what files are skipped and why.
>
> Then run the spider outside of swish, something like:
>
>
> SPIDER_DEBUG=skipped /usr/local/swish-e/spider.pl default http://area_51.gov/ > out.txt
>
> and then you can see what's skipped and why, and then you can look at
> out.txt and see what your content looks like.
>
>
>
> --
> Bill Moseley
> moseley@hank.org
>
>
Received on Wed Jun 2 12:05:48 2004