Re: [swish-e] index a list of files

From: Bill Moseley <moseley(at)>
Date: Wed Jul 09 2008 - 05:01:19 GMT
On Tue, Jul 08, 2008 at 10:34:29PM -0400, Brad Bauer wrote:
> RE: Caching - I am attempting to avoid downloading pdfs since it is very
> time consuming compared to the fs method. (They do, after all, already exist
> on the server)  Using the spider is taking 20+ minutes for only a small
> section of the site, where as using the fs setup I am able to index the
> entire server in about 5 minutes.

The web server is running on the same machine?  I find it hard to
believe that fetching the files through the web server would be slow
enough to make that much difference.  I'd think most of the time (in
either mode) would go to extracting the pdf text and indexing it.
Fetching over http vs. the file system should be background noise by
comparison.

But, I'm just guessing.

The nice thing about the -S prog method is that you can, well, write a
program to do your indexing.  So you could use the spider, and when it
returns a link to a pdf you could abort the fetch and grab the content
from the local disk instead.  It might take a bit of tweaking of the
spider, but it's very possible and not too hard.
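
To make the idea concrete, here is a rough sketch of the "grab it from
local disk" half, written as a standalone Python helper rather than as a
change to the spider itself.  Everything in it is an assumption for
illustration: the function names, the doc_root argument, and the idea
that a URL path maps directly onto a directory under the document root.
The record layout (Path-Name, Content-Length, Last-Mtime, then a blank
line and the content) is my understanding of what -S prog expects on
stdin; check it against the docs before relying on it.  Raw pdf bytes
would still need a filter to turn them into text.

```python
import os
import sys
from urllib.parse import urlparse, unquote


def url_to_local_path(url, doc_root):
    """Map a URL's path onto doc_root (hypothetical layout).

    Returns None if the resolved path would fall outside doc_root.
    """
    rel = unquote(urlparse(url).path).lstrip("/")
    root = os.path.realpath(doc_root)
    candidate = os.path.realpath(os.path.join(root, rel))
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate


def format_record(path_name, content, mtime):
    """Build one document record: headers, a blank line, then the bytes."""
    headers = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(content)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        "\n"
    )
    return headers.encode("utf-8") + content


def emit_local_copy(url, doc_root, out=None):
    """Write the local copy of `url` as a record; return False if not found."""
    if out is None:
        out = sys.stdout.buffer
    path = url_to_local_path(url, doc_root)
    if path is None or not os.path.isfile(path):
        return False
    with open(path, "rb") as fh:
        content = fh.read()
    # Keep the URL as Path-Name so search results still link to the site.
    out.write(format_record(url, content, os.path.getmtime(path)))
    return True
```

The spider would call something like emit_local_copy() when it sees a
link ending in .pdf, and fall back to the normal http fetch whenever it
returns False.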

Is your content dynamically generated?  Is that why you are spidering
instead of "spidering" the file system?

Bill Moseley

Received on Wed Jul 9 01:01:15 2008