
Re: [swish-e] index a list of files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jul 09 2008 - 17:42:59 GMT
On Wed, Jul 09, 2008 at 09:47:13AM -0400, Brad Bauer wrote:
> 
> Perhaps there is something else at play slowing it down.  While trying to
> get the spider working I reduced the SwishSpiderConfig.pl settings to a bare
> minimum, so any timings are at their default.  What are the default timings
> the spider uses?  Can you recommend good options for the timing related
> settings?

You would have to check, but I think there's a default delay between
requests for the spider (a questionable attempt at making the spider
be nice to the web server).  So make sure "delay_sec" is set to zero.

For local spidering, I'd set delay_sec to zero and make sure
keep-alives are enabled on both the web server and the spider.
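
Off the top of my head, a bare-bones server entry in
SwishSpiderConfig.pl would look something like this (the base_url and
email are placeholders, and the option names are from memory, so
verify them against the spider.pl documentation that ships with your
version):

    # Minimal sketch of a SwishSpiderConfig.pl entry -- base_url and
    # email are placeholders; delay_sec and keep_alive are the options
    # discussed above, but double-check the names and defaults.
    @servers = (
        {
            base_url   => 'http://localhost/',   # hypothetical local site
            email      => 'you@example.com',     # contact address for the spider
            delay_sec  => 0,                     # no pause between requests
            keep_alive => 1,                     # reuse the HTTP connection
        },
    );

    1;   # the config file should return true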

Again, I can't imagine that fetching the content over HTTP on a local
machine is so significant that it's the problem.

> I'll look into modifying spider.pl, but I am no perl guru so I might take an
> easier route: I am thinking I can just adjust SwishSpiderConfig.pl#test_url
> to append each .pdf URL it encounters to a log file and return false for
> that file.  Then I will probably modify file.pl (since it is such a simple
> file) to index the pdfs saved in the log file.  Do you see any potential
> issues with that?

Whatever works for you.  Path of least resistance is always good.  I
would first just make sure there are no delays and that you are
comparing apples to apples.  I'd "spider" just a single pdf file (so
it only indexes one file) and compare that to indexing the same file
with the file system method.  Make sure the resulting indexes have the
same content.  You have to expect some additional overhead with
spidering (especially with a single file, where keep-alive doesn't do
any good).
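
As for the test_url idea, something like this is probably all it
takes (untested and from memory, so double-check the callback
arguments against the spider.pl docs; the log path is just an
example):

    # Wire it into the server hash with:  test_url => \&test_url
    sub test_url {
        my ( $uri, $server ) = @_;

        if ( $uri->path =~ /\.pdf$/i ) {
            # Append the PDF's URL to a log for indexing later.
            open my $log, '>>', '/tmp/pdf_urls.txt' or die "open: $!";
            print $log $uri->as_string, "\n";
            close $log;
            return 0;   # skip it -- don't spider this URL
        }
        return 1;       # spider everything else as usual
    }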

But if there's a huge difference between those two indexing runs, then
I'd start wondering where the time is going.  Maybe back off and see
how long wget or "GET" (which uses Perl's LWP, same as the spider)
takes to fetch the pdf.
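
For example, something like this (the URL is made up -- point it at
one of your real PDFs) will tell you how long LWP alone needs for a
single fetch:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Time::HiRes qw(time);

    # Time one HTTP fetch with LWP, the same library the spider uses.
    my $ua  = LWP::UserAgent->new;
    my $t0  = time;
    my $res = $ua->get('http://localhost/docs/sample.pdf');
    printf "%s, %d bytes, %.3f seconds\n",
        $res->status_line, length( $res->content ), time - $t0;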

-- 
Bill Moseley
moseley@hank.org
