HTTP Indexing Times for Different OSs and Swish Versions

From: Deane Barker <deane.barker(at)not-real.bankfirstcorp.com>
Date: Fri Jan 11 2002 - 20:12:08 GMT
All: 

Here's an interesting experiment.  I'm running Swish-E on two machines that
are side-by-side on my desk.  Both have the latest versions of Swish-E from
the download page for their respective OS (downloaded and installed on both
in the last five days).  Here are the hardware specs and what the "Swish-E
-V" (version label) command returns for both of them:

Machine A:  Athlon 750 MHz, 224MB RAM running Mandrake Linux 8.1  ("SWISH-E
2.0") 

Machine B:  Athlon 1 GHz, 384MB RAM running Windows XP Home  ("SWISH-E
2.1-dev-24") 

These two machines share the same internet connection (both connected to the
same gateway -- not one sharing the other's connection), and both present
comparable performance when surfing the web and downloading files.

Okay, they both have Apache, so I set them both loose on the Apache manual
via the file system.  Here's how long it took:

Machine A (Linux / Swish 2.0):  3 seconds 

Machine B (Windows / Swish 2.1-dev-24):  3 seconds 

Perfectly comparable...when indexing via the file system.

Here's where it gets interesting: I set up the swishspider and unleashed
them both on the same web site (very small -- just 19 unique pages) via HTTP
crawl at roughly the same time (one right after the other, late at night when
traffic was low; the web server logs show that the spider was the only active
session on the web site at the time).

The time differences were massive: 

Machine A (Linux / Swish 2.0):  21 minutes  (that's MINUTES, not seconds...)

Machine B (Windows / Swish 2.1-dev-24):  14 seconds 

This is not a fluke -- I did the same test several times and got the same
result. 
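
To put those figures on a per-page basis, here is a quick back-of-the-envelope
calculation.  It uses only the numbers reported above (19 pages, 21 minutes,
14 seconds); the variable and function names are my own for illustration and
have nothing to do with Swish-E itself.

```python
# Figures taken from the crawl timings reported above.
PAGES = 19
LINUX_SECONDS = 21 * 60      # 21 minutes on Machine A
WINDOWS_SECONDS = 14         # 14 seconds on Machine B


def per_page(total_seconds, pages=PAGES):
    """Average wall-clock seconds spent per page."""
    return total_seconds / pages


linux_avg = per_page(LINUX_SECONDS)
windows_avg = per_page(WINDOWS_SECONDS)
slowdown = LINUX_SECONDS / WINDOWS_SECONDS

print(f"Linux:   {linux_avg:.1f} s/page")    # about 66.3 s/page
print(f"Windows: {windows_avg:.2f} s/page")  # about 0.74 s/page
print(f"Ratio:   {slowdown:.0f}x")           # 90x
```

So the Linux crawl averages a little over a minute per page, against well
under a second per page on Windows.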

The result is also informally corroborated elsewhere.  I have Swish-E running
at work on Windows 2000 Professional, and a friend has it running on Mandrake
Linux 8.1, both with the same version numbers (Windows at 2.1 dev, Linux at
2.0).  In both cases, HTTP crawl times are in line with the respective
figures above.

So, where does the difference come from?  It has to be something to do with
the spider since they have the same performance indexing via the file
system.  Is it:

(1)  A difference in the versions?  I know that spidering and indexing time
was improved in the new release, but improved THAT much?  Wow.

(2)  A difference in the underlying operating systems?  Could Windows and
Linux handle HTTP requests and HTML parsing THAT differently?

I researched this on the discussion group and found this post:

http://swish-e.org/archive/2122.html

This indicates that the system will page at the tail end of the crawl when
it says "Writing index entries...".  However, that's not the problem here.
The Linux machine is just slow from page to page when indexing.  The output
says something like, "Retrieving page http://blah.blah..." and it just
sits...and sits...and sits...and then moves on.
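
One way to check whether the stall is in the page retrieval itself, rather
than in anything Swish-E does afterwards, would be to time plain HTTP fetches
of the same pages from each machine.  The following is only a rough sketch
under that assumption: it is not part of Swish-E or swishspider, and the
helper names (fetch, time_fetches) are hypothetical.

```python
import time
from urllib.request import urlopen


def fetch(url, timeout=120):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def time_fetches(urls, fetcher=fetch):
    """Fetch each URL in order; return a list of (url, seconds, size_in_bytes)."""
    results = []
    for url in urls:
        start = time.perf_counter()
        body = fetcher(url)
        elapsed = time.perf_counter() - start
        results.append((url, elapsed, len(body)))
    return results


def report(results):
    """Format the timings as one line per page."""
    return "\n".join(
        f"{seconds:8.2f} s  {size:8d} bytes  {url}"
        for url, seconds, size in results
    )
```

Calling print(report(time_fetches(list_of_urls))) on both machines, with the
same list of page URLs, would show whether the Linux box is also slow when
fetching pages outside of the spider.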

Any ideas? 

Deane Barker 
The Sling and Rock Design Group 
www.slingandrock.com 

 
Received on Fri Jan 11 20:12:44 2002