Re: maximum size of files

From: Dave Stevens <dstevens(at)not-real.roaddog.com>
Date: Wed Jan 07 2004 - 09:42:28 GMT
> All released versions of SWISH-E support index files up to 2 GB.
>
> The index file size will depend completely on what you choose to store
> in the index.  My only advice is to test SWISH-E with your data.


Indeed.  Though my current project is better suited to Nutch, I'm still
using SWISH-E for proof of concept and for early-adopter users.  I have
a couple (of six) indices of just over a million pages total that are
near the 2 GB limit, and I found out the hard way about that limit and
about how many files can be in an index.  ;-)

Basically, the more non-text or non-HTML docs (pdf, xl, doc) and the
larger the description text, the bigger the index file.  I worked around
the file size and crawl duration limitations (some crawls are 120 hrs
plus) by segmenting the indices by content type, a sort of poor man's
DMOZ style of organizing.
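
For anyone wanting to try the same thing, here is a rough sketch in Python of the segmenting idea: bucket the crawled documents by content type so that each bucket can go into its own index, and flag any index file that is getting close to the 2 GB limit.  The extension-to-segment map, the function names, and the 90% threshold are all made up for illustration; none of this is part of SWISH-E itself, and it is not the exact setup described above.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to index segment name.
SEGMENTS = {
    ".html": "html", ".htm": "html", ".txt": "text",
    ".pdf": "pdf", ".xls": "office", ".doc": "office",
}

# 2 GB index file limit mentioned earlier in the thread.
LIMIT_BYTES = 2 * 1024**3


def segment_by_type(paths):
    """Group document paths into one bucket per content type."""
    buckets = defaultdict(list)
    for p in paths:
        ext = PurePosixPath(p).suffix.lower()
        # Unknown extensions fall into a catch-all "other" segment.
        buckets[SEGMENTS.get(ext, "other")].append(p)
    return dict(buckets)


def near_limit(size_bytes, threshold=0.9):
    """True when an index file has used at least `threshold` of the limit."""
    return size_bytes >= threshold * LIMIT_BYTES
```

Each bucket returned by `segment_by_type` would then be fed to a separate indexing run, and `near_limit` could be called on the resulting index file sizes to decide when a segment needs splitting further.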

SWISH-E works very well in its intended application.  I've got both the
current snap of Nutch and the 2.4.0 release of SWISH-E making the same
crawls for comparison's sake.  Two vastly different tools for different
solutions.

Dave
Received on Wed Jan 7 09:43:00 2004