Re: Indexing large nbrs of docs

From: Greg Caulton <gcaulton(at)not-real.sympatico.ca>
Date: Sat Jun 02 2001 - 04:38:49 GMT
Wow, I installed version 2.1 and ran the same command, i.e.

/export2/is/search/swish-e -i /export2/is/cerner -c /export2/is/search/user_cerner.config

and the index was built in 2 minutes :-)
I guess I was running out of memory before...

The -e option did not make it faster in my scenario.

thanks!

Greg


Bill Moseley wrote:
> 
> At 07:44 PM 05/31/01 -0700, Greg Caulton wrote:
> 
> >    Large, well, compared to my other indexes :-)
> >
> >    I wish to index a directory with 2800 Word docs, of which the total
> >combined size is 720MB.
> 
> I think Jose has indexed somewhere around 600,000 docs.  (Is that right, Jose?)
> 
> >    However the indexing is getting slower and slower as the number of
> >documents indexed increases - and I believe it will run for several
> >hours before slowing to a crawl.
> 
> Hard to tell without more information.  Are you running out of memory when
> indexing?
> 
> Swish 2.1 has a -e "economy" switch that uses less memory, but it's
> currently unclear how much it helps.  If it keeps you from swapping, then
> it's a big help.
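> 
> For example (the paths here are just placeholders, not your real ones),
> that would look like:
> 
>     swish-e -e -i /path/to/docs -c /path/to/your.config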
> 
> The other issue is filters.  If you are using a shell or (especially) a
> perl script with FileFilter then, yes, indexing can be very slow because
> swish runs the script for every document.
> 
> FileFilter is smarter now: with some filters you can skip the shell or
> perl wrapper and run the filter program directly.  Swish still uses popen
> for every document, so a shell is still spawned each time, but that's
> much, much faster than starting a perl interpreter for every document.
> 
>     FileFilter .doc "/usr/local/bin/catdoc" "-s8859-1 -d8859-1 '%p'"
> 
> Swish 2.1 has a new input method called "prog" where an external program
> feeds documents to swish.  That external program can be a perl script that
> is compiled only once and stays running while all the documents are
> indexed.
> 
> This can give a very significant increase in indexing speed if you *must*
> use a perl or shell script in your processing.
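> 
> Very roughly, such a script just prints a small header and then the
> document contents for each file.  Something like this rough sketch (the
> header names and the paths here are from memory and for illustration only;
> check the prog-bin examples for the exact format):
> 
>     #!/usr/bin/perl -w
>     use strict;
> 
>     # Feed each file in a directory to swish, one after another.
>     for my $file ( glob "/path/to/docs/*" ) {
>         open FH, $file or next;
>         my $content = do { local $/; <FH> };   # slurp the whole file
>         close FH;
> 
>         print "Path-Name: $file\n";
>         print "Content-Length: ", length($content), "\n";
>         print "\n";
>         print $content;
>     }
> 
> If I remember the switches right, you then run something like
> "swish-e -S prog -i ./feed.pl -c yourconfig" to have swish read its
> documents from the script.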
> 
> If you cannot avoid a shell or perl script for filtering, then you should
> probably try using the prog method.  There are examples in the prog-bin
> directory of the 2.1-dev distribution.  But if you are just indexing Word
> docs, then try that FileFilter command first and let us know what happens.
> 
> >    Is it possible to merge separate smaller indexes?
> 
> Yes, but only if your problem is running low on memory.  Otherwise it
> probably won't save you any time.
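> 
> (The merge switch is -M, if I remember right, along the lines of
> "swish-e -M index1 index2 merged_index", but check the docs for the exact
> argument order.)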
> 
> But you must first find out if you are running out of memory while indexing.
> 
> Bill Moseley
> mailto:moseley@hank.org
Received on Sat Jun 2 04:40:06 2001