Nick scribbled on 5/6/05 2:54 PM:
> I currently have swish-e 2.4.3 up and working. It appears to be working
> fine (with a small set of files) but indexing all my files is taking a
> really long time.
You're right, it should not be taking that long.
>
> I am somewhat confused at the best (for speed) way to setup indexing. I
> have read through all the docs (or at least I think I did), and I am still
> somewhat confused at the best way to setup the filters.
As luck would have it, I spent the morning working on the docs, so at least I
have it fresh in my head (which may not mean much).
swish-e does not know about non-text files like .pdf, .doc, .xls and .ppt. You
need some 3rd party programs to convert those to text so that swish-e can index
them. For the Windows distribution of swish-e, some of those 3rd party apps are
bundled in: xpdf and catdoc (see the note here:
http://swish-e.org/download/index.html). Since you're using Linux and mounting
the Windows volume remotely, you need to install the 3rd party apps for Linux. I
think the filters/README file talks about that (I haven't gotten to that doc
revision yet...).
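A quick way to see whether the converters are already installed is to ask the
shell. This is only a sketch: missing_tools is a helper name I made up, and I am
assuming the two programs you want are pdftotext (part of xpdf) and catdoc. Add
whatever other converters your file types need.

```shell
# Print the name of every tool given as an argument that is not on PATH.
# (missing_tools is a made-up helper name, not part of swish-e.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# pdftotext ships with xpdf; catdoc is its own package.
missing_tools pdftotext catdoc
```

If that prints nothing, both programs were found. Anything it does print still
needs to be installed.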
You're also calling swish-e with the default -S fs method (since you don't
specify one explicitly). You probably want -S prog, in order to get your docs
filtered with the 3rd party apps.
A few things I would try:
1. make sure the SWISH::Filter class is in your Perl include path:
     % export PERL5LIB=/usr/local/lib/swish-e   # bash, bourne shells
     % setenv PERL5LIB /usr/local/lib/swish-e   # csh, tcsh
2. index with this command instead:
     swish-e -c /etc/swish.conf -S prog -i DirTree.pl
3. if you're going to index every night, but the binary docs (.pdf, .doc, etc.)
   don't change that often, consider caching the filtered output. The filtering
   causes the most overhead: a new forked process for each doc.
   You can cache output with the DirTree.pl script, or roll your own.
4. like I mentioned, I'm working on the docs even now, so if there are specific
   ways you think that they could be improved, post back to the list.
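If you do roll your own cache for step 3, the idea could look roughly like this
in bash. This is a sketch under assumptions, not how DirTree.pl does it:
CACHE_DIR, convert, cached_text and emit_doc are all names I made up, and
convert is just a placeholder that runs cat. The header lines follow my reading
of the -S prog format (Path-Name, Content-Length, optional Document-Type, then a
blank line and the content), so check them against the SWISH-RUN docs for your
version.

```shell
#!/bin/bash
# Hypothetical cache location; override it in the environment if you like.
CACHE_DIR="${CACHE_DIR:-/tmp/swish-cache}"

# Placeholder converter: takes a file path, writes plain text to stdout.
# For PDFs with xpdf installed you might replace the body with:
#   pdftotext "$1" -
convert() {
    cat "$1"
}

# Print the path of a cached text copy of $1, running the converter only
# when there is no cached copy yet or the source file is newer than it.
cached_text() {
    src=$1
    # Cache key is a checksum of the path (collisions are possible but rare).
    key=$(printf '%s' "$src" | cksum | cut -d' ' -f1)
    out="$CACHE_DIR/$key.txt"
    mkdir -p "$CACHE_DIR"
    if [ ! -f "$out" ] || [ "$src" -nt "$out" ]; then
        convert "$src" > "$out"
    fi
    printf '%s\n' "$out"
}

# Emit one document for swish-e: headers, a blank line, then the text.
emit_doc() {
    src=$1
    txt=$(cached_text "$src")
    # Content-Length is the size of the text in bytes.
    size=$(wc -c < "$txt" | tr -d '[:space:]')
    printf 'Path-Name: %s\nContent-Length: %s\nDocument-Type: TXT*\n\n' \
        "$src" "$size"
    cat "$txt"
}
```

To use it for real you would loop over your files (with find, for example),
call emit_doc on each one, and hand the script to swish-e with -S prog -i. On
the nightly run only new or changed files pay for the converter process. The
rest are read straight from the cache.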
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri May 6 13:09:54 2005