Re: Adding files from external site - suggestions?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 19 2004 - 17:00:29 GMT
On Mon, Apr 19, 2004 at 12:18:25PM -0400, Rob de Santos AFANA wrote:
> So, is there a way via the configuration file to tell Swish-e to index
> this one directory via the "fs" method? and still do the rest of the
> site via spidering?

Yes and No.

I've been wanting to allow IndexDir to look at a scheme and decide what
indexing method to use, so you could say:

   IndexDir prog:///path/to/program  file:///path/to/dir

And it uses -S prog for the first and -S fs for the second.

But what you can do currently is use -S prog like this:

   IndexDir /path/to/spider  /path/to/DirTree /path/to/another/program

and Swish will run each program in turn.  I do that currently on one
site where I spider most of the pages, but a lot of the site is
generated dynamically from MySQL tables.  Instead of spidering those
pages via the web, I block them in the spider (test_url function) and
run a separate program for the MySQL data.

There are other ways, too.  On one site I build multiple indexes from
the same data, so I do:

   #!/bin/sh
   SPIDER_QUIET=1 ./spider.pl spiderconfig.pl | gzip > all.gz || exit
   gzip -dc all.gz | ./swish-e -v0 -E -c swish.conf -S prog -i stdin
   gzip -dc all.gz | ./swish-e -v0 -E -c stemming.conf -S prog -i stdin
   gzip -dc all.gz | ./swish-e -v0 -E -c metaphone.conf -S prog -i stdin

So you could also do something similar where you do:

   spider.pl spider.config | gzip > all.gz || exit 1
   DirTree.pl | gzip >> all.gz || exit 1
   gzip -dc all.gz | ./swish-e -v0 -c swish.config -S prog -i stdin

Not all of the config options that work with -S fs are available when
using -S prog, but I think it's just as easy (and more powerful) to
emulate those in the <insert favorite scripting language> script.
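
As a rough sketch of that kind of emulation (this is not code from the
Swish-e distribution), the shell function below walks a directory and
prints each file in the header-plus-body form that -S prog reads:
Path-Name and Content-Length headers, a blank line, then the document.
The function name emit_docs and the '*.html' filter are made up for
illustration; the filter stands in for what a suffix restriction such
as IndexOnly would do under -S fs.

```shell
#!/bin/sh
# Hypothetical feeder for "swish-e -S prog": emit every *.html file
# under a directory as headers, a blank line, then the file contents.
# emit_docs and the suffix filter are illustrative assumptions.

emit_docs() {
    dir=$1
    # Select files by suffix here instead of relying on -S fs options.
    find "$dir" -type f -name '*.html' | sort | while IFS= read -r file; do
        # Content-Length must be the exact byte count of the body.
        size=$(wc -c < "$file" | tr -d ' ')
        printf 'Path-Name: %s\n' "$file"
        printf 'Content-Length: %s\n' "$size"
        printf '\n'
        cat "$file"
    done
}
```

If that function were saved in a script (say dirtree.sh, a hypothetical
name) that ends by calling emit_docs "$1", it could be piped into
swish-e the same way as the spider output:

   ./dirtree.sh /path/to/dir | ./swish-e -c swish.conf -S prog -i stdin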

Now, you might compare that with using merge and see what is faster.
Merge doesn't require parsing the docs, but it does require extracting
out the data from the indexes, sorting and reindexing.  So it may not be
much faster.


-- 
Bill Moseley
moseley@hank.org
Received on Mon Apr 19 10:00:29 2004