
Re: Database searching with swish-e

From: Walter Lewis <lewisw(at)not-real.hhpl.on.ca>
Date: Wed Jan 12 2005 - 15:24:48 GMT
dcha099@ec.auckland.ac.nz wrote:

> I am trying to use Swish-e to replace regular database querying. Since there are
> millions of records, there might be a problem with having millions of files on
> certain filesystems. Is there an easy solution for this?

I believe the standard practice is to set up a script that generates 
"HTML" pages on the fly (without writing them to the filesystem). These 
are then fed to the spider program (I haven't needed to touch the 
spider code at all).

You end up with something like this in the conf (the indexing 
configuration) file:
	
	IndexDir spider.pl ./NewsDB2.pl

where NewsDB2.pl makes a connection to the database, transforms the 
results record by record into pseudo-HTML pages, and passes them to 
STDOUT. The spider picks them up one at a time and the indexing takes 
place.
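
For concreteness, here is a minimal sketch of what such a script might 
look like. The DSN, credentials, and the news/id/headline/body schema 
are all invented, and the header lines follow the prog protocol 
described in the docs mentioned below (Path-Name, Content-Length, a 
blank line, then the document itself):

	#!/usr/bin/perl
	# NewsDB2.pl -- minimal sketch of a swish-e "prog" input script.
	use strict;
	use warnings;
	use DBI;

	# Hypothetical DSN, credentials, table and column names.
	my $dbh = DBI->connect( 'dbi:mysql:newsdb', 'user', 'password',
	                        { RaiseError => 1 } );

	my $sth = $dbh->prepare('SELECT id, headline, body FROM news');
	$sth->execute;

	while ( my ( $id, $headline, $body ) = $sth->fetchrow_array ) {

	    # Wrap the record in a pseudo-HTML page.
	    my $doc = "<html><head><title>$headline</title></head>"
	            . "<body>$body</body></html>";

	    # Headers the prog interface expects, a blank line, then
	    # exactly Content-Length bytes of document (length() counts
	    # bytes here assuming single-byte data).
	    print "Path-Name: http://www.mydatabase.org/records.pl?ID=$id\n";
	    print "Content-Length: ", length($doc), "\n";
	    print "Document-Type: HTML*\n";
	    print "\n";
	    print $doc;
	}

	$dbh->disconnect;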

If you go to:
	http://www.swish-e.org/current/docs/SWISH-RUN.html
and look at the "prog - general purpose access method" section, there 
is an example script (in Perl) showing the key bits of the process.

The key is that the Path-Name header needs to be manipulated to lead 
back to an HTML-addressable representation of that record in the 
database, e.g.
	http://www.mydatabase.org/records.pl?ID=
where you finish the value with the record's unique $id.
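
The script on the other end can be nothing more than a small CGI that 
pulls the record back out and renders it. This records.pl is purely a 
hypothetical sketch, using the same made-up schema as above:

	#!/usr/bin/perl
	# records.pl -- hypothetical companion CGI that redisplays one
	# record when a search result's Path-Name URL is followed.
	use strict;
	use warnings;
	use CGI;
	use DBI;

	my $q  = CGI->new;
	my $id = $q->param('ID');

	my $dbh = DBI->connect( 'dbi:mysql:newsdb', 'user', 'password',
	                        { RaiseError => 1 } );
	my ( $headline, $body ) = $dbh->selectrow_array(
	    'SELECT headline, body FROM news WHERE id = ?', undef, $id );

	print $q->header('text/html');
	print "<html><head><title>$headline</title></head>",
	      "<body>$body</body></html>\n";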

In the latest version of the docs the relevance ranking logic is 
explained. Note that if you map a particular field or fields from the 
database into a <title>, matches there will be weighted more heavily. 
Similarly, you may choose to export two (or more) copies of specific 
fields as a crude way of weighting the relevance. Even better methods 
are anticipated.
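
To make those two tricks concrete: in the hypothetical NewsDB2.pl 
sketch above, the page-building step might become

	    # The headline goes into <title>, where swish-e ranks
	    # matches higher; a duplicated copy of a field raises its
	    # term frequency, crudely boosting its weight.
	    my $doc = "<html><head><title>$headline</title></head><body>"
	            . "$headline\n"
	            . "$body</body></html>";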

Others with more experience in this are welcome to chip in.

Walter Lewis
Halton Hills