Re: I wanna be like Mike....

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jan 23 2002 - 16:37:07 GMT
[It's one of those days.  Why does ^d in pine delete the next character,
yet ^d in Eudora deletes the entire message I had spent 1/2 hour on?  And
another thing: why does my laptop shut down a minute after it said I had
almost two hours left?  Argh.]

At 06:43 AM 01/23/02 -0800, Rich Thomas wrote:
>I'm trying to index a large library catalog and allow keyword searches on
>several million records.  I've been able to index a large part of it but my
>search results only gives me the file location and Relevancy and Size ie:
>
>E/E/G/0/819 University at Buffalo Libraries Web Catalog
>     Relevancy Score: 1000 Size of Document: 999 Bytes
>
>How do I get output that looks like a search on the swish-e discussion list?
>Such as Title..rank..and a defined amount of text describing the document?
>Is this something I configure in the swish.config file or is it something
>I'm not telling the swish-cgi.pl program to display?

That example above looks like it has both the rank (relevancy) and the title.

Let's see if I can remember what I had said before ^d:

First, before you spend too much time on the interface, I'd make sure
that swish will work for you.  Millions of files is probably at the high
end of what swish can handle.  Swish keeps data in RAM for every word
indexed, so the more files and the more words you have, the more memory you
will need.  You do NOT want to start swapping, as that will kill your
indexing speed and won't make your machine too happy either.

You can use -e to reduce memory requirements to some degree, at the expense
of disk access and indexing time.  You will need to test to see what works
best for you.  Try 100,000 files first.  Then 1/2 million.  (Some day -m
will work to limit the number while indexing!)
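
One way to run that kind of staged test is to copy a fixed number of
files into a scratch directory and index only that directory.  This is a
rough sketch, not part of swish-e: make_sample is a hypothetical helper,
and the paths in the usage comments are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): copy the first N regular files
# found under SRC into DEST so a sample of known size can be indexed.
# Assumes file names do not contain newlines.
make_sample() {
    src=$1; n=$2; dest=$3
    mkdir -p "$dest" || return 1
    i=0
    find "$src" -type f | sort | head -n "$n" | while IFS= read -r f; do
        i=$((i + 1))
        # Prefix with a counter so equal base names from different
        # subdirectories do not overwrite each other.
        cp "$f" "$dest/$i-$(basename "$f")"
    done
}

# Usage (paths are assumptions; -e is the option mentioned above):
#   make_sample /data/catalog 100000 /tmp/sample-100k
#   time ./swish-e -c swish.conf -i /tmp/sample-100k -e
```

Watching memory use and elapsed time at each sample size should show
whether the full set is realistic before you commit to it.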

Also, be very careful what you are indexing.  That is, make sure you index
only words that you know you will want to search.  For example, if you have
an ID or some unique name in each file that identifies that file, there's
no point in indexing that word/id.  No point indexing primary keys.  Also,
be very careful about indexing numbers that may be unique in each file
unless you know you will need to search for them.  I'll show how to test
what's indexed below.
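
To get a rough idea of how many such one-off words a sample contains
before indexing, something like the following could help.  It is only an
approximation: words_in_one_file is a hypothetical helper, and it splits
on non-alphanumeric characters, which will not match swish's own word
rules exactly.

```shell
#!/bin/sh
# Hypothetical helper: list words that occur in exactly one file under DIR.
# Such words (IDs, keys, unique numbers) are candidates to leave out of
# the index.  Assumes file names do not contain newlines.
words_in_one_file() {
    find "$1" -type f | while IFS= read -r f; do
        # One lowercase word per line, de-duplicated within this file.
        tr -cs '[:alnum:]' '\n' < "$f" | tr '[:upper:]' '[:lower:]' | sort -u
    done | sort | uniq -c | awk '$1 == 1 && NF == 2 { print $2 }'
}

# Usage (path is an assumption):
#   words_in_one_file /tmp/sample-100k | wc -l
```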

Back to your specific question about the summary.  perldoc swish.cgi says:

    Swish-e can store part or all of the contents of
    the documents as they are indexed, and this
    "document description" can be returned with search
    results.

       # Store the text of the documents within the swish index file
       StoreDescription HTML <body> 100000

    Adding the above to your `swish.conf' file tells
    swish-e to store up to 100,000 characters from the
    body of each document within the swish-e index.

Are you setting StoreDescription in your swish.conf file?  If so, here are
some debugging suggestions:

Keep in mind that swish-e and the swish.cgi (and Apache) are separate
things.  Trying to debug them all at once can sometimes be the hard way to
go.  I think it's best to first make sure that you are indexing what you
think you are indexing, and then worry about the CGI script.

I find it very useful to set up my swish.conf file and then index a single
file and see exactly what's indexed.  Use the -T trace option for this:

  ./swish-e -c swish.conf -i test.html -T indexed_words

will show exactly what words are indexed (and what metanames are used for
each word).

  ./swish-e -c swish.conf -i test.html -T properties

will show exactly what properties are being stored in the index (the .prop
file).  These properties will be available to the swish.cgi script for
display and for sorting.

That should confirm that the settings in your swish.conf file are
generating the index you think is being created.
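
A throwaway test case for that single-file check could be set up roughly
like this.  It is a sketch under assumptions: make_swish_testcase, the
/tmp path, the 200-character limit and the sample HTML are all made up
for illustration; only StoreDescription and the -c, -i and -T options
come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal config and a one-document source
# file into DIR, so the trace output is small enough to read in full.
make_swish_testcase() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/swish.conf" <<'EOF'
# Store up to 200 characters of the body as the document description
StoreDescription HTML <body> 200
EOF
    cat > "$dir/test.html" <<'EOF'
<html><head><title>Test record</title></head>
<body>University at Buffalo Libraries Web Catalog sample record.</body></html>
EOF
}

# Usage (assumes a swish-e binary in the current directory):
#   make_swish_testcase /tmp/swish-test
#   ./swish-e -c /tmp/swish-test/swish.conf -i /tmp/swish-test/test.html -T indexed_words
#   ./swish-e -c /tmp/swish-test/swish.conf -i /tmp/swish-test/test.html -T properties
```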

Now, in your case, with millions of documents, and if your source is in a
database, you might think about not using StoreDescription at all, and
instead modify the swish.cgi script to read the description from the
database.  That would save space in the .prop file if disk space is a
concern, although it probably is not.  If your description or summary is
more than 100 characters or so, then I'd build swish with zlib support.
When built with zlib, swish compresses the larger properties on disk, which
uses less disk space and makes i/o faster.

>Also... Why does Apache complain when I try to use the swish.cgi script
>included in the latest distribution that the following problem has occured:
>
>Can't locate Date/Calc.pm in @INC (@INC contains: modules
>/usr/perl5/5.00503/sun4-solaris /usr/perl5/5.00503

That's not Apache.  That's perl telling you that the Date::Calc module has
not been installed.  If you don't know how to quickly install modules then
I'd wait on that and disable that feature in the swish.cgi script.

Change:

  date_ranges     => {
      property_name   => 'swishlastmodified',   
      ...

to

  Xdate_ranges     => {
      property_name   => 'swishlastmodified',   
      ...
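
To check from the command line whether perl can find the module (before
or after changing the script), a quick test like this should do.
have_perl_module is a hypothetical wrapper; the perl -M switch itself is
standard.

```shell
#!/bin/sh
# Hypothetical wrapper: succeed if perl can load the named module.
# "perl -MFoo -e 1" loads Foo and then runs a trivial program; it fails
# with the same "Can't locate ..." error if the module is not in @INC.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if have_perl_module Date::Calc; then
    echo "Date::Calc is installed"
else
    # One common way to install it, assuming CPAN access:
    #   perl -MCPAN -e 'install Date::Calc'
    echo "Date::Calc is missing; install it or disable date_ranges"
fi
```

Run it as the same user and with the same perl that the web server uses,
since @INC can differ between perl installations.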

Finally, if you need more help read:

http://swish-e.org/2.2/docs/INSTALL.html#When_posting_please_provide_the_

You will probably get a better and faster answer if you are able to post
a very, very small example config file and a very small source file that
show the problem you are having, and cut-n-paste the output that's not
what you expect.

Thanks & good luck,


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Jan 23 16:38:06 2002