Re: SWISHE Perl module - index headers

From: Alex Lyons <ajlyons(at)not-real.sercoassurance.com>
Date: Fri Nov 02 2001 - 16:25:52 GMT

Bill,

Many thanks for adding IndexName to the set of headers accessible from 
the Perl module, and making them consistent.  I downloaded 
swish-e-2.1-dev-24-2001-11-02.tar.gz today and it seems to work fine 
(after I added -lz to Makefile.PL !).

> How does the perl module make it more portable?

Probably not a big issue on most modern platforms, but it avoids having 
to fork/exec the swish-e program: the Perl documentation describes how 
to do this while avoiding a shell, but I haven't tried it on anything 
other than Unix.
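
For anyone who does still need to call the swish-e binary, the "avoiding a shell" point can be sketched as below. This is only an illustration in Python rather than the Perl discussed in this thread; the helper names `build_argv` and `run_no_shell` are made up for the example, and the binary name `swish-e` on the search path is an assumption. The idea is to pass the arguments as a list so that no shell ever parses the user's query.

```python
import subprocess


def build_argv(query, index_files, swish_bin="swish-e"):
    """Build an argument vector: one list element per argument, no quoting needed."""
    # -w (search words) and -f (index file(s)) are swish-e command-line switches.
    return [swish_bin, "-w", query, "-f", *index_files]


def run_no_shell(argv):
    """Run argv directly (fork/exec style); the arguments are never seen by a shell."""
    return subprocess.run(argv, capture_output=True, text=True, check=False)
```

A call such as `run_no_shell(build_argv("foo AND bar", ["site.index"]))` would then hand the query to the program as a single argument, whatever shell metacharacters it contains.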

> I do hope you validate the path.

Hmm... What validation does SwishOpen do?  Surely it doesn't allow a 
shell to see the index file name?  I had a simple -r check when I was 
only allowing a single index, but I took it out in preparation for 
allowing multiple indexes like index=file1+file2.  Perhaps I'll put it 
back.
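
A per-file check for the multi-index form might look roughly like this. It is a sketch in Python, not the Perl script in question; the function name `readable_indexes` is hypothetical, and it assumes the parameter arrives as a literal `file1+file2` string. Each name is tested in the spirit of the -r check, and a real script would probably also want to restrict the names to a known directory.

```python
import os


def readable_indexes(param):
    """Split 'file1+file2' and insist that every part is a readable regular file."""
    names = param.split("+")
    for name in names:
        if not name:
            raise ValueError("empty index name")
        # Roughly what a -r test does, plus a check that it is a plain file.
        if not (os.path.isfile(name) and os.access(name, os.R_OK)):
            raise ValueError(f"index not readable: {name!r}")
    return names
```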

Some other comments:

The TXT2 parser couldn't cope with empty files returned using the "prog" 
method: my Perl spider returns empty files (actually Content-Length: 1 
containing a single newline) if No-Contents: 1 is set. I had to revert 
to TXT in this case.  The error showed up as a broken pipe, presumably 
caused by swish-e aborting.

It would be useful if the "prog" spider could tell swish-e what parser 
(TXT, HTML, XML, TXT2, etc.) to use for each file sent: then I wouldn't 
need all that IndexContents stuff in my conf file, and sometimes there 
is no filename suffix anyway (e.g. spider-generated directory indexes 
don't end in ".html").  How about adding a "Swish-Parser:" header (or 
using the standard MIME "Content-Type:" if you plan to eventually remove 
the distinction between TXT and TXT2, etc., by moving completely to the 
libxml2 parser)?
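
To make the suggestion concrete, here is a sketch (in Python, for illustration only) of what a "prog" spider might write for one document if such a header existed. "Swish-Parser:" is purely the proposal made in this message, not an existing swish-e feature; the function name `format_prog_document` is made up, and the Path-Name/Content-Length layout followed by a blank line is my understanding of the "prog" input format.

```python
def format_prog_document(path_name, content, parser=None):
    """Return the bytes a 'prog' spider would write for one document."""
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(content)}",  # length of the body in bytes
    ]
    if parser is not None:
        # Hypothetical header proposed in this message; swish-e does not define it.
        headers.append(f"Swish-Parser: {parser}")
    # A blank line separates the headers from the document body.
    head = "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + content
```

A spider-generated directory index with no ".html" suffix could then be sent as `format_prog_document("/docs/", body, parser="HTML")`, with no need for a suffix-based rule in the conf file.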

The introduction of the libxml2 parser has increased the size of the 
executable and/or the Perl DLL (.so).  I eventually did as suggested and 
compiled swish-e twice, with and without libxml2, but what a hassle!  I 
would suggest that the time has probably come to split swish-e into an 
"indexer" and a smaller "searcher" that doesn't need all the parsing 
stuff.  In fact, the indexer probably doesn't need the built-in 
directory/web crawling facilities now that you have the "prog" method 
and a range of Perl spiders that seem to do the job.

Hope these comments help.

Alex Lyons.



Received on Fri Nov 2 16:27:23 2001