Found and fixed a small bug in SWISH-E 1.3.1

From: Steve van der Burg <steve.vanderburg(at)not-real.LHSC.ON.CA>
Date: Mon Feb 01 1999 - 16:01:07 GMT
I've been using SWISH-E for a few weeks now, and I'm getting
ready to cut over from Harvest to it for production searches at
my site.  During some final testing, I found that swish-e
wasn't able to grab the title from some HTML documents.

For example, a document that starts like this (this is from my site):

   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
   <TITLE>Cardiac Care(LHSC)</TITLE>

...ends up with no title in the swish-e index that I build.

I checked the index and confirmed that the document was in it
but that its title was missing, so I went looking through the
swish-e source to see what was happening.  I found that the
content-type test in http.c was failing, and parsetitle()
wasn't getting called.

It seems that when the swish spider retrieves the document
from my web server (Apache 1.3.4, platform for it and swish-e
is Solaris 2.6 (Sparc), Perl 5.004_04), the spider's
<filename>.response file looks like this:

text/html, text/html; charset=iso-8859-1

Note the long content type (the first "text/html" is returned by
the server, and all the rest comes from the document's META
HTTP-EQUIV header).  The call to get() was returning the long
content-type line above, and the strcmp in http.c was failing
because it checks against plain old "text/html", so parsetitle()
was never getting called.

The fix is simply to check the start of content-type for
"text/html".  Here's a diff for http.c:

<                       if (strcmp(contenttype, "text/html") == 0) {
>                       if (strncmp(contenttype, "text/html", 9) == 0) {
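
For anyone who wants to try the comparison outside of swish-e, here is a minimal standalone sketch of the same prefix test. The helper name is_html_content_type() is hypothetical and does not exist in the SWISH-E source. It only illustrates how a 9-character strncmp accepts both the plain and the long content-type values shown above.

```c
#include <string.h>

/* Hypothetical helper, not part of the SWISH-E source: returns nonzero
 * when the content-type value begins with "text/html". */
int is_html_content_type(const char *contenttype)
{
    static const char prefix[] = "text/html";

    if (contenttype == NULL)
        return 0;

    /* sizeof prefix - 1 is 9: the prefix length without the trailing NUL. */
    return strncmp(contenttype, prefix, sizeof prefix - 1) == 0;
}
```

With this test, both "text/html" and "text/html, text/html; charset=iso-8859-1" are accepted, while "text/plain" is rejected. Note that a pure prefix match would also accept a value such as "text/htmlx", so a stricter check might look at the character that follows the prefix as well.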

Steve van der Burg
Technical Analyst, Information Services
London Health Sciences Centre
London, Ontario, Canada
Tel:  +1 519 685-8300 x 35559
Received on Mon Feb 1 07:57:00 1999