Re: win2k unknown header problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 26 2002 - 18:45:58 GMT
At 09:46 AM 09/26/02 -0700, Matt Kynaston wrote:
>I've narrowed the problem down to the "No-Content: 1" header - if I remove:
>	<meta name="robots" content="nocontents">
>from the first file being spidered (listed below), swish-e is happy.

Yep, was failing to flush the buffer.

It broke in the last update, when I made HTML2 the default parser whenever
no parser is specified and libxml2 is linked in.  As a result, the code that
deals with NoContents didn't know to flush the input buffer.

I always wonder how useful NoContents is.  Is indexing the path name when
there's no <title> really very helpful?

We will try to get a windows binary out soon with this patch.  It will be
in 2.2.1.

For the short term you can edit spider.pl.  Look for:

    $headers .= "No-Contents: 1\n" if $server->{no_contents};
    print "$headers\n$$content";

and maybe try something like commenting out the No-Contents: header.

    #$headers .= "No-Contents: 1\n" if $server->{no_contents};
    print "$headers\n$$content";

But then the doc's contents will be indexed.

Or, if you want to emulate the process, add this code a bit higher up than
the above:

    $server->{counts}{'Total Docs'}++;

    # add this code
    if ( $server->{no_contents} ) {
        my $title = $response->title || '';
        $$content = "<title>$title</title>";
    }

That last code is probably better than setting No-Contents and sending the
entire doc just to be thrown away.  Oh well.



Index: src/index.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/index.c,v
retrieving revision 1.198.2.1
diff -u -r1.198.2.1 index.c
--- src/index.c 24 Sep 2002 20:41:33 -0000      1.198.2.1
+++ src/index.c 26 Sep 2002 18:00:00 -0000
@@ -479,7 +479,7 @@
 
 
 #ifdef HAVE_LIBXML2
-    if (fprop->doctype == HTML2)
+    if (fprop->doctype == HTML2 || !fprop->doctype)
         return parse_HTML( sw, fprop, fi, buffer );
 #endif
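
To illustrate what the one-line change does, here is a minimal sketch. It is
not the swish-e source: the enum and the function name are hypothetical, and
the only thing assumed from the patch is that an unspecified doctype is zero,
which is what `!fprop->doctype` tests for.

```c
/* Hypothetical stand-ins for swish-e's document types.  The real
 * definitions live in the swish-e sources and may differ; the only
 * property assumed here is that "not specified" is zero. */
enum doc_type { DOC_UNSET = 0, DOC_TXT, DOC_HTML, DOC_XML, DOC_HTML2 };

/* Returns 1 when the libxml2-based HTML parser would be chosen.
 * 'patched' selects between the condition before and after the diff. */
int uses_html2_parser(enum doc_type doctype, int patched)
{
    if (patched)
        return doctype == DOC_HTML2 || !doctype;  /* after the patch  */
    return doctype == DOC_HTML2;                  /* before the patch */
}
```

With the old condition a document whose doctype was never set falls through
this check; with the new condition it is treated the same as an explicit
HTML2 document.  Other explicit types are unaffected.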



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu Sep 26 18:50:50 2002