I ran into a problem indexing a site that looked perfectly fine. The
HEAD section of the file(s) in question contained the META tag
<META HTTP-EQUIV="content-type" CONTENT="text/html;">
libwww concatenates multiple headers that have the same name, and
the http server already sends this header in the form of
Content-type: text/html
so the value returned by $response->header("content-type")
in 'swishspider' ends up containing "text/html, text/html".
This fails the test on line 50
if( $response->header("content-type") eq "text/html" ) {
and as a result links on the page are not followed.
Changing the Perl script to read:
50c50
< if( $response->header("content-type") eq "text/html" ) {
---
> if( $response->header("content-type") =~ m|^text/html| ) {
solves the problem by matching any value that begins with
'text/html'.
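
As a rough illustration of why the prefix match helps, here is a small
standalone sketch. It does not use libwww or swishspider itself: the
helper name is_html and the sample header values are assumptions made
up for the example. It compares the original exact test with the
prefix test for a few possible content-type values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swishspider): true when a
# content-type value begins with text/html, whatever follows it.
sub is_html {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m|^text/html| ? 1 : 0;
}

# Sample values: the server header alone, the concatenated form
# described above, a value with a charset parameter, and a non-HTML type.
my @samples = (
    "text/html",
    "text/html, text/html",
    "text/html; charset=iso-8859-1",
    "image/gif",
);

for my $value (@samples) {
    my $exact  = $value eq "text/html" ? "yes" : "no";
    my $prefix = is_html($value)       ? "yes" : "no";
    printf "%-32s exact=%s prefix=%s\n", $value, $exact, $prefix;
}
```

Note that the prefix test is deliberately loose: it accepts anything
starting with 'text/html', including values with parameters after it.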
Michael
michael@bizsystems.com
BTW, I'd still like to know how to enable FileRules while spidering.
This would be most helpful in eliminating useless index information on
very large sites that have specific-interest archives.
Received on Wed Jun 16 11:36:43 1999