I've been trying to track down some weird errors that we're getting from using
spider.pl. It will be humming along ok and then I'll get an error like this:
Warning: Unknown header line: 'th-Name:
Swish-e can't recover, and I'm left with no index even though it had already
indexed lots of content by that point. So it seems that the file output by the
spider isn't completely correct. My guess is either that the Content-Length is
off (unlikely, since it comes from the server and all the content does make it
into the spider's output file) or that swish-e is encountering multi-byte
characters in the output and miscounting them. Either way, it fails to find the
true end of the document and so misses the headers of the next one.
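
One way to test this guess would be to walk the spider's saved output and check that each record's body ends exactly where its Content-Length says it should. The sketch below is only an illustration and is not part of swish-e or spider.pl. It assumes that every record starts with a Path-Name: header, that headers end at the first blank line, and that Content-Length counts bytes. The function name find_length_mismatches is made up for this example.

```python
def find_length_mismatches(data: bytes):
    """Return a list of (offset, detail) for records that do not line up.

    Assumptions (not confirmed by the post): each record begins with a
    "Path-Name:" header, headers end at the first blank line, and
    Content-Length is a byte count.
    """
    problems = []
    pos = 0
    while pos < len(data):
        # After skipping the previous body we should be at a new header block.
        if not data.startswith(b"Path-Name:", pos):
            problems.append((pos, data[pos:pos + 20]))
            break
        end_headers = data.find(b"\n\n", pos)
        if end_headers == -1:
            problems.append((pos, b"no blank line after headers"))
            break
        headers = {}
        for line in data[pos:end_headers].split(b"\n"):
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        declared = headers.get(b"content-length")
        if declared is None or not declared.isdigit():
            problems.append((pos, b"missing or bad Content-Length"))
            break
        # Jump over the body using the declared length, as a reader would.
        pos = end_headers + 2 + int(declared)
        if pos > len(data):
            problems.append((pos, b"body shorter than Content-Length"))
    return problems


# A record whose length was counted in characters instead of bytes:
body = "caf\u00e9".encode("utf-8")            # 4 characters, 5 bytes
bad = (b"Path-Name: a\nContent-Length: 4\n\n" + body +
       b"Path-Name: b\nContent-Length: 2\n\nhi")
print(find_length_mismatches(bad))
```

If a check like this reports an offset that lands in the middle of a body or a header, the declared length and the actual byte count disagree for the preceding document, which would fit a truncated header such as 'th-Name: showing up in the warning.
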
Am I right? If so, how can I fix this? When I use swish-e to index a filesystem
of HTML docs that contain UTF-8, I use a FileFilter that converts the UTF-8
characters into HTML entities. Can I do something similar with the spider?
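
For reference, the kind of conversion I mean looks roughly like the sketch below. It is written in Python only to show the idea and is not the actual filter. The function name utf8_to_entities is made up, and the input is assumed to be valid UTF-8. The result is pure ASCII, so its character count and byte count are the same.

```python
def utf8_to_entities(raw: bytes) -> bytes:
    """Replace every non-ASCII character with a numeric HTML entity.

    Assumes `raw` is valid UTF-8; decode() raises UnicodeDecodeError otherwise.
    """
    text = raw.decode("utf-8")
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;
    return text.encode("ascii", "xmlcharrefreplace")


converted = utf8_to_entities("caf\u00e9".encode("utf-8"))
print(converted)       # b'caf&#233;'
print(len(converted))  # 9
```

Presumably a step like this would have to run before the Content-Length for the document is computed, since the conversion changes the size of the content.
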
Plus Three, LP
Received on Tue Mar 24 12:25:13 2009