Re: [swish-e] Searching remote mail archive problem

From: Bill Moseley <moseley(at)>
Date: Wed Mar 05 2008 - 14:11:42 GMT
On Wed, Mar 05, 2008 at 08:03:06PM +0800, Tian Xinchun wrote:
> Hi Peter,
> I am sorry, but I cannot quite understand what you mean. Take this example:
> $swish-e -c swish.conf -S prog
> Indexing Data Source: "External-Program"
> Indexing ""
> External Program found: /usr/local/lib/swish-e/
> /usr/local/lib/swish-e/ Reading parameters from 'spider.conf'
> error:
> htmlParseStartTag: invalid element name
> <?xml version="1.0" encoding="ISO-8859-1"?>
>  ^
> error: Misplaced
> DOCTYPE declaration
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> ^

You have two errors.  That first one above is simply saying you are
trying to index an xml document with Libxml's *html* parser.
So you need to use the XML* parser type.

> Warning: Unknown header line: 'ive/author.html' from program
> err: External program failed to return required headers Path-Name:

What versions of swish and of the external program are you using?
You can open the program in an editor and find:

$VERSION = sprintf '%d.%02d', q$Revision: 1900 $ =~ /: (\d+)\.(\d+)/;

The way -S prog works is that each file sent to swish has a byte count
in the -S prog header.  This is the size of the document in bytes.
Once swish finds the blank line that indicates the end of the -S prog
header (which defines the filename, length, and possibly date and
parser type) it will read in the document in chunks until it reads in
that byte count.
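
The reading loop described above can be sketched roughly as follows. This is
a minimal Python illustration, not swish's actual code: the function name
read_records is hypothetical, and it reads each document in one call rather
than in chunks. The header names Path-Name and Content-Length are the ones
the -S prog format uses.

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a binary stream laid out like -S prog input.

    Hypothetical helper for illustration only; swish itself is not implemented this way.
    """
    while True:
        headers = {}
        line = stream.readline()
        if not line:
            return  # clean end of input
        # Collect header lines until the blank line that ends the header.
        while line.strip():
            name, sep, value = line.decode("ascii", "replace").partition(":")
            if not sep:
                raise ValueError(f"Unknown header line: {line!r}")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        if "Path-Name" not in headers or "Content-Length" not in headers:
            raise ValueError("missing required headers")
        # Read exactly the number of bytes the header promised.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


good = (b"Path-Name: a.html\nContent-Length: 5\n\nhello"
        b"Path-Name: b.html\nContent-Length: 2\n\nhi")
for headers, body in read_records(io.BytesIO(good)):
    print(headers["Path-Name"], len(body))
```

If a Content-Length is too small, the leftover bytes of that document are
taken as the start of the next header, so the reader loses sync with the
stream and the following header fails to parse.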

When you get that "Unknown header line" it means that the byte count
for a document was wrong.  In this case that typically means the external
program is reporting an incorrect count of bytes in the file, and
that has been due to wide characters in the byte string.

As far as I know, that's a problem with -- because,
regardless of the file's encoding (and even if reported incorrectly)
it should be able to convert the character string into a byte string
and tell you the correct length.
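
As an illustration of that point, here is a small Python sketch (not the
Perl code under discussion) of a producer that encodes the text to bytes
first and only then measures the length. The function name format_record
and its parameters are made up for this example, and UTF-8 is an assumed
output encoding. Document-Type is the optional header that selects the
parser type.

```python
def format_record(path, text, parser_type="HTML*", encoding="utf-8"):
    """Build one -S prog style record: header, blank line, then the document bytes.

    Hypothetical helper; the encoding default is an assumption for the example.
    """
    body = text.encode(encoding)  # convert the character string to a byte string first
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Document-Type: {parser_type}\n"
        "\n"
    )
    return header.encode(encoding) + body


text = "caf\u00e9"  # 4 characters, but 5 bytes once encoded as UTF-8
record = format_record("doc.xml", text, parser_type="XML*")
print(len(text), len(text.encode("utf-8")))
```

Using len(text) for the header here would undercount by one byte, which is
the kind of mismatch that leads to the "Unknown header line" warning.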

Bill Moseley

Users mailing list
Received on Wed Mar 5 09:11:44 2008