Re: [swish-e] Searching remote mail archive problem

From: Tian Xinchun <tianxc(at)not-real.ihep.ac.cn>
Date: Thu Mar 06 2008 - 08:16:08 GMT
Hi Bill,

Thanks for your help. See below.

> ------------------------------
> 
> Message: 6
> Date: Wed, 5 Mar 2008 06:11:42 -0800
> From: Bill Moseley <moseley@hank.org>
> Subject: Re: [swish-e] Searching remote mail archive problem
> To: Swish-e Users Discussion List <users@lists.swish-e.org>
> Message-ID: <20080305141142.GA6428@hank.org>
> Content-Type: text/plain; charset=utf-8
> 
> On Wed, Mar 05, 2008 at 08:03:06PM +0800, Tian Xinchun wrote:
> > Hi Peter?
> > 
> > I am sorry, but I cannot quite understand what you mean. Take an example:
> > 
> > $swish-e -c swish.conf -S prog
> > Indexing Data Source: "External-Program"
> > Indexing "spider.pl"
> > External Program found: /usr/local/lib/swish-e/spider.pl
> > /usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
> > https://www.lbl.gov/lists.archives/theta13-eng.archive/:1: error:
> > htmlParseStartTag: invalid element name
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> >  ^
> > https://www.lbl.gov/lists.archives/theta13-eng.archive/:2: error: Misplaced
> > DOCTYPE declaration
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> > ^
> 
> You have two errors.  That first one above is simply saying you are
> trying to index an xml document with Libxml's *html* parser.
> So you need to use the XML* parser type.
>

Actually, I have tried using XML*, but I still got the same error messages.

> > Warning: Unknown header line: 'ive/author.html' from program spider.pl
> > err: External program failed to return required headers Path-Name:
> 
> What version of swish and spider.pl are you using?
> You can look at spider.pl in an editor and find:
> 
> $VERSION = sprintf '%d.%02d', q$Revision: 1900 $ =~ /: (\d+)\.(\d+)/;

  SWISH-E: 2.4.5
  spider.pl: 1.26

> 
> The way -S prog works is that each file sent to swish has a byte 
> count in the -S prog header.  This is the size of the document in bytes.
> Once swish finds the blank line that indicates the end of the -S prog
> header (which defines the filename, length, and possibly date and
> parser type) it will read in the document in chunks until it reads in
> that byte count.
> 
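
To make the framing described above concrete, here is a minimal sketch in Python of how a feeder program might write one -S prog record. It is not spider.pl's actual code. The function name and arguments are made up for illustration, and the header names other than Path-Name (which appears in the error message above) are taken from my reading of the Swish-e docs, so please check them against your version.

```python
def format_prog_record(path, body, doc_type=None, mtime=None):
    """Build one -S prog record: header lines, a blank line, then the document.

    Hypothetical helper for illustration only. `body` must already be bytes,
    so that len(body) is a byte count and not a character count.
    """
    if isinstance(body, str):
        raise TypeError("body must be bytes so that len() counts bytes")

    headers = [
        f"Path-Name: {path}",
        f"Content-Length: {len(body)}",  # size of the document in bytes
    ]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")      # optional date
    if doc_type is not None:
        headers.append(f"Document-Type: {doc_type}")     # optional parser type

    # The blank line marks the end of the header; the document follows it.
    return ("\n".join(headers) + "\n\n").encode("ascii") + body


record = format_prog_record("http://example.org/a.html", b"<html></html>",
                            doc_type="HTML*")
print(record.decode("ascii"))
```

The reader on the other side trusts Content-Length completely: it reads exactly that many bytes and then expects the next header line.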
> When you get that "Unknown header line" it means that the byte count
> for a document was wrong.  This typically means that, in this case,
> spider.pl is reporting an incorrect count of bytes in the file -- and
> that has been due to wide characters in the byte string.
> 
> As far as I know, that's a problem with spider.pl -- because,
> regardless of the file's encoding (and even if reported incorrectly)
> it should be able to convert the character string into a byte string
> and tell you the correct length.
>
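
The failure mode described above can be reproduced with a small sketch in Python. This only simulates the framing as explained in this thread; it is not swish-e's or spider.pl's real code, and `make_record` and `read_records` are hypothetical names. When the length is counted in characters instead of bytes, the reader stops too early and the tail of the document is parsed as a header line.

```python
import io


def make_record(path, text, count_bytes=True):
    """Frame `text` as one record; optionally (wrongly) count characters."""
    body = text.encode("utf-8")
    length = len(body) if count_bytes else len(text)
    header = f"Path-Name: {path}\nContent-Length: {length}\n\n"
    return header.encode("ascii") + body


def read_records(stream):
    """Yield (headers, body) pairs; raise ValueError on a malformed header."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input
        headers = {}
        while line not in (b"\n", b""):
            name, sep, value = line.decode("utf-8", "replace").partition(":")
            if not sep:
                raise ValueError(f"Unknown header line: {line!r}")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        # Read exactly as many bytes as the header promised.
        yield headers, stream.read(int(headers["Content-Length"]))


text = "<p>caf\u00e9 \u4e2d\u6587</p>\n"
print(len(text), len(text.encode("utf-8")))  # 15 characters, 20 bytes

good = make_record("a.html", text, count_bytes=True)
bad = make_record("a.html", text, count_bytes=False)

print(list(read_records(io.BytesIO(good))))
try:
    list(read_records(io.BytesIO(bad)))
except ValueError as err:
    print(err)  # the leftover tail of the document is taken for a header
```

The fix on the sending side is to encode the document to bytes first and take the length of the encoded bytes.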

Thanks for the information. Is there any plan to fix it?

Best Regards,

Carl

> -- 
> Bill Moseley
> moseley@hank.org
> 
> Unsubscribe from or help with the swish-e list: 
>    http://swish-e.org/Discussion/
> 
> Help with Swish-e:
>    http://swish-e.org/current/docs
> 
> ------------------------------
> 
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
> 
> End of Users Digest, Vol 15, Issue 3
> ************************************

====================================================         
                     Dr. Xinchun Tian
Room A601, Mobile: 13426390768
Experimental Physics Center, IHEP, CAS
Beijing, 100049
Homepage: http://viviseayu.bb.iyaya.com/index.php
====================================================
Received on Thu Mar 6 03:17:07 2008