Re: input conversion failed

From: J Robinson <jrobinson852(at)not-real.yahoo.com>
Date: Fri Nov 14 2003 - 14:02:01 GMT

Hello All,

Just wanted to let people know that I tracked this bug
down to being related to how I handle the data before
feeding it to SWISH-E. 

Apparently some of the data is not getting to SWISH-E
as intended. I think the problem is not with SWISH-E
but with a library I'm using for my upstream
processing. I'll post back a summary when I figure out
exactly what went wrong, if it seems relevant.

Best,
  jrobinson

--- J Robinson <jrobinson852@yahoo.com> wrote:
> Hello Bill and everyone,
> 
> --- moseley@hank.org wrote:
> > > > > http://www.gnu.org/testimonials/testimonials.ca.html
> > > > > input conversion failed due to input error
> > > > > Bytes: 0xC4 0x3C 0x2F 0x41
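
A note on the four bytes in the error above: the short Python sketch below (Python is used only for illustration; it is not part of anyone's indexing setup in this thread) decodes them two ways. It suggests the bytes are ordinary ISO-8859-1 text, an "Ä" followed by the start of a closing tag, and that they are not valid UTF-8. Whether the parser was actually attempting a UTF-8 read at that point is an assumption.

```python
# The four bytes reported in the error message.
raw = bytes([0xC4, 0x3C, 0x2F, 0x41])

# Read as ISO-8859-1, every byte maps to exactly one character.
as_latin1 = raw.decode("latin-1")

# Read as UTF-8, 0xC4 announces a two-byte character, so the next byte
# must be a continuation byte (0x80-0xBF). 0x3C ('<') is not one.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(as_latin1))  # 'Ä</A'
print(valid_utf8)       # False
```
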
> > > > 
> > > > Ok, how are you indexing?
> > > 
> > > -S prog method. The prog is in perl.
> > 
> > If you try the way I have it below do you also get
> > the error?
> > 
> > > > moseley@bumby:~$ wget http://www.gnu.org/testimonials/testimonials.ca.html 2>/dev/null
> > > > moseley@bumby:~$ swish-e -i testimonials.ca.html -v0
> 
> 
> Interestingly, I don't get the error then (I'm using
> tcsh):
> 
> [/tmp]% wget http://www.gnu.org/testimonials/testimonials.ca.html >& /dev/null
> [/tmp]% ls  testimonials.ca.html 
> testimonials.ca.html
> [/tmp]% swish-e -i testimonials.ca.html -v0
> (no output).
> 
> Same results with
> http://www.openbsd.com/ko/donations.html 
> 
> > > Which distribution and version of Linux are you using?
> > 
> > I tried it on two Debian Sid machines (2.4.21,
> > libxml2 2.5.11)
> > and a Debian Woody machine (2.4.20, libxml2 2.4.19).
> > 
> > In your -S prog are you using any regular
> > expressions on the content?
> > Or decoding any HTML entities?  
> 
> No, and no. It just gets the data out of a database,
> wraps it in appropriate headers, and pipes it to
> swish-e. Or at least I don't ask it to do any
> conversions or regexes on the content! :)
> 
> I'll email you the relevant scripts offline for your
> testing.
> 
> > My before-coffee guess is that Perl is making some
> > conversion.  I had an interesting problem once where
> > I was using Perl to split up some text.  IIRC, I had
> > HTML entities that were forcing Perl into UTF-8 mode,
> > but the split I was using ended up splitting the text
> > right in the middle of a multi-byte UTF-8 character.
> > Then I was ending up with broken characters.
> > 
> >   http://swish-e.org/archive/5049.html
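
The failure mode described above, a split landing inside a multi-byte UTF-8 character, can be reproduced in a few lines. This is a minimal Python sketch, not the Perl code from the linked message; the sample word and the helper name `safe_cut` are made up for illustration.

```python
text = "català"              # 'à' takes two bytes in UTF-8: 0xC3 0xA0
data = text.encode("utf-8")  # 7 bytes for 6 characters


def safe_cut(data: bytes, cut: int) -> int:
    """Move cut left until it no longer points at a continuation byte."""
    while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut


# A naive cut at byte offset 6 leaves a lone lead byte (0xC3) at the end.
naive = data[:6]
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Backing up to a character boundary keeps both halves decodable.
cut = safe_cut(data, 6)
head = data[:cut].decode("utf-8")
tail = data[cut:].decode("utf-8")

print(naive_ok)    # False
print(head, tail)  # catal à
```

Splitting decoded text by characters instead of splitting raw bytes avoids the problem altogether; the boundary check is only needed when the code has to work on bytes.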
> 
> Sounds reasonable. Perhaps perl is doing something
> 'bad'. I'm using perl 5.6.1.
> 
> > Is your Perl script something I can try on my
> > machines?  Or perhaps you 
> > can create a small test case?
> 
> We'll send you this offlist.
>   
> > > Let me know if you want more data points and I'll
> > > get them for you. For example, I can try building
> > > the index on a RH7.2 machine (it currently has
> > > libxml2 2.4.19 installed) or with another libxml2
> > > version.
> > 
> > I really need to spend more time thinking about
> > character encodings.  For example, I'm not clear
> > if/how to get libxml2 to say what encoding it has
> > determined the source doc to be in.  Might be helpful
> > to see what encoding it thinks your Perl program is
> > generating (even though it says 8859-1 in the <head>).
> > Another pre-coffee thought is maybe Perl is converting
> > something into utf-8 but libxml2 is expecting 8859-1
> > from the charset setting.
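
One way to test the mismatch theory above is to look at the bytes the program emits before they reach the indexer. The Python sketch below is a rough heuristic, not a libxml2 feature, and the function name `looks_like_utf8` is hypothetical: non-ASCII data that happens to decode cleanly as UTF-8 is very likely UTF-8, even if the document header says 8859-1.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data has non-ASCII bytes and is valid UTF-8."""
    if data.isascii():
        # Pure ASCII is identical in both encodings, so it tells us nothing.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "<A>Ä</A>"
print(looks_like_utf8(sample.encode("utf-8")))    # True
print(looks_like_utf8(sample.encode("latin-1")))  # False
print(looks_like_utf8(b"<A>plain</A>"))           # False
```

A document that declares 8859-1 but returns True here was probably converted to UTF-8 somewhere upstream. The check can be fooled by short 8859-1 strings that happen to form valid UTF-8 sequences, so it is a hint rather than proof.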
> > 
> > Please post back your findings.
> > 
> > Thanks,
> > -- 
> > Bill Moseley
> > moseley@hank.org
> > 
> 
> Thanks for your help debugging this, Bill.
> 


Received on Fri Nov 14 14:02:09 2003