Re: XML2 parser error?

From: Jonas Wolf <JOWOLF(at)not-real.uk.ibm.com>
Date: Fri Jul 30 2004 - 09:19:53 GMT
First of all - yes, libxml2 aborts processing, and thus only the part of 
the file preceding the "strange" character is indexed, which is my 
problem.

Second of all, I have a little example.

I'm using the prog option to run through html_entities, and here's the 
code for index.pl:

#!/usr/bin/perl -w

use strict;
use HTML::Entities ();

# Slurp mode: read the whole file in one go.
$/ = undef;

open (FH, "test.txt") or die $!;
my $contents = <FH>;
close (FH);

# Entity-encode the contents and wrap them in a root element.
$contents = "<xml>\n" . HTML::Entities::encode_entities($contents) .
"\n</xml>";
my $size = length $contents;

# Headers for swish-e's prog input, then a blank line, then the document.
print "Path-Name: test.txt\nContent-Length: $size\nDocument-Type: XML*\n\n".$contents;

1;

Then, there's my little test file test.txt which contains some ASCII 
control characters (I'm not sure whether this will be sent properly, but 
if not, just create a file with some ASCII characters below 32):

          ????    "  R                     ?
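
In case the control characters above did not survive the mail, here is a minimal sketch for creating such a test file. The exact bytes are an assumption, loosely modelled on the character references that show up in the parser error further down. Any characters below 32 should do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample content: a few control characters (below 32)
# mixed with ordinary text. The exact bytes are illustrative only.
my $sample = "\x02 \x08 \"\x02 R\x03 \x01\n";

# Write the bytes unchanged to test.txt.
open(my $fh, '>', 'test.txt') or die "Cannot write test.txt: $!";
binmode($fh);
print {$fh} $sample;
close($fh) or die "Cannot close test.txt: $!";
```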

Then, to get libxml2 error messages, we need an index.cfg (is there any way 
to do this from the command line?):
ParserWarnLevel 3

Then, run swish-e:
# perl -w index.pl | swish-e -S prog -i stdin -v 3 -c index.cfg
Parsing config file 'index.cfg'
Indexing Data Source: "External-Program"
Indexing "stdin"
test.txt - Using XML2 parser - test.txt:2: error: xmlParseCharRef: invalid xmlChar value 2
&#2;       &#8;   ????    &quot;&#2;  R&#3;              &#1;       ?
    ^
 (no words indexed)

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.

As you can see, the &#n; sequences with n<32 break libxml2, and rightly 
so. HTML::Entities should not generate these codes, as they are not valid 
in HTML or XML.
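
One possible workaround, sketched here as an assumption rather than a tested fix, is to drop the offending characters before the text is entity-encoded. XML 1.0 permits only tab, line feed and carriage return below 32, so the hypothetical helper strip_xml_illegal() below deletes every other control character in that range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Entities or swish-e):
# delete the control characters below 0x20 that XML 1.0 forbids,
# keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
sub strip_xml_illegal {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}

# Example: the control characters vanish, ordinary text survives.
my $cleaned = strip_xml_illegal("abc\x02def\x08 R\x03\n");
print $cleaned;
```

In index.pl this would be applied to $contents before the call to HTML::Entities::encode_entities(), so that no &#n; references for these characters are ever produced.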

Jonas
Received on Fri Jul 30 02:20:18 2004