First of all - yes, libxml2 aborts processing, and thus only the part of
the file preceding the "strange" character is indexed, which is my
problem.
Second of all, I have a little example.
I'm using the prog input method (-S prog) to run the file through
HTML::Entities, and here's the code for index.pl:
#!/usr/bin/perl -w
use strict;
use HTML::Entities();
$/ = undef;
open (FH, "test.txt") or die $!;
my $contents = <FH>;
close (FH);
my $title = $_;
my $type;
$contents = "<xml>\n" . HTML::Entities::encode_entities($contents) .
"\n</xml>";
my $size = length $contents;
print "Path-Name: test.txt\nContent-Length: $size\n"
    . "Document-Type: XML*\n\n" . $contents;
1;
Then, there's my little test file test.txt which contains some ASCII
control characters (I'm not sure whether this will be sent properly, but
if not, just create a file with some ASCII characters below 32):
???? " R ?
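In case the control characters did not survive the mail, a comparable file
can be generated with a few lines of Perl. This is only a minimal sketch:
the bytes chosen here are arbitrary (not necessarily the ones in my
original file), and make_test_bytes is a made-up helper name.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: a few bytes below 32 mixed with printable text.
# The byte values are arbitrary; any control character other than
# tab (9), LF (10) or CR (13) should trigger the same kind of error.
sub make_test_bytes {
    return join '', map { chr } (1, 2, 3, 4, 32, 34, 32, 82, 32, 5);
}

# Write the bytes unchanged (binmode avoids any newline translation).
open(my $fh, '>', 'test.txt') or die $!;
binmode $fh;
print $fh make_test_bytes(), "\n";
close($fh) or die $!;
```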
Then, to get libxml2 error messages, we need an index.cfg (is there any
way to do this from the command line?):
ParserWarnLevel 3
Then, run swish-e:
# perl -w index.pl | swish-e -S prog -i stdin -v 3 -c index.cfg
Parsing config file 'index.cfg'
Indexing Data Source: "External-Program"
Indexing "stdin"
test.txt - Using XML2 parser - test.txt:2: error: xmlParseCharRef: invalid
xmlChar value 2
  ???? " R  ?
^
(no words indexed)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
As you can see, the &#n; sequences with n<32 (other than 9, 10 and 13,
i.e. tab, line feed and carriage return) break libxml2, and rightly so.
HTML::Entities should not generate these codes, as they are not valid
HTML or XML.
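
One possible workaround on the index.pl side is to remove the offending
characters before the text is entity-encoded. The following is a minimal
sketch, assuming it is acceptable to simply drop them;
strip_xml_invalid_controls is a hypothetical helper, not part of
HTML::Entities or swish-e.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: delete the control characters that XML 1.0 does
# not allow (everything below 0x20 except tab, LF and CR).
sub strip_xml_invalid_controls {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}

# Example: the \x02 is dropped, the tab and newline are kept.
print strip_xml_invalid_controls("a\x02b\tc\n");
```

In index.pl this would be applied to $contents before the call to
HTML::Entities::encode_entities, so that no &#n; references for those
characters are generated in the first place.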
Jonas
Received on Fri Jul 30 02:20:18 2004