difference in XML2 vs HTML2 ?

From: Peter Karman <karman(at)not-real.cray.com>
Date: Tue Feb 03 2004 - 06:03:50 GMT
Forgive me, please, if someone else has pointed this out or if this is a 
known issue. I don't recall seeing this in the archive. This is also a 
bit long, but I tried to include the relevant examples.

I have a test doc. I parse it with libxml2. If I specifically tell 
swish-e to use the XML2 parser, I get different results than if I let it 
default to HTML2.

The difference seems to be that the XML2 version splits words on tags, 
while the HTML2 parser does not. The result? In the example below, if a 
user searches for:

-h[option]

and the files have been indexed with XML2, they won't get a hit. 
But if the files have been indexed with HTML2, they do.
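
To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is not swish-e or libxml2 code: the `TextCollector` class, the `extract_words` helper, and the `split_on_tags` flag are hypothetical names, and the only assumption is that one parser treats every tag boundary as a word break while the other simply concatenates the text on either side of a tag.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect character data, optionally treating each tag as a word break."""

    def __init__(self, split_on_tags):
        # convert_charrefs=True turns references such as &#045; into '-'
        super().__init__(convert_charrefs=True)
        self.split_on_tags = split_on_tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.split_on_tags:
            self.chunks.append(" ")  # XML2-like: a tag ends the current word

    def handle_endtag(self, tag):
        if self.split_on_tags:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def extract_words(markup, split_on_tags):
    """Return the whitespace-separated words found in the markup's text."""
    collector = TextCollector(split_on_tags)
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks).split()


SAMPLE = '&#045;h<tt class="literal">[access]</tt> paramto_the_option'

# HTML2-like behavior: text on both sides of an inline tag is joined.
print(extract_words(SAMPLE, split_on_tags=False))
# XML2-like behavior: every tag boundary splits the word.
print(extract_words(SAMPLE, split_on_tags=True))
```

With the sample line, the first call keeps `-h[access]` as a single word and the second yields `-h` and `[access]` separately, which is the same split reported in the PARSED_WORDS output further down.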

I guess my question is: should the HTML and XML versions really act so 
differently? I know the obvious answer is "use HTML2 for HTML docs, and 
vice versa" but my concern is that this spacing issue may throw off my 
indexing of XML docs, since (as in this example) the same search has two 
different results, depending on the source format.

I googled for this and found

http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/1777108

which leads me to believe that parsers work the same way.

I also found this gem:

http://mail.gnome.org/archives/xml/2001-September/msg00118.html

which leads me to believe that Bill has dealt with this already and has 
something authoritative to say. ;)

I looked at parser.c and it looks like there are two different functions 
called, one each for HTML and XML (htmlCreatePushParserCtxt and 
xmlCreatePushParserCtxt) -- does this mean the issue is with libxml2 and 
I should just suck it up and use some kind of preprocessor to strip out 
the inline tags? I am using libxml2 2.6.4.
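
A preprocessor of that kind could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the `strip_inline_tags` function and the `INLINE_TAGS` list are hypothetical, the list would need to match whatever inline elements the real documents use, and the regex approach assumes no `>` characters appear inside attribute values.

```python
import re

# Hypothetical list of inline elements to remove; adjust to the real markup.
INLINE_TAGS = ("tt", "span")


def strip_inline_tags(markup, tags=INLINE_TAGS):
    """Remove the opening and closing tags of the given elements.

    The text between the tags is kept, so words that straddle a tag
    boundary are glued back together before the indexer sees them.
    """
    names = "|".join(re.escape(tag) for tag in tags)
    # \b keeps 'tt' from matching longer names such as 'ttx'.
    pattern = re.compile(r"</?(?:%s)\b[^>]*>" % names, re.IGNORECASE)
    return pattern.sub("", markup)


line = '<tt CLASS="literal">-h <span CLASS="optional">[no]</span>aggress</tt>'
print(strip_inline_tags(line))
```

The output of such a filter could then be fed to the indexer in place of the original file, so the XML2 parser never sees the inline tags at all.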

====================================

karpet@cartermac 212% cat config
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/#()+{}[]%!&=$;:'<>?\|@^
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/#()+{}[]%!&=$;:'<>?\|@^
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz_-/#()+{}[]%!&=$;:'<>?\|@^
MinWordLimit 1
#IndexContents XML* .xml .html


karpet@cartermac 213% cat test.html
<html>
<a href="some/link.html">testing 123</a>
&#045;h<tt class="literal">[access]</tt> paramto_the_option

<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<tt CLASS="literal">-h <span CLASS="optional">[no]</span>aggress</tt>

</html>


karpet@cartermac 214% swish-e -i test.html -T PARSED_WORDS -c config -v 3
Parsing config file 'config'
Indexing Data Source: "File-System"
Indexing "test.html"

Checking file "test.html"...
   test.html - Using DEFAULT (HTML2) parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '[yes]aggress'
White-space found word '-h'
White-space found word '[no]aggress'
  (8 words)


karpet@cartermac 216% swish-e -i test.html -T PARSED_WORDS -c config -v 3
Parsing config file 'config'
Indexing Data Source: "File-System"
Indexing "test.html"

Checking file "test.html"...
   test.html - Using XML2 parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h'
White-space found word '[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '['
White-space found word 'yes'
White-space found word ']aggress'
White-space found word '-h'
White-space found word '[no]'
White-space found word 'aggress'
  (12 words)


========================================================

Moreover, if I use non-HTML tags in my test doc and the HTML2 parser is 
used, I get yet another set of results. libxml2 does indeed seem to parse 
HTML against the HTML DTD:

karpet@cartermac 239% xmllint test.html
<?xml version="1.0"?>
<html>
<a href="some/link.html">testing 123</a>
-h<tt class="literal">[access]</tt> paramto_the_option

<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>

</html>
karpet@cartermac 240% xmllint --html test.html
test.html:6: HTML parser error : Tag notag invalid
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>
                       ^
test.html:6: HTML parser error : Tag foo invalid
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>
                                                ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="some/link.html">testing 123</a><p>
-h<tt class="literal">[access]</tt> paramto_the_option

<tt class="literal">-h [<span class="optional">yes</span>]aggress</tt>
<notag class="literal">-h <foo class="optional">[no]</foo>aggress</notag></p>
</body></html>

===================

and here's what swish-e gives me (note that swish-e seems to see one 
more word when non-HTML tags are used...):

karpet@cartermac % cat test.html
<html>
<a href="some/link.html">testing 123</a>
&#045;h<tt class="literal">[access]</tt> paramto_the_option

<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<bar CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</bar>

</html>

karpet@cartermac % swish-e -i test.html -T PARSED_WORDS -c config -v 3
Checking file "test.html"...
   test.html - Using DEFAULT (HTML2) parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '[yes]aggress'
White-space found word '-h'
White-space found word '[no]'
White-space found word 'aggress'
  (9 words)


-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Feb 2 22:03:59 2004