On Tue, Oct 26, 2004 at 11:42:09PM -0700, Stein-Egil Museus wrote:
> <row><entry><para>559999</para></entry><entry><para>Some text</para></entry><row>
Does anyone know what the XML spec says about this? How do you know which
tags should split text?
With HTML some tags are block level and some are inline:
moseley@laptop:~$ cat 1.html
<html>
<head>
<body>
<div>first</div>second<b>third</b>forth<div>sixth</div>last
</body>
</html>
moseley@laptop:~$ swish-e -i 1.html -T indexed_words -v0
Adding:[1:swishdefault(1)] 'first' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'secondthirdforth' Pos:10 Stuct:0x49 ( EM BODY FILE )
Adding:[1:swishdefault(1)] 'sixth' Pos:13 Stuct:0x49 ( EM BODY FILE )
Adding:[1:swishdefault(1)] 'last' Pos:16 Stuct:0x49 ( EM BODY FILE )
Libxml2 provides a way to tell the difference.
From a quick look at src/parser.c, it looks like you might be able to
uncomment the "append_buffer()" call at about line 1068 if you want all
tags to be treated as block level.
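
To make the block-versus-inline behaviour in the swish-e output above concrete, here is a minimal Python sketch using only the standard library's html.parser. It is not how swish-e or libxml2 implements this. The BLOCK_TAGS set, the WordSplitter class, and the split_words() helper with its all_block flag are all names made up for illustration, and the tag list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of block-level tags (an assumption for this sketch).
BLOCK_TAGS = {"html", "head", "body", "div", "p", "table", "tr", "td", "li"}


class WordSplitter(HTMLParser):
    """Collect words; only block-level tags (or whitespace) end a word."""

    def __init__(self, all_block=False):
        super().__init__()
        self.all_block = all_block  # if True, every tag acts as a word boundary
        self.words = []
        self.current = ""

    def flush(self):
        # Emit the word collected so far, if any.
        if self.current:
            self.words.append(self.current)
            self.current = ""

    def handle_starttag(self, tag, attrs):
        if self.all_block or tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if self.all_block or tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Whitespace always separates words. Text on either side of an
        # inline tag is joined because no flush happens in between.
        for ch in data:
            if ch.isspace():
                self.flush()
            else:
                self.current += ch


def split_words(html, all_block=False):
    parser = WordSplitter(all_block)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.words


doc = "<div>first</div>second<b>third</b>forth<div>sixth</div>last"
print(split_words(doc))
# ['first', 'secondthirdforth', 'sixth', 'last']
print(split_words(doc, all_block=True))
# ['first', 'second', 'third', 'forth', 'sixth', 'last']
```

The first call matches the words swish-e reported for 1.html. The second shows the effect of treating every tag as block level, which is roughly what an XML indexer would need for markup like the <entry><para> rows quoted above.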
--
Bill Moseley
moseley@hank.org
Received on Wed Oct 27 07:37:21 2004