At 08:41 PM 10/03/02 -0700, Roy Tennant wrote:
>Sorry, I should have known better. And I realize from your answer that
>I'm in big trouble. I have books that are contained entirely within
>this tag:
>
><TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
>stuff here
></TEI.2>
I think you are in big trouble.
Look at this:
~/swish-e/src > cat c
defaultcontents XML2 .xml
UndefinedXMLAttributes auto
~/swish-e/src > cat 1.xml
<?xml version="1.0"?>
<page>
<TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
stuff here
</TEI.2>
</page>
~/swish-e/src > ./swish-e -c c -i 1.xml -T indexed_words -v0
Adding:[1:tei.2.id(10)] 'ark' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:tei.2.id(10)] '13030' Pos:5 Stuct:0x1 ( FILE )
Adding:[1:tei.2.id(10)] 'ft2p30058m' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:tei.2.bnum(11)] 'bn5464' Pos:9 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'stuff' Pos:13 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'here' Pos:14 Stuct:0x1 ( FILE )
I'm not exactly sure why it's called UndefinedXMLAttributes, and note that
it still indexes the content of the tag. I mentioned this the other day, but
it would be nice if you could say:
IgnoreMetaTags TEI.2
and avoid indexing that content. The catch is that the attributes are within
that tag, so they would all be ignored too. Too many ways to parse XML, I fear.
Maybe we can figure out something better for the next release...
My suggestion would be to use one of the CPAN XML parsers and pull out the
attributes you want indexed.
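
As a rough illustration of that approach, here is a minimal sketch. It is written in Python with the standard-library xml.etree.ElementTree module rather than a CPAN parser, so treat it as a stand-in for the idea, not a recommendation of a specific module. The function names (extract_attributes, build_index_doc) and the <doc> wrapper element are made up for this example; the sample input is the 1.xml file shown above.

```python
import xml.etree.ElementTree as ET

# Same content as the 1.xml test file from the transcript above.
SAMPLE = """<?xml version="1.0"?>
<page>
<TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
stuff here
</TEI.2>
</page>
"""


def extract_attributes(xml_text, tag, wanted):
    """Return the wanted attributes found on every <tag> element.

    Hypothetical helper: element text is never looked at, so the
    book content itself is left out of the result.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for elem in root.iter(tag):
        for name in wanted:
            if name in elem.attrib:
                found[name] = elem.attrib[name]
    return found


def build_index_doc(attrs):
    """Wrap the attributes in a small XML document, one element each.

    The <doc> wrapper is an assumption for this sketch; use whatever
    layout your indexing setup expects.
    """
    doc = ET.Element("doc")
    for name, value in attrs.items():
        child = ET.SubElement(doc, name)
        child.text = value
    return ET.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    attrs = extract_attributes(SAMPLE, "TEI.2", ["id", "bnum"])
    print(build_index_doc(attrs))
```

The output is a small document holding only the attribute values, which can then be indexed in place of the original file so the body text inside the tag never reaches the indexer.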
--
Bill Moseley
mailto:moseley@hank.org
Received on Fri Oct 4 04:03:30 2002