Re: period in meta name

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 04 2002 - 03:59:40 GMT

At 08:41 PM 10/03/02 -0700, Roy Tennant wrote:
>Sorry, I should have known better. And I realize from your answer that 
>I'm in big trouble. I have books that are contained entirely within 
>this tag:
>
><TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
>stuff here
></TEI.2>

I think you are in big trouble.

Look at this:

~/swish-e/src > cat c
defaultcontents XML2 .xml
UndefinedXMLAttributes auto

~/swish-e/src > cat 1.xml
<?xml version="1.0"?>
<page>
<TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
stuff here
</TEI.2>
</page>

~/swish-e/src > ./swish-e -c c -i 1.xml -T indexed_words  -v0
    Adding:[1:tei.2.id(10)]   'ark'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:tei.2.id(10)]   '13030'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:tei.2.id(10)]   'ft2p30058m'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:tei.2.bnum(11)]   'bn5464'   Pos:9  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'stuff'   Pos:13  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'here'   Pos:14  Stuct:0x1 ( FILE )

I'm not exactly sure why it's called UndefinedXMLAttributes.  But that
still indexes the content of the tag.  I mentioned this the other day, but
it would be nice if you could say:

   IgnoreMetaTags TEI.2

and avoid indexing that content -- but since the attributes are within that
tag, they are all ignored as well.  Too many ways to parse XML, I fear.  Maybe
we can figure out something better for the next release...

My suggestion would be to use one of the CPAN XML parsers and pull out the
attributes you want indexed.
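
As a rough illustration of that pre-processing idea, here is a minimal sketch.
It uses Python's standard xml.etree.ElementTree rather than a CPAN module,
purely to keep the example self-contained; the same approach should carry over
to any XML parser. The function names and the output element names (doc, body)
are hypothetical, chosen only so that no meta name contains a period. It
assumes the attributes of interest sit on a single TEI.2 element, as in the
sample file above.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the 1.xml shown in the message above.
SAMPLE = """<?xml version="1.0"?>
<page>
<TEI.2 id="ark:/13030/ft2p30058m" bnum="bn5464">
stuff here
</TEI.2>
</page>
"""


def extract(xml_text, tag, wanted):
    """Return (attrs, text) for the first <tag> element found.

    attrs holds only the attribute names listed in `wanted`;
    text is the element's leading text with whitespace trimmed.
    """
    root = ET.fromstring(xml_text)
    # iter() matches on the literal tag name, so "TEI.2" needs no escaping.
    elem = next(root.iter(tag), None)
    if elem is None:
        return {}, ""
    attrs = {name: elem.attrib[name] for name in wanted if name in elem.attrib}
    return attrs, (elem.text or "").strip()


def to_flat_xml(attrs, body_text):
    """Build a flat document with one element per extracted attribute.

    The element names "doc" and "body" are hypothetical placeholders.
    """
    doc = ET.Element("doc")
    for name, value in attrs.items():
        ET.SubElement(doc, name).text = value
    ET.SubElement(doc, "body").text = body_text
    return ET.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    attrs, text = extract(SAMPLE, "TEI.2", ["id", "bnum"])
    print(to_flat_xml(attrs, text))
```

The flattened output could then be handed to the indexer in place of the
original file, so the meta names become plain id and bnum instead of tei.2.id
and tei.2.bnum.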

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Oct 4 04:03:30 2002