Re: How to disable XML indexing of the title field

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 30 2003 - 01:25:43 GMT
On Wed, 29 Jan 2003, Tref Gare wrote:

Hi Tref,

[Yes, email was backed up on the mail server today]

> We've got a bunch of xml files which document events listed on a
> website.  Each xml file includes multiple <title> fields for each event
> appearing in the navigation as well as an <eventTitle> for the actual
> hero event.
> 
> The issue is that swish-e seems to be indexing all the title fields by
> default and therefore returning search hits for events that only appear
> in the nav.  We want to index only the hero event data.

The list strips off attachments so I didn't see your config file.  So I'm
not really clear on what's not working -- I don't see a "hero" tag below.

Swish only does special indexing of HTML titles -- it flags the <title>
words as title words (for ranking) and adds the words to the swishdefault
meta.

For xml parsing that doesn't happen, but everything is indexed as
swishdefault (by default).

The same parser is used for XML and HTML parsing.  The difference is that
on tag events some special processing is done with HTML tags, namely that
words are flagged if they are within some tags (e.g. in <title>, <b>,
<em> and so on) and <title> contents are indexed as swishdefault, as
mentioned above.

So, in short, swish gets told when a new tag is found, and then decides
what to do based on whether that tag is set as a PropertyName, a MetaName,
or in some other directive.

Now, you can tell swish-e to ignore *everything* inside a <title> tag:

  IgnoreMetaTags title

That will even ignore tags that are defined as MetaNames or PropertyNames
within the tag listed in IgnoreMetaTags.  It really means ignore.

You can also use the UndefinedMetaTags to not index tags that are not
defined by setting UndefinedMetaTags to "ignore".  But, that works best on
HTML files and <meta> tags because in XML everything is a tag.  Plus, as I
just mentioned, with both IgnoreMetaTags and UndefinedMetaTags if a tag is
ignored then everything inside that tag is ignored.  So, in your example
below, if <page> is not listed then nothing will be indexed.

What I think should be done is to make IgnoreMetaTags ignore all content
within that tag, but *if* a tag is listed in MetaNames or PropertyNames
then that should allow indexing.

It's also been suggested that "content" here

 <foo>
    <bar>
       content
    </bar>
 </foo>

should be indexed as the meta name "foo.bar" instead of just "bar".
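
To make that naming idea concrete, here is a rough sketch in Python. It
is not how swish-e works today; it only shows how a dotted name could be
built from the enclosing tags. The function name dotted_meta_names is
made up for this example, and text that follows a closing tag is ignored
to keep it short.

```python
import xml.etree.ElementTree as ET


def dotted_meta_names(source):
    """Yield (meta_name, text) pairs, naming each text run by its full tag path."""
    def walk(node, path):
        here = path + [node.tag]
        # Only the text directly inside the tag is considered here.
        text = (node.text or "").strip()
        if text:
            yield ".".join(here), text
        for child in node:
            yield from walk(child, here)

    yield from walk(ET.fromstring(source), [])


pairs = list(dotted_meta_names("<foo><bar> content </bar></foo>"))
print(pairs)  # [('foo.bar', 'content')]
```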


I didn't answer your question, probably.  If you have complex XML parsing
needs you may want to use -S prog and a program to parse out the text from
the xml as you like.  It's hard to have a few config directives in swish
that can deal with any structure of an xml file.
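
As a rough idea of what such a program could look like, here is a
minimal sketch in Python. It assumes the files look like your sample
below, that the hero data lives in <eventTitle>, <oneLiner> and
<paragraph> inside <event> (that list is my guess, adjust it), and that
the file names arrive as command-line arguments. The function names are
made up for this example. For each file it prints the Path-Name and
Content-Length headers that swish-e expects from a -S prog program,
then a blank line, then a small XML document with only the wanted text.

```python
import sys
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Tags to keep from each <event>; this list is an assumption.
HERO_TAGS = ("eventTitle", "oneLiner", "paragraph")


def hero_xml(source):
    """Return a small XML document holding only the hero event fields."""
    root = ET.fromstring(source)
    parts = ["<event>"]
    for event in root.iter("event"):
        for tag in HERO_TAGS:
            for node in event.iter(tag):
                # Collapse runs of whitespace left over from line wrapping.
                text = " ".join((node.text or "").split())
                if text:
                    parts.append("<%s>%s</%s>" % (tag, escape(text), tag))
    parts.append("</event>")
    return "\n".join(parts) + "\n"


def prog_record(path, body):
    """Wrap one document in the headers read from a -S prog program."""
    data = body.encode("utf-8")
    # Content-Length is counted in bytes, not characters.
    header = "Path-Name: %s\nContent-Length: %d\nDocument-Type: XML*\n\n" % (
        path, len(data))
    return header.encode("utf-8") + data


def main(paths):
    for path in paths:
        with open(path, "rb") as handle:
            record = prog_record(path, hero_xml(handle.read()))
        sys.stdout.buffer.write(record)


if __name__ == "__main__":
    main(sys.argv[1:])
```

With -S prog the program is named in IndexDir and its arguments can be
passed with SwishProgParameters. The nav <title> text never reaches
swish-e this way, so you would only need MetaNames for whatever tags
you want to search separately, e.g. eventTitle.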

> 
> <page>
> <navigation level="3">
>     <item><title><![CDATA[iron helmets, smoking guns]]></title>
>           <url><![CDATA[/B63D051CFD6D43BD8195BE881E1DA2E0.xml]]></url>
>     </item>
>     <item><title><![CDATA[the lion king]]></title>
>           <url><![CDATA[/C0D20B3637E541B98DE1273D279C7F84.xml]]></url>
>     </item>
> </navigation>
> <content>
>     <navigation level="4"></navigation>
>     <event version="1.0" 
>         id="F551B3166E6E42AB" eventID="" ticketingEventID=""
>         htmlLocation="" compoundType="activity stream">
>             <eventID></eventID>
>             <ticketingEventID></ticketingEventID>
>             <eventTitle>closer</eventTitle>
>             <eventThumbnail></eventThumbnail>
>             <title>closer</title>
>             <subtitle></subtitle>
>             <oneLiner>Aance company Chunky Move.</oneLiner>
>             <paragraph> he moves of an on-screen dancer. Come
> closer.</paragraph>
>             <eventTypes><eventType>Exhibition</eventType></eventTypes>
>             <dates>Friday 6 December 2002- Monday 27 January
> 2003</dates>
>             <times>Daily 10am</times>
>             <fullText>&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;</fullText>
>             <soldOutText></soldOutText>
>             <sponsorsPartners></sponsorsPartners>
>             <datesList>
>                 <interval>
>                     <startDate>2002-12-06</startDate>
>                     <endDate>2005-01-27</endDate>   
>                 </interval>
>             </datesList>
>     </event>
>  </content>
> </page>
> 
> Many thanks in advance.
> 
> Tref
> 
> ------------------------------------------------------
> Tref Gare
> Development Consultant
> Areeba
> Level 19/114 William St, Melbourne VIC 3000
> email: trefg@areeba.com.au
> phone: +61 3 9642 5553
> fax: +61 3 9642 1335
> website: http://www.areeba.com.au
> ------------------------------------------------------
> "This email is intended only for the use of the individual or entity
> named above and contains information that is confidential. No
> confidentiality is waived or lost by any mis-transmission. If you
> received this correspondence in error, please notify the sender and
> immediately delete it from your system. You must not disclose, copy or
> rely on any part of this correspondence if you are not the intended
> recipient. Any communication directed to clients via this message is
> subject to our Agreement and relevant Project Schedule. Any information
> that is transmitted via email which may offend may have been sent
> without knowledge or the consent of Areeba."
> ------------------------------------------------------
> 
> 
> 
> 
> *********************************************************************
> Due to deletion of content types excluded from this list by policy,
> this multipart message was reduced to a single part, and from there
> to a plain text message.
> *********************************************************************
> 

-- 
Bill Moseley moseley@hank.org
Received on Thu Jan 30 01:26:11 2003