Re: Relative Newbie Swish-e indexing query

From: Tref Gare <TrefG(at)not-real.areeba.com.au>
Date: Fri Nov 22 2002 - 01:35:32 GMT

Thanks Bill,

You've got my intentions mostly right there.

I'm trying to index a variety of elements and possibly attributes of the
xml files.

I then need to be able to search the index for texts and/or dates in
those specific fields.

Do I understand you right that this isn't quite in swish-e's scope? If
so, is it the attribute stuff that is stretching the envelope, or the
indexing of the values?

I have some control over the xml's design and could change most things
into elements if that would bring the app back into swish-e's world, as
the xml is basically there for this searching functionality.

As such I've made some adjustments to the xml so that it now looks like
this:

<event id="1341341234">
	
	<htmlLocation>http://blesdfs.sdlfsf.sflsdf/sfdfs.htm</htmlLocation>
	<eventTitle>Run Lola Run</eventTitle>
	<oneLiner>it's a goodun</oneLiner>
	<description>no a.. really really goodun</description>
	... other stuff
	<interval>
		<startDate>22/11/2002</startDate>
		<endDate>24/11/2002</endDate>
	</interval>
	<interval>
		<startDate>25/11/2002</startDate>
	</interval>
</event>

To extract these I'm adding the following to my config file:

MetaNames oneLiner eventTitle htmlLocation startDate endDate
PropertyNames oneLiner htmlLocation startDate endDate eventTitle


All this seems to make sense to me, however I'm still not getting the
fields back (except, strangely enough, for the oneLiner element).

------------------------------------------------------
Tref Gare
Development Consultant
Areeba
Level 19/114 William St, Melbourne VIC 3000
email: trefg@areeba.com.au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
------------------------------------------------------

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Friday, 22 November 2002 12:02 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Relative Newbie Swish-e indexing query

At 04:37 PM 11/21/02 -0800, Tref Gare wrote:
>I'm trying to define them as attributes via the following lines in my
>swish.config
>
>MetaNames description event keywords oneLiner title
>XMLClassAttributes htmlLocation startDate endDate
># adding the property names line
>PropertyNames oneLiner keywords htmlLocation startDate endDate
>
>Anyone got any thoughts as to why I can't seem to access/reference/index
>them?

I'm not clear what you want to do.

XMLClassAttributes does this:

<event version="1.0">
    <datesList>
        <interval startDate="2002-11-21">
           firstone
        </interval>
    
        <interval startDate="2002-11-19">
           secondone
        </interval>
    </datesList>
</event>

> cat t
IndexContents XML2 .xml
XMLClassAttributes startDate

> ./swish-e -c t -i 2.xml -T parsed_tags  -v0                         
<event> (undefined meta name - no action)
<dateslist> (undefined meta name - no action)
<interval> (undefined meta name - no action)
<interval.2002-11-21> (undefined meta name - no action)
<interval> (undefined meta name - no action)
<interval.2002-11-19> (undefined meta name - no action)

So notice it's making a tag by combining the tag <interval> with the
*value* of the startDate attribute.
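
The naming rule seen in that output can be mimicked in a few lines of Python. This is only a sketch of the observed behaviour, not swish-e's own parser: the function name class_attribute_tags is made up for illustration, and lowercasing the attribute value along with the tag name is an assumption.

```python
import xml.etree.ElementTree as ET


def class_attribute_tags(xml_text, class_attrs):
    """List tag names in document order, roughly as -T parsed_tags shows them.

    For each attribute named in class_attrs, an extra name made of
    "<tag>.<attribute value>" is emitted right after the plain tag.
    Lowercasing the value is an assumption; the tags in the output
    above are clearly lowercased.
    """
    wanted = {a.lower() for a in class_attrs}
    names = []
    for elem in ET.fromstring(xml_text).iter():
        tag = elem.tag.lower()
        names.append(tag)
        for attr, value in elem.attrib.items():
            if attr.lower() in wanted:
                names.append(f"{tag}.{value.lower()}")
    return names
```

Run against the sample file above with startDate as the class attribute, this yields the same six names that swish-e reported.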

So by adding this to the config:

  PropertyNames interval.2002-11-19

you get

 ./swish-e -c t -i 2.xml -T properties  -v0            
          swishdocpath: 6 (  5) S: "2.xml"
          swishdocsize: 8 (  4) N: "235"
     swishlastmodified: 9 (  4) D: "2002-11-21 16:43:12"
   interval.2002-11-19:10 (  9) S: "secondone"

I doubt that's what you want.   Do you want to index the *value*?

This doesn't work well, but:

> cat 2.xml
<event version="1.0">
    <datesList>
        <interval startDate="2002-11-21" />
        <interval startDate="2002-11-19" />
    </datesList>
</event>

> cat t
IndexContents XML2 .xml
UndefinedXMLAttributes ignore
PropertyNames interval.startdate

> ./swish-e -c t -i 2.xml -T properties  -v0 
          swishdocpath: 6 (  5) S: "2.xml"
          swishdocsize: 8 (  4) N: "153"
     swishlastmodified: 9 (  4) D: "2002-11-21 16:52:22"
    interval.startdate:10 ( 21) S: "2002-11-21 2002-11-19"

Notice how now there's a swish-created "interval.startdate" metaname
(a property in this example) whose data is made up of the startDate
value from each <interval>.
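
To make that result concrete, here is a rough Python imitation of what the output suggests is happening: every attribute becomes a "tag.attribute" name, and repeated values are joined with spaces. It is a sketch of the observed behaviour only; attribute_properties is a hypothetical helper, not part of swish-e.

```python
import xml.etree.ElementTree as ET


def attribute_properties(xml_text, property_names):
    """Gather attribute values under lowercased "tag.attribute" names.

    Only names listed in property_names are kept, mirroring the
    PropertyNames line in the config. Repeated values are joined with
    a single space, as in the -T properties output above.
    """
    keep = {name.lower() for name in property_names}
    found = {}
    for elem in ET.fromstring(xml_text).iter():
        for attr, value in elem.attrib.items():
            name = f"{elem.tag}.{attr}".lower()
            if name in keep:
                found.setdefault(name, []).append(value)
    return {name: " ".join(values) for name, values in found.items()}
```

The two dates end up in one string, which is part of why this approach does not work well for searching on individual dates.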

There's a bunch of weird problems with this xml parsing.  For one thing,
it's hard to index only some deeply nested content.  That's because if
an outer tag is ignored then the inner tag is not seen.

Also, I think it's sometimes hard to convert the nested xml structure
into the flattened metanames that swish-e uses.  XML gives a flexible way
to represent data, and that doesn't always map onto a few neat config
options for swish-e.

If you have complex xml data where you only want to index specific parts
then it's probably smart to use -S prog with an XML SAX or DOM parser and
extract the specific data you want.
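
A minimal sketch of that -S prog approach, in Python rather than the Perl usually seen on this list, might look like the following. It assumes the element names from Tref's sample (eventTitle, oneLiner, htmlLocation, startDate, endDate); the function names are hypothetical. It relies on the -S prog convention of a Path-Name and a Content-Length header, a blank line, and then the document body, which is worth checking against the swish-e documentation for your version.

```python
import sys
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Element names taken from the sample event file; adjust to your own data.
FIELDS = ("eventTitle", "oneLiner", "htmlLocation", "startDate", "endDate")


def flatten_event(xml_source):
    """Collect the text of each wanted element, however deeply nested.

    Repeated elements (such as startDate in several intervals) are
    joined with spaces into a single value.
    """
    root = ET.fromstring(xml_source)
    flat = {}
    for name in FIELDS:
        values = [(e.text or "").strip() for e in root.iter(name)]
        values = [v for v in values if v]
        if values:
            flat[name] = " ".join(values)
    return flat


def to_swish_document(path, flat):
    """Build one document for swish-e: headers, blank line, flat XML body."""
    parts = [f"<{name}>{escape(value)}</{name}>" for name, value in flat.items()]
    body = ("<event>" + "".join(parts) + "</event>\n").encode("utf-8")
    # Content-Length must be the size of the body in bytes.
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


if __name__ == "__main__":
    # File names arrive as command-line arguments in this sketch.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            flat = flatten_event(fh.read())
        sys.stdout.buffer.write(to_swish_document(path, flat))
```

Because every field is written as a direct child of <event>, the MetaNames and PropertyNames lines from the config no longer depend on how swish-e handles the nested <interval> tags. The original .xml Path-Name is kept so that an IndexContents XML2 .xml line in the config should still select the XML parser.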


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Nov 22 01:35:44 2002