Re: searching XML documents

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Apr 01 2004 - 19:52:44 GMT
Best to copy the list on these things, for the sake of those who come 
after us.

swishe supposedly wrote on 04/01/2004 01:12 PM:

>>>(1) How can we limit search results to records having date >= <datestring>
>>
>>Does the -L option help at all? It's listed as experimental, but the 
>>docs suggest that a range of dates is the intended function.
> 
> 
> Thank you very much - it worked even if it's not as fast as all the 
> other searches we've tested. But I suppose that you have to do something
> like a "full table scan".

glad that worked.

> 
> 
>>>(2) How can we search combinations of attributes in one subrecord?
>>>   e.g. -w attribute1=A attribute2=B
>>>   In our tests swish-e also finds record 2 but we only want to get 
>>>   record 1
>>>   cause only there "A / B" is found in one subrecord.
>>>   
>>

>>You might try using the -S prog method to split up your subrecords into 
>>actual, distinct xml "files". That way each one would be a distinct 
>>"file" and could be returned that way.
> 
> 
> OK, but what I really want to do is - speaking in SQL terms - a join
> of two different types of records (or entities).
> My first entity (record) may have n subrecords (entity 2) - 1:n.
> I'm looking for a way to select records matching a combination
> of record-attributes joined with subrecords matching a combination of 
> subrecord attributes (combination means boolean AND).
> Example:
> - records are book titles
> - subrecords contain information about books of a specific library
>   e.g. signature, location, field of research (chemistry, physics, 
>   comp. science etc.)
> Now I want to find books using title keywords or author names etc. 
> for a specific field of research at a specific location.


If I understand you correctly, I think you have to do two things: make 
your XML more descriptive (unique) and perhaps manipulate the results 
after you have them. There's no sense of "different types of records" in 
a single swish index. You could, I suppose, create multiple indexes of 
different kinds (titles.index and info.index) and then merge the 
properties back together to form a virtual table.

But I think you'd probably be better off doing that with a real SQL 
database. swish is really good at indexing and searching text, but it 
assumes relationships between data are consistent through a single set 
of documents.

your example---

configuration:
    IndexDir .
    IndexOnly .xml
    IndexContents XML2 .xml
    UndefinedMetaTags auto
    UndefinedXMLAttributes auto
    PropertyNames date attribute1 attribute2

2 xml records:

<record>
    <id>1</id>
    <date>20040213</date>
    ... more record elements
    <subrecord>
       <attribute1> A </attribute1>
       <attribute2> B </attribute2>
    </subrecord>
    <subrecord>
       <attribute1> C </attribute1>
       <attribute2> D </attribute2>
    </subrecord>
</record>

<record>
    <id>2</id>
    <date>20040115</date>
    ... more record elements
    <subrecord>
       <attribute1> A </attribute1>
       <attribute2> D </attribute2>
    </subrecord>
    <subrecord>
       <attribute1> C </attribute1>
       <attribute2> B </attribute2>
    </subrecord>
</record>

--end your example

The docs say this (under PropertyNames):

If Swish-e finds more than one property of the same name in a document 
the property's contents will be concatenated for strings, and a warning 
issued for numeric (or date) properties.

I understand that to mean that, in your example XML, if two attribute1 
tags appear in a document, their contents are captured together in a 
single property. That makes them virtually useless to you.
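To make the problem concrete, here is a toy model in Python of why record 2 
also comes back. It is only an illustration of the behavior described in this 
thread, not of swish-e's internals: the data structure and both function names 
are made up for the example. It assumes that matching happens per document, so 
the values from all subrecords of a document end up pooled together.

```python
# Toy model of the two records from the example above.
# Each record is a list of (attribute1, attribute2) pairs, one per subrecord.
records = {
    1: [("A", "B"), ("C", "D")],
    2: [("A", "D"), ("C", "B")],
}


def match_per_document(subrecords, a1, a2):
    """Pool all attribute values of one document, then AND the two tests.

    This mimics what happens when every subrecord lives in the same document.
    """
    all_a1 = {pair[0] for pair in subrecords}
    all_a2 = {pair[1] for pair in subrecords}
    return a1 in all_a1 and a2 in all_a2


def match_per_subrecord(subrecords, a1, a2):
    """Require both values to occur in the same subrecord."""
    return any(pair == (a1, a2) for pair in subrecords)


doc_hits = [rid for rid, subs in records.items()
            if match_per_document(subs, "A", "B")]
sub_hits = [rid for rid, subs in records.items()
            if match_per_subrecord(subs, "A", "B")]

print(doc_hits)  # [1, 2]  record 2 matches because A and B both occur somewhere
print(sub_hits)  # [1]     only record 1 has A and B in one subrecord
```

The second function is the behavior you are after, and it only becomes 
available to swish once each subrecord is its own document.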

If I had those XML records as you describe, I might try flipping them 
inside out, so that each document is smaller and holds exactly one 
attribute1/attribute2 pair.
Take your current <record> number two and split it into two:

<subrecord>
<id>2</id>
<date>20040115</date>
<attribute1> C </attribute1>
<attribute2> B </attribute2>
</subrecord>

<subrecord>
<id>2</id>
<date>20040115</date>
<attribute1> A </attribute1>
<attribute2> D </attribute2>
</subrecord>

that might let you manipulate a little more with swish.
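
A rough sketch of that split in Python, in case it helps. It assumes each 
<record> is well-formed XML (closing </subrecord> tags included) and that id 
and date are the only parent fields you want copied down. The function names 
and the path names are hypothetical. The second function wraps a document in 
the Path-Name / Content-Length headers that the -S prog method expects, as I 
remember them from the SWISH-RUN docs, so check those before relying on it.

```python
import xml.etree.ElementTree as ET

# Record 2 from the example, with the subrecord tags properly closed.
RECORD_2 = """<record>
    <id>2</id>
    <date>20040115</date>
    <subrecord>
       <attribute1> A </attribute1>
       <attribute2> D </attribute2>
    </subrecord>
    <subrecord>
       <attribute1> C </attribute1>
       <attribute2> B </attribute2>
    </subrecord>
</record>"""


def flatten_record(record_xml, parent_fields=("id", "date")):
    """Return one flat <subrecord> XML string per subrecord of the record.

    Each output document repeats the chosen parent fields, so a hit can be
    traced back to the record it came from.
    """
    record = ET.fromstring(record_xml)
    docs = []
    for sub in record.findall("subrecord"):
        flat = ET.Element("subrecord")
        for tag in parent_fields:
            field = record.find(tag)
            if field is not None:
                ET.SubElement(flat, tag).text = field.text
        for child in sub:
            ET.SubElement(flat, child.tag).text = child.text
        docs.append(ET.tostring(flat, encoding="unicode"))
    return docs


def as_prog_document(path_name, xml_text):
    """Prefix a document with headers for -S prog (header names assumed)."""
    body = xml_text.encode("utf-8")
    header = f"Path-Name: {path_name}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body
```

You would call flatten_record() on every record and write each result through 
as_prog_document() to stdout, using something like "record2-sub1.xml" as the 
path name. A search for attribute1=A and attribute2=B would then only hit 
documents where both values sit in the same subrecord.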

pek
-- 
Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Thu Apr 1 11:52:44 2004