At 08:28 AM 08/20/02 -0700, Clem McDonald wrote:
>My questions are:
> a) Does the index now contain a word distance from the start of a
>document?
No, it's more of a position counter, and there's only one counter (i.e.
not one for each metaname). It's bumped for each word, it's bumped again
when a metaname is found (to prevent matching phrases across metanames),
and there's also a config option to bump it on a set of characters. This
allows doing something like:
<meta name="subjects" content="some subject|another thing">
and not matching on the phrase "subject another"
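
To make the counter idea concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Swish-e's code: the function name assign_positions, the bump size of one, and the "|" default are all assumptions made for illustration.

```python
def assign_positions(fields, bump_chars="|"):
    """Toy model of one shared position counter (not Swish-e's real code).

    fields: list of (metaname, content) pairs in document order.
    Returns a list of (metaname, word, position) tuples.
    """
    bumps = set(bump_chars)
    position = 0
    indexed = []
    for metaname, content in fields:
        position += 1  # assumed: one bump whenever a metaname is found
        # Isolate bump characters so they become tokens of their own.
        for ch in bumps:
            content = content.replace(ch, f" {ch} ")
        for token in content.split():
            position += 1  # every word and every bump character advances the counter
            if token not in bumps:
                indexed.append((metaname, token.lower(), position))
    return indexed


words = assign_positions([("subjects", "some subject|another thing")])
for entry in words:
    print(entry)
```

In this model "subject" lands on position 3 and "another" on position 5, so a check for adjacent positions would not treat "subject another" as a phrase.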
> b) Does the phrase search represent a kind of proximity search as I
>have assumed?
(answering without looking at the code...)
Only in that it checks that each word's position in a given file is exactly
one more than the previous word's. I don't see why you couldn't say within,
say, five word positions -- but it might not be exactly five words apart,
because of the position bumps mentioned above and stop words.
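
A rough sketch of the difference between the two checks, in Python. This is a guess at the logic rather than a reading of the Swish-e source; the names is_phrase and is_near, the dict layout, and the sample positions are all hypothetical.

```python
def is_phrase(positions, words):
    """True if the words occur at consecutive positions, in order.

    positions: dict mapping word -> list of positions within one file.
    """
    if not words:
        return False
    for start in positions.get(words[0], []):
        if all((start + offset) in positions.get(word, [])
               for offset, word in enumerate(words[1:], start=1)):
            return True
    return False


def is_near(positions, first, second, max_distance=5):
    """True if `second` follows `first` within max_distance positions.

    Distances are counted in position-counter units, so any bumps
    between the two words also count against the limit.
    """
    return any(0 < b - a <= max_distance
               for a in positions.get(first, [])
               for b in positions.get(second, []))


# Hypothetical positions for one file; the gap before "world" stands in
# for a couple of position bumps.
sample = {"hello": [3], "big": [4], "world": [7]}
print(is_phrase(sample, ["hello", "big"]))    # adjacent positions
print(is_phrase(sample, ["hello", "world"]))  # positions 3 and 7
print(is_near(sample, "hello", "world", 5))   # 4 positions apart
```

The phrase check is just the special case of the proximity check where the allowed distance is exactly one.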
> c) Is there now something in the works for defining proximity in terms
>of word counts?
I don't know. Is there now? ;)
I don't know how valuable proximity searching would be. I think I'd rather
see work on the ranking, so that if you search for a few words (not a
phrase), docs where those words are close together are ranked higher
than docs where they are not.
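
One way such a ranking could work, sketched in Python. To be clear, this is not something Swish-e does as far as this message says; min_span and rank_by_proximity are hypothetical names, and the brute-force search over position combinations is only reasonable for short queries.

```python
from itertools import product


def min_span(positions, words):
    """Smallest position window containing one occurrence of every word.

    positions: dict mapping word -> list of positions within one file.
    Returns None if any word is missing from the file.
    """
    lists = [positions.get(word, []) for word in words]
    if not lists or any(not occurrences for occurrences in lists):
        return None
    # Brute force: try every combination of one position per word.
    return min(max(combo) - min(combo) for combo in product(*lists))


def rank_by_proximity(docs, words):
    """Return names of matching docs, tightest grouping of words first.

    docs: dict mapping doc name -> positions dict for that doc.
    """
    scored = []
    for name, positions in docs.items():
        span = min_span(positions, words)
        if span is not None:
            scored.append((span, name))
    return [name for span, name in sorted(scored)]


# Hypothetical index data for two files.
docs = {
    "far.xml": {"hello": [1], "world": [40]},
    "close.xml": {"hello": [1, 30], "world": [2]},
}
print(rank_by_proximity(docs, ["hello", "world"]))
```

Both files still match the plain AND search; only the order of the results changes.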
>So you could not tell the difference between the two cases when looking
>for Hello world.
>
> Document 1
> <tag a> Hello World < tag a/>
>
>
>Document 2
> <tag a> hello <tag a/>
> <tag a> world <tag /a>
Right, searching for the (non-phrase) "hello world" would find both.
>This is important to our application where we may have diagnoses made up
>of an anatomic part and a pathology part e.g.
>Document 1
> <diagnosis> Prostate benign < diagnosis>
> < diagnosis > Colon cancer < diagnosis>
>
I'm not completely following.
<?xml version="1.0"?>
<all>
<diagnosis> Prostate benign </diagnosis>
<diagnosis> Colon cancer </diagnosis>
</all>
> ./swish-e -c c -i 1.xml -T indexed_words -v0
Adding:[1:diagnosis(10)] 'prostate' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:diagnosis(10)] 'benign' Pos:5 Stuct:0x1 ( FILE )
Adding:[1:diagnosis(10)] 'colon' Pos:8 Stuct:0x1 ( FILE )
Adding:[1:diagnosis(10)] 'cancer' Pos:9 Stuct:0x1 ( FILE )
So you can see that they are all part of the same metaname "diagnosis", but
you can also see how the word positions are bumped (twice -- once for the
ending tag, and again for the start tag) to prevent matching the phrase
"benign colon" (thank goodness!).
>Think SWISH-E would find this document contains Prostate cancer>
> ./swish-e -w 'diagnosis=(prostate cancer)' -H0
1000 1.xml "1.xml" 116
Yes, it will find it. Is that what you want?
If not, then you would need to modify the docs a bit.
Swish-e has a magic bullet called -S prog. That means you can index any
document corpus you like since you can parse and reformat docs on-the-fly
while indexing to make them fit into how you want to search them.
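
As one possible reformatting (my assumption about what "modify the docs a bit" could mean, not the only option), a -S prog script could emit each <diagnosis> as its own small document, so that an AND search cannot pair words from two different diagnoses. The sketch below is Python and assumes the prog convention of a Path-Name header and a Content-Length header (in bytes), followed by a blank line and the content. The "1.xml#1" naming scheme and the function names are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def split_diagnoses(path_name, xml_text):
    """Yield (path, xml) pairs, one small document per <diagnosis> element."""
    root = ET.fromstring(xml_text)
    for n, diag in enumerate(root.iter("diagnosis"), start=1):
        text = escape((diag.text or "").strip())
        doc = f"<all><diagnosis>{text}</diagnosis></all>\n"
        # Hypothetical naming: source file name plus a running number.
        yield f"{path_name}#{n}", doc


def format_record(path_name, content):
    """Build one record: headers, blank line, then the document bytes."""
    body = content.encode("utf-8")
    header = f"Path-Name: {path_name}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


if __name__ == "__main__":
    # File names are taken from the command line here; how they reach
    # the script in a real setup is left open.
    for filename in sys.argv[1:]:
        with open(filename, encoding="utf-8") as fh:
            xml_text = fh.read()
        for path, doc in split_diagnoses(filename, xml_text):
            sys.stdout.buffer.write(format_record(path, doc))
```

With the documents split this way, "prostate" and "cancer" would end up in different indexed documents, so the diagnosis=(prostate cancer) search above would no longer match, while "colon cancer" still would.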
--
Bill Moseley
mailto:moseley@hank.org
Received on Tue Aug 20 16:34:32 2002