Re: Proximity searches

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 20 2002 - 16:31:04 GMT
At 08:28 AM 08/20/02 -0700, Clem McDonald wrote:
>My questions are:
>    a) Does the index now contain a word distance from the start of a
>document?

No, it's more of a position counter -- and there's only one counter (i.e.
not one for each metaname).  It's bumped for each word, then it's bumped
when a meta name is found (to prevent matching phrases across metanames),
and there's also a config option to bump it on a set of characters.  This
allows doing something like:

   <meta name="subjects" content="some subject|another thing">

and not matching on the phrase "subject another"
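
To make the counter idea concrete, here is a rough Python sketch. It is not
the swish-e code, and the names (assign_positions, bump_chars) are made up
for illustration. It assumes one shared counter that advances once per
metaname, once per word, and once per configured bump character.

```python
import re

def assign_positions(fields, bump_chars="|"):
    """Return (word, metaname, position) tuples using ONE shared counter.

    fields is a list of (metaname, text) pairs. This only illustrates the
    idea; the real indexer's starting value and bump rules may differ.
    """
    # Match either a word or one of the bump characters.
    token_re = re.compile(r"(?P<word>\w+)|(?P<bump>[%s])" % re.escape(bump_chars))
    position = 0
    out = []
    for metaname, text in fields:
        position += 1                  # bump when a metaname is found
        for m in token_re.finditer(text):
            position += 1              # every word or bump char advances the counter
            if m.lastgroup == "word":  # bump chars use up a slot but are not indexed
                out.append((m.group().lower(), metaname, position))
    return out

if __name__ == "__main__":
    for row in assign_positions([("subjects", "some subject|another thing")]):
        print(row)
```

With the "|" treated as a bump character, "subject" and "another" end up
two positions apart instead of one, so a phrase match on adjacent positions
fails for "subject another".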

>  b)  Does the phrase search represent a kind of proximity search as I
>have assumed?

(answering without looking at the code...)

Only in that it checks that a word's position in a given file is exactly one
more than the previous word's.  I don't see why you couldn't say within, say,
five word positions -- but it might not be exactly five word positions due
to things mentioned above and stop words.
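
As a sketch of what I mean (again, not the actual swish-e code; the function
name and max_gap parameter are hypothetical), a phrase check and a "within N
positions" check could be the same test with a different gap:

```python
def within(positions_a, positions_b, max_gap=1):
    """True if some occurrence of word b follows some occurrence of word a
    by at least 1 and at most max_gap positions.

    max_gap=1 is the phrase case: b sits exactly one position after a.
    """
    return any(0 < b - a <= max_gap for a in positions_a for b in positions_b)

if __name__ == "__main__":
    # Positions taken from the -T indexed_words trace later in this message.
    benign, colon = [5], [8]
    print(within(benign, colon))             # phrase test
    print(within(benign, colon, max_gap=5))  # looser proximity test
```

Note that the gap is measured in counter positions, not words, so bumps for
metanames, bump characters, and skipped stop words all count toward it.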

>  c) Is there now something in the works for defining proximity in terms
>of word counts?

I don't know.  Is there now?  ;)

I don't know how valuable proximity searching would be.  I think I'd rather
see work on the ranking so that if you search for a few words (not a
phrase), docs where those words are close together are ranked higher
than docs where they are far apart.
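
One possible way to score that, purely as an illustration: find the smallest
window of positions that contains every query word and turn its width into a
boost. This is not how swish-e ranks today, and smallest_span and
proximity_boost are names I made up for the sketch.

```python
def smallest_span(position_lists):
    """Width of the smallest window holding one position from every list.

    position_lists has one list of word positions per query word.
    Returns None if any query word has no positions in the document.
    """
    if not position_lists or any(not p for p in position_lists):
        return None
    # Flatten to (position, word_index) and sweep a window over them.
    events = sorted((pos, i) for i, plist in enumerate(position_lists) for pos in plist)
    counts = [0] * len(position_lists)
    covered = 0
    best = None
    left = 0
    for pos, i in events:
        if counts[i] == 0:
            covered += 1
        counts[i] += 1
        # Shrink from the left while every word is still inside the window.
        while covered == len(position_lists):
            width = pos - events[left][0]
            if best is None or width < best:
                best = width
            j = events[left][1]
            counts[j] -= 1
            if counts[j] == 0:
                covered -= 1
            left += 1
    return best

def proximity_boost(position_lists):
    """Score in (0, 1]; closer words give a larger value, 0.0 if a word is missing."""
    span = smallest_span(position_lists)
    return 0.0 if span is None else 1.0 / (1 + span)
```

The boost would be multiplied into (or added to) whatever rank the document
already has, so closeness breaks ties rather than replacing the base rank.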

>So you could not tell the difference between the two cases when looking
>for Hello world.
>
> Document 1
>    <tag a> Hello  World < tag a/>
>
>
>Document 2
>   <tag a> hello <tag a/>
>    <tag a> world <tag /a>

Right, searching for the (non-phrase) "hello world" would find both.

>This is important to our application where we may have diagnoses made up
>of an anatomic part and and a pathology part e.g.
>Document 1
>  <diagnosis> Prostate benign < diagnosis>
> < diagnosis > Colon cancer < diagnosis>
>

I'm not completely following.  

<?xml version="1.0"?>
<all>
  <diagnosis> Prostate benign </diagnosis>
  <diagnosis> Colon cancer </diagnosis>
</all>

> ./swish-e -c c -i 1.xml -T indexed_words -v0
    Adding:[1:diagnosis(10)]   'prostate'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:diagnosis(10)]   'benign'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:diagnosis(10)]   'colon'   Pos:8  Stuct:0x1 ( FILE )
    Adding:[1:diagnosis(10)]   'cancer'   Pos:9  Stuct:0x1 ( FILE )

So you can see that they are all part of the same metaname "diagnosis", but
you can also see how the word positions are bumped (twice -- once for
ending tag, and again for start tag) to prevent matching the phrase "benign
colon" (thank goodness!).

>Think SWISH-E would find  this document contains Prostate cancer>

> ./swish-e -w 'diagnosis=(prostate cancer)' -H0
1000 1.xml "1.xml" 116

Yes, it will find it.  Is that what you want?

If not, then you would need to modify the docs a bit.

Swish-e has a magic bullet called -S prog.  That means you can index any
document corpus you like since you can parse and reformat docs on-the-fly
while indexing to make them fit into how you want to search them.
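
For example, if you want each diagnosis treated as its own document (so
"prostate" and "cancer" from different diagnoses no longer match together),
a -S prog program could split the file up.  Below is an untested sketch in
Python.  The header names (Path-Name, Content-Length) are as I recall them
from the -S prog docs, so check the docs for your version; the "#1", "#2"
path suffix and the function name are just my invention.

```python
import sys
import xml.etree.ElementTree as ET

def records(xml_source, base_name):
    """Yield one byte record per <diagnosis> element.

    Each record is a small header block, a blank line, then the content.
    Content-Length is the byte length of the content that follows.
    """
    root = ET.fromstring(xml_source)
    for n, elem in enumerate(root.iter("diagnosis"), start=1):
        body = ET.tostring(elem, encoding="utf-8")  # bytes
        header = "Path-Name: %s#%d\nContent-Length: %d\n\n" % (base_name, n, len(body))
        yield header.encode("utf-8") + body

if __name__ == "__main__":
    # Assumes the XML file names arrive as command-line arguments.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for rec in records(fh.read(), path):
                sys.stdout.buffer.write(rec)
```

With -S prog the -i argument names the program to run rather than the files
to index.  A search for diagnosis=(prostate cancer) should then no longer
match, since each diagnosis is indexed as a separate document.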


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Aug 20 16:34:32 2002