Re: Win32 PHRASE search

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Apr 25 2000 - 14:39:12 GMT
At 10:43 AM 04/24/00 -0700, Jose Manuel Ruiz wrote:
>More about stop words...
>
>In config.h you can find the following line:
>
>#define IGNORE_STOPWORDS_IN_QUERY 1

Oh, I wonder if there is a problem with SRE's change to that routine in
search.c.  I'm getting a segfault.

For example, a document that contains this:

   "...with a searchable database of over 5,000 recipes..."

Searching with this (a non-phrase search) triggers the crash:

> ../swish-e -w 'keywords=(database of over)'

I'm not very good at gdb, but here's a backtrace:

(gdb) run -w 'keywords=(database of over)'
Starting program: swish-e -w 'keywords=(database of over)'

NOTE: this is really the 'h' version, it just says 'g'.

# SWISH format 1.3.2g PHRASE
# Swish-e format 1.3.2g - PHRASE
# 
# Name: (no name)
# Saved as: index.swish-e
# Counts: 25093 words, 6053 files
# Indexed on: 25/04/2000 06:33:27 PDT
# Description: (no description)
# Pointer: (no pointer)
# Maintained by: (no maintainer)
# DocumentProperties: Enabled
# Stemming Applied: 1
# Soundex Applied: 0
# Removed stopword: of

Program received signal SIGSEGV, Segmentation fault.
chunk_free (ar_ptr=0xba057400, p=0x40131ba8) at malloc.c:2993
2993    malloc.c: No such file or directory.

(gdb) bt
#0  chunk_free (ar_ptr=0xba057400, p=0x40131ba8) at malloc.c:2993
#1  0x40099807 in __libc_free (mem=0x40131bb0) at malloc.c:2967
#2  0x805309c in efree (ptr=0x40131bb0) at mem.c:148
#3  0x804f1f4 in search (words=0xbfffeccc "keywords=(database of over)",
indexlist=0x808cbd8, structure=1) at search.c:331
#4  0x8058d99 in main (argc=0, argv=0xbffff8dc) at swish.c:619
(gdb) 


>So, I am wondering if IGNORE_STOPWORDS_IN_QUERY makes any sense now.
>It always has to be enabled!!

Exactly.  I can't see any reason that should be an option.  I wonder if
that was added as a step to fix the broken search.c logic.

>> Plus, I really think that swish should parse text on searching exactly like
>> it does on indexing.  Otherwise, it is very confusing as you can't search
>> for text cut directly from the source document and expect it to work.  That
>> means the wordchars, ignore first and last, and other settings would need
>> to be saved in the index file (just like the Use Stemming: setting).
>>
>
>Yes, it should work that way. But this can be a major change. Let me
>look at the code... There are other things that may be also included in the
>index file. 

I had that working at one point -- well, kind of working -- but it was a
hack, and my C skills are not good enough to really get it right.

The first problem I had was that I had to read the swish.conf file on every
search to get WordCharacters and the other settings that determine what
defines a "swish" word.  The solution is to put those settings in the index
file header.
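
A minimal sketch of that idea, assuming the setting is stored as one more
"# Key: value" line like the header lines in the gdb output above.  The key
name "# WordCharacters: " and both function names are hypothetical, not taken
from the swish-e source:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical header key; the real index format may use another name. */
#define HDR_WORDCHARS "# WordCharacters: "

/* Format one header line into buf.
 * Returns 0 on success, -1 if the line does not fit. */
int format_wordchars_header(char *buf, size_t buflen, const char *wordchars)
{
    int n = snprintf(buf, buflen, "%s%s\n", HDR_WORDCHARS, wordchars);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* If line is a WordCharacters header, copy its value (without the
 * trailing newline) into out.
 * Returns 1 if the line matched and the value fit, 0 otherwise. */
int parse_wordchars_header(const char *line, char *out, size_t outlen)
{
    size_t klen = strlen(HDR_WORDCHARS);
    size_t vlen;

    if (strncmp(line, HDR_WORDCHARS, klen) != 0)
        return 0;                      /* some other header line */
    line += klen;
    vlen = strcspn(line, "\r\n");      /* value ends at end of line */
    if (vlen >= outlen)
        return 0;                      /* value would not fit */
    memcpy(out, line, vlen);
    out[vlen] = '\0';
    return 1;
}
```

The indexer would call format_wordchars_header() once when writing the header,
and the search side would try parse_wordchars_header() on each header line it
reads, so swish.conf is not needed at search time.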

The second problem I had was with multiple indexes.  I needed to rewrite
the logic so the parsing of query words was done on a per-index file basis
instead of just at the start of search.c.  This is because different index
files could have different settings used to define "swish" words.
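
To illustrate why the parsing has to happen per index, here is a rough sketch
of a tokenizer that takes the word-defining settings as a parameter.  The
struct index_settings and next_word() are made up for this example; the real
search.c is organized differently:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-index settings; in this sketch they would be read
 * from each index file header. */
struct index_settings {
    const char *wordchars;   /* characters that may appear in a word */
};

/* Test whether c (already lowercased) is a word character. */
static int is_wordchar(const struct index_settings *cfg, int c)
{
    return c != '\0' && strchr(cfg->wordchars, c) != NULL;
}

/* Copy the next word from *cursor into out, lowercased, and advance
 * *cursor past it.  Words longer than outlen - 1 are truncated.
 * Returns 1 if a word was found, 0 at the end of the query. */
int next_word(const char **cursor, const struct index_settings *cfg,
              char *out, size_t outlen)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(outlen >= 2);
    while (*p && !is_wordchar(cfg, tolower((unsigned char)*p)))
        p++;                           /* skip separators */
    while (*p && is_wordchar(cfg, tolower((unsigned char)*p))) {
        if (n + 1 < outlen)
            out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    out[n] = '\0';
    *cursor = p;
    return n > 0;
}
```

With letters only as word characters, "over 5,000 recipes" yields "over" and
"recipes"; with digits and the comma added, the same text yields "over",
"5,000" and "recipes".  The same query string therefore has to be split once
for each index being searched.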

Frankly, the entire search.c parsing has always bugged me.  It's full of
hacks now that look for wild cards, or make exceptions if a meta tag name
is found.  Seems like the query needs to be somehow parsed into a better
syntax tree, but that's way above my skills.

>> I don't think that phrases should span meta fields, either.  It seemed like
>> I could search for "two words" and if "two" was the last word in one meta
>> field, and "words" was the first word in the next field it would find a
>> match.  It shouldn't work that way.
>> 
>
>Yes, you are right. In function parseMetaData the position
>is always set to 1 each time it is called. Perhaps there is a bug.
>Anyway, I will check it.

No, sorry.  I think that's my bug.

But I need some ideas on how to solve this problem:

Say I have three meta fields: "title", "description", and "subject".

I concatenate the three into one field "keywords".  This means I can use
swish to search any single field, or, by using "keywords" I can search all
fields at once (as in my gdb example above).  But that has the problem that
a phrase can span meta fields when searching "keywords".

One ugly solution would be for me to add some non-word when concatenating
into one field so a phrase would never span fields.
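
A sketch of that workaround, done outside swish when building the combined
field.  It assumes the filler is indexed as an ordinary word (so it occupies
a word position) and that nobody ever searches for it; FIELD_BREAK and
join_fields() are invented names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical filler token; assumed to be indexed as a word that
 * nobody searches for. */
#define FIELD_BREAK "zzfieldbreakzz"

/* Join n fields into out, placing FIELD_BREAK between them.
 * Returns 0 on success, -1 if out is too small. */
int join_fields(char *out, size_t outlen,
                const char *const fields[], size_t n)
{
    size_t used = 0;
    size_t i;

    if (outlen == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        /* Every field after the first is preceded by the filler. */
        int w = snprintf(out + used, outlen - used, "%s%s",
                         i ? " " FIELD_BREAK " " : "", fields[i]);
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;                 /* output was truncated */
        used += (size_t)w;
    }
    return 0;
}
```

With this, "database" at the end of the title and "of over" at the start of
the description are no longer adjacent in "keywords", so the phrase cannot
match across the boundary.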

Or, I could change my queries to look like this:

  -w 'title=(database of over) or description=(database of over)
         or subject=(database of over)'

But that ends up being three searches and is a bit slower, especially if the
query is complex (e.g. with wild cards).

I wonder how hard it would be to expand the query syntax so I could say:

  -w 'title,description,subject=(database of over)'

so swish would only have to read the index one time, yet check for the
words or phrase within each meta tagged field.
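
To pin down what the proposed syntax would mean, here is a small sketch that
rewrites the short form into the longer "or" form shown earlier.  This only
illustrates the semantics; it does not give the single pass over the index,
which would need changes inside swish itself.  expand_meta_list() is a
hypothetical helper, not an existing swish-e function:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand "a,b,c=(words)" into "a=(words) or b=(words) or c=(words)".
 * Returns 0 on success, -1 on malformed input or if out is too small. */
int expand_meta_list(const char *query, char *out, size_t outlen)
{
    const char *eq = strchr(query, '=');   /* end of the name list */
    const char *name = query;
    size_t used = 0;

    if (eq == NULL || eq == query || outlen == 0)
        return -1;
    out[0] = '\0';
    while (name < eq) {
        size_t len = strcspn(name, ",=");  /* length of this name */
        int w;

        if (len == 0)
            return -1;                     /* empty meta name */
        /* Emit [" or "] name "=(words)"; eq points at "=(words)". */
        w = snprintf(out + used, outlen - used, "%s%.*s%s",
                     used ? " or " : "", (int)len, name, eq);
        if (w < 0 || (size_t)w >= outlen - used)
            return -1;                     /* output was truncated */
        used += (size_t)w;
        name += len;
        if (*name == ',')
            name++;                        /* step over the comma */
    }
    return 0;
}
```

Given 'title,description,subject=(database of over)', this produces the same
three-clause query as the longer example above.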

Any ideas?

Thanks,



Bill Moseley
mailto:moseley@hank.org
Received on Tue Apr 25 10:41:16 2000