Re: Combining stem/non stem removing dups in perl

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Nov 04 2004 - 15:53:17 GMT
Brad Miele wrote on 11/03/2004 04:59 PM:
> Peter,
> 
> pretty much my approach, my only desire would be to get an accurate total
> hits back. the way it is now, i have to either bring the entire result set
> in, uniq it, get the count, and then loop out the rows that i
> want. which i am fearing for the size, but maybe it isn't as much of
> an issue as i think.

it all depends on the number of results you're dealing with. if you're 
'paging' results (e.g., 1-20, 21-40, etc.) then it's not that big a 
deal, in my experience.

I actually build a hash using SWISH::API and return it from a 
routine, setting things like 'hits' and 'data' and 'query' as hash keys. 
that way I have control over all those values. I snip the 
swishdescription around the query in context, recalculate hits based on 
what I *actually* return, etc. let me know and I'll post an example.
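
A rough sketch of that kind of routine, written in Python rather than Perl 
just to keep it short. It does not call SWISH::API: 'rows' is assumed to be 
a list of results already pulled out of the two indexes, and the key names 
('path', 'rank') and the function name build_results are made up for 
illustration.

```python
def build_results(query, rows, page=1, page_size=20):
    """Dedupe rows by path, then return one page plus the true hit count."""
    seen = set()
    unique = []
    for row in rows:
        # keep the first occurrence of each document; later dups are dropped
        if row["path"] in seen:
            continue
        seen.add(row["path"])
        unique.append(row)

    start = (page - 1) * page_size
    return {
        "query": query,
        "hits": len(unique),  # count after removing duplicates
        "data": unique[start:start + page_size],
    }


rows = [
    {"path": "a.xml", "rank": 1000},  # from the non-stemmed index
    {"path": "b.xml", "rank": 800},
    {"path": "a.xml", "rank": 950},   # same doc again, from the stemmed index
]
result = build_results("text", rows, page=1, page_size=2)
print(result["hits"], [r["path"] for r in result["data"]])
```

The whole result set still has to be walked once to get an accurate 
'hits', but only a set of paths and the rows themselves are held in 
memory, and only one page is handed back to the caller.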


> 
> along these lines, and since you were the sucker, err kind soul who
> responded first, do you know if there is a way to force a meta for every
> record based on config?
> 
> basically, if i am indexing /xmldocs, once with stem and once without, i
> would like to set a sort value of 0 for the non-stemmed and 1 for the
> stemmed, so when i sorted the results, the stemmed would get pushed to the
> end of the set. I am thinking that i can do it as a meta that i ignore in
> one of the confs, but i would rather not have it in the file... possibly
> use a file attribute that i omit from one index?

Not sure that I'm following you exactly. You want to set a value in each 
file that you index, but the value is not IN the file, but in the 
config? No way that I know of.

But -S prog lets you do whatever you want.

Does the example below describe what you want to do? If so, then a simple 
perl script could add the 'stemmed="X"' attribute for you on the fly and 
you could index with -S prog.

karman@topaz08 175% cat c
PropertyNames foo.stemmed
UndefinedXMLAttributes auto
IndexContents XML2 .xml
karman@topaz08 176% cat test.xml
<foo stemmed="1">
   <bar>some text</bar>
</foo>
karman@topaz08 177% cat test2.xml
<foo stemmed="0">
   <bar>some text</bar>
</foo>
karman@topaz08 178% swish-e -c c -i test.xml test2.xml
Indexing Data Source: "File-System"
Indexing "test.xml"
**Adding automatic MetaName 'foo.stemmed' found in file 'test.xml'
Indexing "test2.xml"
.....

karman@topaz08 179% swish-e -w text -s foo.stemmed
# SWISH format: 2.4.1
# Search words: text
# Removed stopwords:
# Number of hits: 2
# Search time: 0.101 seconds
# Run time: 0.145 seconds
1000 test2.xml "test2.xml" 48
1000 test.xml "test.xml" 48
.
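
The on-the-fly script mentioned above could look roughly like this. It is 
sketched in Python instead of Perl, and the function names and 
command-line arguments are made up for illustration. It relies on the -S 
prog convention of writing a small header block (Path-Name, 
Content-Length, optionally Last-Mtime), a blank line, and then the 
document body for each file. It assumes the attribute is not already 
present on the root element.

```python
import os
import re
import sys


def add_stemmed(xml_text, value):
    """Insert stemmed="value" into the first element's opening tag."""
    # '<?xml ...?>' and '<!-- -->' are skipped because the character after
    # '<' must be a letter or underscore for the pattern to match
    return re.sub(r"<([A-Za-z_][\w.-]*)", rf'<\1 stemmed="{value}"',
                  xml_text, count=1)


def emit(path, value, out):
    """Write one document as headers, a blank line, then the body."""
    with open(path, encoding="utf-8") as fh:
        body = add_stemmed(fh.read(), value).encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Last-Mtime: {int(os.path.getmtime(path))}\n"
        "\n"
    )
    out.write(header.encode("utf-8"))
    out.write(body)


if __name__ == "__main__" and len(sys.argv) > 2:
    # first argument is the stemmed value (0 or 1), the rest are XML files
    value, paths = sys.argv[1], sys.argv[2:]
    for p in paths:
        emit(p, value, sys.stdout.buffer)
```

Run it with 0 for the non-stemmed index and 1 for the stemmed one, keeping 
the same PropertyNames line in both configs. The files on disk are never 
touched. How the arguments reach the script depends on your setup: as far 
as I know, the SwishProgParameters directive or piping the output into 
swish-e with -S prog -i stdin should both work, but check the docs for 
your version.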


-- 
Peter Karman . http://www.cray.com/craydoc/ . karman(at)not-real.cray.com
"I love deadlines. I love the whooshing sound they make as they go by."
         - Douglas Adams
Received on Thu Nov 4 07:53:18 2004