Re: Combining stem/non stem removing dups in perl

From: Brad Miele <brad(at)not-real.auroraquanta.com>
Date: Thu Nov 04 2004 - 17:09:32 GMT
Ahhh, that makes sense. I will just have to play with the sizes of the
result sets to see if there is any issue. Just out of curiosity, how many
properties do you return for display? I am returning some caption and other
text values, which is part of what scared me. I didn't know if a result set
of, say, 10,000 records (stemmed and non-stemmed combined) would be
cumbersome to manipulate.
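
A minimal sketch of the merge step discussed in this thread: combine the
non-stemmed and stemmed result lists, drop duplicates, take the hit count
from what is actually left, and slice out one page. It is written in Python
only to keep it self-contained; the record layout (a dict with a "path" key)
and the function names are assumptions for illustration, not anything from
SWISH::API.

```python
def merge_results(exact, stemmed):
    """Concatenate both lists, keeping the first record seen for each path.

    Non-stemmed hits come first, so stemmed-only hits end up at the tail.
    """
    seen = set()
    merged = []
    for record in list(exact) + list(stemmed):
        path = record["path"]  # hypothetical key identifying a document
        if path in seen:
            continue
        seen.add(path)
        merged.append(record)
    return merged


def page_results(exact, stemmed, start=0, size=20, query=""):
    """Return a dict with the query, the de-duplicated hit count, and one page."""
    merged = merge_results(exact, stemmed)
    return {
        "query": query,
        "hits": len(merged),  # count after removing duplicates
        "data": merged[start:start + size],
    }
```

The set lookup keeps the de-duplication linear in the number of records, so
a combined set of around 10,000 small records should not be a problem to
hold in memory, though that depends on how much text each record carries.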

And I came to the conclusion that I should switch back to -S prog as you
mention. I had moved away from it when streaming from a db select took too
long, but a filesystem middleman will probably be fine.
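
A rough sketch of what such a -S prog feeder could look like: it reads XML
files from disk, adds a stemmed="X" attribute to the root element on the
fly, and writes each document with the Path-Name / Content-Length /
Document-Type headers that, as far as I understand the swish-e docs, the
prog input method expects. It is in Python rather than perl to keep it
self-contained; the function names and the regex-based root-tag match are
assumptions for illustration (a real XML parser would be more robust).

```python
import re
import sys

# First tag that starts with a name character; skips <?xml ...?> and <!-- -->.
# Naive on purpose: a comment containing "<name" before the root would fool it.
ROOT_TAG = re.compile(r"<([A-Za-z_][\w.\-]*)")


def add_stemmed_attr(xml_text, value):
    """Insert stemmed="value" into the first element's opening tag."""
    return ROOT_TAG.sub(
        lambda m: '<%s stemmed="%s"' % (m.group(1), value), xml_text, count=1
    )


def prog_record(path, xml_text, value):
    """Build one document record: headers, blank line, then the content."""
    body = add_stemmed_attr(xml_text, value).encode("utf-8")
    header = "Path-Name: %s\nContent-Length: %d\nDocument-Type: XML2\n\n" % (
        path,
        len(body),  # length in bytes, not characters
    )
    return header.encode("utf-8") + body


# Hypothetical command line: feeder.py <0|1> file1.xml file2.xml ...
if __name__ == "__main__" and len(sys.argv) > 2:
    for name in sys.argv[2:]:
        with open(name, encoding="utf-8") as handle:
            sys.stdout.buffer.write(prog_record(name, handle.read(), sys.argv[1]))
```

Run once with 0 for the non-stemmed index and once with 1 for the stemmed
one, so the files on disk never carry the attribute themselves.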

Thanks for all of the help.

Brad
------------------------------------------------------------
 Brad Miele
 Technology Director
 AuroraPhotos.com
 (207) 828-8787 x110
 bmiele@auroraphotos.com

 A mathematician is a machine for converting coffee into theorems.


On Thu, 4 Nov 2004, Peter Karman wrote:

>
>
> Brad Miele wrote on 11/03/2004 04:59 PM:
> > Peter,
> >
> > pretty much my approach, my only desire would be to get an accurate total
> > hits back. the way it is now, i have to bring the entire result set
> > in, uniq it, get the count, and then loop out the rows that i
> > want, which i am fearing because of the size, but maybe it isn't as much of
> > an issue as i think.
>
> it all depends on the number of results you're dealing with. if you're
> 'paging' results (e.g., 1-20, 21-40, etc.) then it's not that big a
> deal, in my experience.
>
> I actually build a hash using the swish::api and return it from a
> routine, setting things like 'hits' and 'data' and 'query' as hash keys.
> that way I have control over all those values. I snip the
> swishdescription around the query in context, recalculate hits based on
> what I *actually* return, etc. let me know and I'll post an example.
>
>
> >
> > along these lines, and since you were the sucker, err kind soul who
> > responded first, do you know if there is a way to force a meta for every
> > record based on config?
> >
> > basically, if i am indexing /xmldocs, once with stem and once without, i
> > would like to set a sort value of 0 for the non-stemmed and 1 for the
> > stemmed, so when i sorted the results, the stemmed would get pushed to the
> > end of the set. I am thinking that i can do it as a meta that i ignore in
> > one of the confs, but i would rather not have it in the file... possibly
> > use a file attribute that i omit from one index?
>
> Not sure that I'm following you exactly. You want to set a value in each
> file that you index, but the value is not IN the file, but in the
> config? No way that I know of.
>
> But -S prog lets you do whatever you want.
>
> Does this describe what you want to do? If so, then a little easy perl
> script could add the 'stemmed="X"' attribute for you on the fly and you
> could index with -S prog.
>
> karman@topaz08 175% cat c
> PropertyNames foo.stemmed
> UndefinedXMLAttributes auto
> IndexContents XML2 .xml
> karman@topaz08 176% cat test.xml
> <foo stemmed="1">
>    <bar>some text</bar>
> </foo>
> karman@topaz08 177% cat test2.xml
> <foo stemmed="0">
>    <bar>some text</bar>
> </foo>
> karman@topaz08 178% swish-e -c c -i test.xml test2.xml
> Indexing Data Source: "File-System"
> Indexing "test.xml"
> **Adding automatic MetaName 'foo.stemmed' found in file 'test.xml'
> Indexing "test2.xml"
> ......
>
> karman@topaz08 179% swish-e -w text -s foo.stemmed
> # SWISH format: 2.4.1
> # Search words: text
> # Removed stopwords:
> # Number of hits: 2
> # Search time: 0.101 seconds
> # Run time: 0.145 seconds
> 1000 test2.xml "test2.xml" 48
> 1000 test.xml "test.xml" 48
> .
>
>
> --
> Peter Karman . http://www.cray.com/craydoc/ . karman(at)not-real.cray.com
> "I love deadlines. I love the whooshing sound they make as they go by."
>          - Douglas Adams
>
Received on Thu Nov 4 09:09:33 2004