RE: Using Swish's Query Parser (to pre-filter a collection of documents)

From: Masoud Pirnazar <amp834(at)not-real.rqinc.com>
Date: Thu Nov 18 2004 - 20:35:35 GMT
i think you have the correct picture, but here's another attempt:

A:(a bunch of documents, say 500,000 docs)  |  B:(initial pre-filtering,
qualifying say 40,000 docs)  | C:(index the 40,000 qualified docs) |
D:(allow users to search the 40,000 qualified docs)

(using the pipe sign | here to indicate the flow of data/different stages of
processing)

the end user specifies the criteria in steps B and D.  it would be easier
for the end user to use the same query syntax in both steps.  at step B, it
filters out a lot of unwanted documents.  at step D, they are searching
using other criteria, so the query changes.

a typical application:  from the 500,000 docs, i want to extract only the
40,000 docs that mention some kind of sport activity, then put those in the
"sports collection" and allow end users to search the sports collection
using whatever (unrelated) queries they want to use.

i think i would be re-creating the parser and also writing a stream-oriented
search routine, but would rather avoid doing all this if there is another
way of solving this problem.
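
One way to avoid re-creating the parser is the temporary-index approach quoted further down in this thread: index each document by itself, run the filter query against that one-document index, and keep the document only if it hits. The following is a rough shell sketch of that loop, not a tested recipe. The function names mirror the quoted perl example; the temp index path and the rule for recognising "no results" output are assumptions to check against your swish-e version.

```shell
# Hypothetical helpers; names and the temp index path are illustrative only.
TMP_INDEX="${TMPDIR:-/tmp}/prefilter.$$.index"

# Index a single document into the temporary index (replacing the previous one).
indexdoc() {
    swish-e -i "$1" -f "$TMP_INDEX" </dev/null >/dev/null 2>&1
}

# Succeed if the query matches the temporary index.
# Assumption: a hit is any output line left after dropping "#" headers,
# "err:" messages, and the lone "." end-of-results marker.
searchdoc() {
    swish-e -f "$TMP_INDEX" -w "$1" -x '<swishdocpath>\n' </dev/null 2>/dev/null |
        grep -v -e '^#' -e '^err:' -e '^\.$' | grep -q .
}

# Read document paths on stdin; print only the paths that match the query.
filter_docs() {
    query=$1
    while IFS= read -r doc; do
        indexdoc "$doc" && searchdoc "$query" && printf '%s\n' "$doc"
    done
    # Remove the temporary index and its property file when done.
    rm -f "$TMP_INDEX" "$TMP_INDEX.prop"
}

# Usage sketch (all.list holds one path per line; the query is a placeholder):
#   filter_docs 'football or tennis' < all.list > sports.list
```

This runs swish-e twice per document, so it trades disk space for process overhead; whether that beats indexing everything once depends on the collection.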


-----Original Message-----
From: Peter Karman [mailto:karpet@peknet.com]
Sent: Thursday, November 18, 2004 2:37 PM
To: Masoud Pirnazar; Multiple recipients of list
Subject: Re: [SWISH-E] Using Swish's Query Parser



Masoud Pirnazar wrote on 11/18/2004 09:27 AM:
> the "run a swish-format query" part is the part that's undefined right
> now--i'd like to keep the same syntax as swish, since once the documents
> pass through the initial query filter, they will be added to a collection,
> and the user will use the swish syntax to search this collection later.
>
> the simplest plan is:  extract and index everything, run the filter query,
> extract just the qualifying documents again and add them to the main
> collection.  this is a little disk-space expensive (timewise, it will
> probably be ok).
>


What I'm not clear on is why you need to form a swish-e query that you
don't intend to use with swish-e.

It sounds like you're doing two indexing passes: once for everything,
then another for docs that match a certain query.

i.e.

swish-e -i /path/to/docs
swish-e -w myquery -x '<swishdocpath>\n' > list

and then use that list to make another index.

It would be nice if the IndexDir config option would take a file as an
argument, because then you could do:

config:
IndexDir list

and

swish-e -c config
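
Until IndexDir can read a file of paths, one possible workaround is to generate the config from the result list instead. A minimal shell sketch, assuming the paths contain no whitespace and that IndexDir may appear more than once in a config file (worth checking against your swish-e version); list_to_config and the index file name are made-up names:

```shell
# Hypothetical helper: read swish-e result output (one path per line) on stdin
# and write one "IndexDir <path>" config line per document.
# Assumes paths contain no whitespace.
list_to_config() {
    # Drop "#" header lines, the lone "." end-of-results marker, and blank lines.
    grep -v -e '^#' -e '^\.$' -e '^[[:space:]]*$' |
    while IFS= read -r path; do
        printf 'IndexDir %s\n' "$path"
    done
}

# Usage sketch (sports.index is a placeholder name):
#   swish-e -i /path/to/docs
#   swish-e -w myquery -x '<swishdocpath>\n' | list_to_config > config
#   swish-e -c config -f sports.index
```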

 > ideally, if there were routines in swish such as
 > CompileQuery(strQuery) return CompiledQuery
 > TextMatchesQuery(strText, CompiledQuery) returns true/false (or info about
 > the matches)
 >
 > then i wouldn't have to re-create a parser for the query syntax.
 >

The SWISH::API ParsedWords() function will return the query as swish
parsed it. Is that what you need?

http://www.swish-e.org/current/docs/API.html



> (by the way, i'm not sure if i should respond to your email address or send
> my response to the listproc again.  can you let me know?)
>

the list. that way Q&A are searchable.


> thanks
>
> -----Original Message-----
> From: Peter Karman [mailto:karpet@peknet.com]
> Sent: Wednesday, November 17, 2004 11:47 PM
> To: amp834@rqinc.com
> Cc: Multiple recipients of list
> Subject: Re: [SWISH-E] Using Swish's Query Parser
>
>
> you don't say how you plan to "extract the text" of your potential
> document, or how you will "run a swish-format query" on the text.
>
> It wouldn't be very efficient, I wouldn't think, but you might just
> index the doc with swish-e and then search that temp index. swish-e is
> just as fast as anything else at "extracting text" and running the
> query. then you could simply delete the index (or repeat for each new
> doc, effectively using the same tmp index name).
>
> example perl off the top of my head:
>
> my $query = 'foo bar';
> my %include = ();
> for my $doc (@listofdocs) {
> 	indexdoc( $doc );
> 	if ( searchdoc( $query ) ) {
> 		$include{$doc}++;
> 	}
> }
>
> where indexdoc() and searchdoc() are functions that create your tmp
> index and then search it. you might define a special index name to use
> in your code, then remove it at end.
>
> Masoud Pirnazar wrote on 11/17/04 9:56 PM:
>
>
>>I have used Swish to index and search document collections, and now want to
>>"filter" documents before indexing using the same query syntax, i.e.
>>
>>Given a document, I will extract its text and want to run a swish-format
>>query on the text to see if it matches the query criteria; if it does, I
>>will add it to my collection.
>>
>>The simplest method is to add everything to a collection and do a swish
>>search on the collection, but I'm looking for a more efficient method,
>>especially if the hit percentage is small.
>>
>>Can anyone suggest anything?
>>I looked at the parse_swish_query and tokenize_query_string functions, but
>>it gets too complicated quickly.
>>
>>Thanks in advance for any ideas and comments.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Peter Karman peter@peknet.com 651.208.6116
Received on Thu Nov 18 12:35:40 2004