On Wed, Jul 07, 2004 at 12:15:44PM -0700, Tac wrote:
> Bill asked why I thought indexing should be callable (like searching),
> rather than through a command line program. Here are my reasons:
>
> (1) I like having lots of control over the process. We're indexing millions
> of xml documents, and I like to have a better sense of where things are at,
> rather than just firing up the program and waiting.
Well, I guess I'd need to see an API first, as I can't picture how
that would be used. It would require a big rewrite of the indexing
part of swish, since much of the code (like the config stuff) is very
much the same as it was in 1997.
> Now, both of those are more philosophical, the real reason I want to be able
> to index from within a perl (or other) script is so that I can index on the
> fly.
I'm not sure I understand. You mean index files individually?
> We have about 6 million documents, each document has between 1 and 500
> pages.
Not your typical web site of a few hundred pages. ;)
> swish-e indexes the documents, but when displaying them I only want
> to display the appropriate pages (so if you search for a word that shows up
> on page 26, we display a fragment of page 26 and a link to the image. I
> should mention that all our documents are OCR of images).
Do you mean like at:
http://swish-e.org/current/docs/searchdoc.html
When indexing, my -S prog script splits the documentation up into chunks
and indexes them separately. That way searches are more specific.
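
To make the chunking idea concrete, here is a rough sketch of a -S prog
style feeder that emits one record per page instead of one per document.
It is not the script I use for the docs. It assumes pages are separated
by a form feed and that a "#page=N" suffix on Path-Name is an acceptable
way to identify a page; both are made up for illustration, as are the
function names. The record layout (Path-Name and Content-Length headers,
a blank line, then the content) is the -S prog input format as I
understand it, so check it against the docs before relying on it.

```python
import sys


def split_pages(text, separator="\f"):
    """Yield (page_number, page_text) for each non-empty page.

    The form-feed separator is an assumption about the OCR output.
    """
    for number, page in enumerate(text.split(separator), start=1):
        if page.strip():
            yield number, page


def format_record(path, content):
    """Build one record: headers, a blank line, then the body."""
    body = content.encode("utf-8")
    # Content-Length counts bytes, not characters.
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


def page_records(doc_path, text):
    """Yield one record per page, named with a hypothetical #page=N suffix."""
    for number, page in split_pages(text):
        yield format_record(f"{doc_path}#page={number}", page)


if __name__ == "__main__":
    # Each command-line argument is a document to split and emit.
    for doc_path in sys.argv[1:]:
        with open(doc_path, encoding="utf-8") as handle:
            for record in page_records(doc_path, handle.read()):
                sys.stdout.buffer.write(record)
```

Each page then comes back as its own hit, so the result's path tells you
which page matched without any further searching at display time.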
> So what I'd like to do is pass the page data (the OCR) and index it, then
> just search the individual pages. Since we'd be doing this for every
> document on the fly (and we often display 10 or more documents per page), it
> would involve a lot of resources. Fortunately, swish-e is incredibly fast.
> But I don't want to pay for the overhead of calling system() each time.
Calling system() for what?
> I'd also like to capture information about word counts and such, without
> having to parse the results of the index command line call.
I think most of that data is available in the C/Perl API.
You are making some big indexes. Report back on your findings.
--
Bill Moseley
moseley@hank.org
Received on Wed Jul 7 13:22:06 2004