Re: indexing files with multiple MIME parts

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 17 2004 - 15:00:37 GMT

On Tue, Aug 17, 2004 at 07:39:32AM -0700, Andy Jacobson wrote:
> Path-Name: ./1964
> Content-Length: 1001
> Last-Mtime: 1092715695
> Document-Type: TXT*
> 
> text text text ...
> 
> Path-Name: ./1964
> Content-Length: 49099
> Last-Mtime: 1092753193
> Document-Type: HTML*
> 
> <html> ..... </html>
> 
>        Can swish-e handle this?  Two separate inputs for the same
>        file?

Yes, the Path-Name (swishdocpath) is just another property and there's
nothing in swish that requires properties to be unique.  Each
file/record, though, will have a unique file number.

What will happen is that a search may find both records and you will
see swish report the same "file name" twice.
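
If the duplicate paths are a nuisance, the search front end can collapse
them. Below is a minimal Python sketch of that idea. It assumes the
results have already been parsed into (rank, path) tuples; that shape
and the function name collapse_by_path are hypothetical and are not
swish-e's own output format or API.

```python
def collapse_by_path(hits):
    """Keep only the best-ranked hit for each path.

    `hits` is a list of (rank, path) tuples. This shape is an assumption
    made for the sketch, not something swish-e produces directly.
    """
    best = {}
    for rank, path in hits:
        # Remember the highest rank seen so far for this path.
        if path not in best or rank > best[path]:
            best[path] = rank
    # Highest rank first; ties are broken by path name.
    return sorted(
        ((rank, path) for path, rank in best.items()),
        key=lambda hit: (-hit[0], hit[1]),
    )


# Two records share the path ./1964 (for example a text part and an
# HTML part of the same message).
hits = [(1000, "./1964"), (820, "./1964"), (640, "./1971")]
print(collapse_by_path(hits))
```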


>        Can those outputs be of different content types?

Sure, that only determines the kind of parser used -- once the words
are in the index, swish doesn't know where they came from (html/txt/pdf).
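
As a rough illustration of the two answers above, here is a Python
sketch of a feeder that emits two records for the same path with
different document types. The header names and the TXT*/HTML* values
are the ones from the quoted example; the function name format_record
and the sample bodies are made up for the sketch. It assumes
Content-Length should be the byte length of the body that follows the
blank line.

```python
def format_record(path, body, mtime, doc_type):
    """Build one record: headers, a blank line, then the body.

    Returns bytes so that Content-Length matches what is actually sent.
    """
    data = body.encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(data)}\n"  # bytes, not characters
        f"Last-Mtime: {mtime}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    )
    return header.encode("utf-8") + data


# Two MIME parts of the same message, indexed under the same Path-Name.
parts = [
    ("./1964", "text text text ...\n", 1092715695, "TXT*"),
    ("./1964", "<html> ..... </html>\n", 1092753193, "HTML*"),
]

for path, body, mtime, doc_type in parts:
    record = format_record(path, body, mtime, doc_type)
    # Printed as text here for readability; a real feeder would more
    # likely write the raw bytes to sys.stdout.buffer.
    print(record.decode("utf-8"), end="")
```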


>        I suppose the alternative is to attempt to convert everything
>        to text/plain, combine content lengths, and feed swish-e just
>        one input per file.

Yes, if you want to make sure that there's only one file number.  You
can also place the text part in <pre>, or rewrite the catdoc and pdf
filters to produce text and place their output in a separate metatag
like <attachment> so the attachments can be searched separately.
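
A minimal Python sketch of that single-record approach follows. It
assumes the attachments have already been converted to plain text (for
example by the catdoc and pdf filters); the function name combine_parts
is hypothetical, and the swish-e configuration needed to treat
<attachment> as a metaname is not shown.

```python
from html import escape


def combine_parts(body_text, attachment_texts):
    """Wrap a text body and converted attachments in one HTML document.

    The body goes in <pre>; each attachment goes in its own <attachment>
    element so it could be searched separately.
    """
    pieces = ["<html><body>", "<pre>%s</pre>" % escape(body_text)]
    for text in attachment_texts:
        # Escape so stray < or & in the text cannot break the markup.
        pieces.append("<attachment>%s</attachment>" % escape(text))
    pieces.append("</body></html>")
    return "\n".join(pieces)


doc = combine_parts("text text text ...", ["words from a converted PDF"])
print(doc)
```

The result would then be fed as one HTML record, so the message gets a
single file number.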

Are you doing this with mail archives or with incoming mail?  I've
wanted to index my incoming mail for a while, but have never gotten
around to it.  I want both a web-based and a command-line interface for
searching messages, and the option to tag search results and drop them
into a new Maildir folder.  There are some tools that do this already
(http://www.rrbcurnow.freeuk.com/mairix/ is one --
http://lurker.sourceforge.net/ is another) but they, for some reason
beyond me, don't use swish.

-- 
Bill Moseley
moseley@hank.org
Received on Tue Aug 17 08:00:51 2004