indexing files with multiple MIME parts

From: Andy Jacobson <andyj(at)>
Date: Tue Aug 17 2004 - 14:41:58 GMT

        I've been using swish-e to index my email for some time now,
        but without paying any attention to the binary MIME
        attachments that the messages contain.  Now I would like to
        index the MS-Word and PDF attachments as well.

        I've written a perl script that uses MIME attachment
        processing code from CPAN to extract the attachments and hand
        them off to SWISH::Filter for filtering.  So far, so good; all
        the parts are processed properly and swish-e inputs are
        generated.

        However, each MIME message will produce at least two
        parts, with different content types.  The email text will be
        text/plain, but perhaps the filtered PDF will be HTML.  So one
        email would produce multiple outputs, something like:

Path-Name: ./1964
Content-Length: 1001
Last-Mtime: 1092715695
Document-Type: TXT*

text text text ...

Path-Name: ./1964
Content-Length: 49099
Last-Mtime: 1092753193
Document-Type: HTML*

<html> ..... </html>
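
For illustration, here is a minimal sketch of how per-part records like the two above might be generated. It is written in Python rather than the Perl script described, and it is not that script: the names `DOC_TYPES`, `format_record` and `records_for_message` are hypothetical, the header layout simply mirrors the example, and binary attachments (which would first go through a filter such as SWISH::Filter) are skipped.

```python
import email
from email import policy

# Hypothetical mapping from MIME content type to a Document-Type value,
# taken from the two values shown in the example output.
DOC_TYPES = {"text/plain": "TXT*", "text/html": "HTML*"}


def format_record(path, body, mtime, doc_type):
    """Build one record: headers, a blank line, then the body bytes."""
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length of the body in bytes
        f"Last-Mtime: {mtime}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    )
    return header.encode("utf-8") + body


def records_for_message(path, raw_message, mtime):
    """Return one record per text/plain or text/html leaf part."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    records = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        doc_type = DOC_TYPES.get(part.get_content_type())
        if doc_type is None:
            continue  # e.g. PDF or MS-Word: would need filtering first
        body = part.get_payload(decode=True) or b""
        records.append(format_record(path, body, mtime, doc_type))
    return records
```

Every record carries the same Path-Name, which is exactly the situation the question below asks about.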

       Can swish-e handle this?  Two separate inputs for the same
       file?  Can those outputs be of different content types?  I
       suppose the alternative is to attempt to convert everything to
       text/plain, combine content lengths, and feed swish-e just one
       input per file.
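
The single-record alternative could be sketched roughly as follows, again in Python and again only as an illustration. The helpers `html_to_text` and `combined_record` are hypothetical names, and the tag stripping is deliberately crude. One assumption worth flagging: the sketch recomputes Content-Length from the combined body instead of adding up the original lengths, since the byte count changes once HTML is reduced to text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the character data of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Crude HTML-to-text conversion: drop tags, keep text nodes."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def combined_record(path, parts, mtime):
    """Merge (content_type, text) pairs into one text/plain record."""
    texts = [
        html_to_text(text) if ctype == "text/html" else text
        for ctype, text in parts
    ]
    body = "\n".join(texts).encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # recomputed, not summed
        f"Last-Mtime: {mtime}\n"
        "Document-Type: TXT*\n"
        "\n"
    )
    return header.encode("utf-8") + body
```

With this approach each file yields exactly one input, at the cost of losing any HTML structure the filtered attachment had.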


Andy Jacobson

Program in Atmospheric and Oceanic Sciences
Sayre Hall, Forrestal Campus
Princeton University
PO Box CN710 Princeton, NJ 08544-0710 USA

Tel: 609/258-5260  Fax: 609/258-2850
Received on Tue Aug 17 07:42:18 2004