Re: MIME Types of zipped files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Nov 09 2005 - 14:43:02 GMT
On Wed, Nov 09, 2005 at 05:59:34AM -0500, Lars D. Noodén wrote:
> On Fri, 4 Nov 2005, Bill Moseley wrote:
> >... The filters register regular expressions of MIME types they handle.
> >If the incoming document matches the filter's MIME type then the filter
> >is passed the incoming content.
> 
> Zipped or otherwise compressed files all have the same apparent MIME type 
> until they are actually unzipped.  How is that best resolved in Filter.pm?

$ file test.sxw
test.sxw: Zip archive data, at least v2.0 to extract


$ perl -MMIME::Types -wle 'print MIME::Types->new->mimeTypeOf("test.sxw")'
application/vnd.sun.xml.writer

(MIME::Types looks only at the file name extension, so the zip container
does not get in the way.)


> 
> >... One thing the filters are not set up to do is to take a single file
> >(like a tar or zip file) and then index its contents as separate files.
> >It should do that, but it doesn't.
> 
> Ok.  That's what I'm trying to set up.  Files in OpenDocument format are 
> really a collection of files and a few directories.  The three files 
> relevant for indexing are mimetype, content.xml and meta.xml, but they 
> cannot be seen until the whole thing is processed with unzip.

But that's different from a zip or tar archive that contains multiple
documents.  This is a single document made up of different parts, so
a filter can handle it.

The filter would register for 'application/vnd.sun.xml.writer', use a
Perl module to extract the content from the content.xml part, and then
merge in the tags from meta.xml using an XML processing module.  The
result is the output from the filter.

I'd take a look at a few things on CPAN:

    http://search.cpan.org/~jmgdoc/OpenOffice-OODoc-2.014/OODoc/Intro.pod
    http://search.cpan.org/~jmgdoc/OpenOffice-OODoc-2.014/

But you may just need:

    http://search.cpan.org/~smpeters/Archive-Zip-1.16/lib/Archive/Zip.pod

    use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
    my $zip = Archive::Zip->new( 'test.sxw' );
    # scalar(): in list context contents() also returns a status code
    print scalar( $zip->contents( 'content.xml' ) );

Of course, you would do error checking, which I left out.
That would be more portable than, say:

    unzip -c test.sxw content.xml
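
For what it's worth, here is a rough sketch of the Archive::Zip version
with the error checking put back in.  read_member() is just a name made
up for this example; the Archive::Zip calls (read, memberNamed, contents)
and the AZ_OK constant are the module's own.

```perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# Return the uncompressed data of one member of a zip-based document.
# read_member() is a hypothetical helper, not part of Archive::Zip.
sub read_member {
    my ( $file, $name ) = @_;

    my $zip = Archive::Zip->new();
    $zip->read($file) == AZ_OK
        or die "Cannot read '$file' as a zip archive\n";

    my $member = $zip->memberNamed($name)
        or die "'$file' has no member named '$name'\n";

    # In list context contents() returns ( $data, $status ).
    my ( $data, $status ) = $member->contents();
    defined $status && $status == AZ_OK
        or die "Failed to extract '$name' from '$file'\n";

    return $data;
}
```

Usage would then be something like:

    print read_member( 'test.sxw', 'content.xml' );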

And then use some XML processing module to merge the meta.xml and
content.xml into a single XML file.  Perhaps:

    http://search.cpan.org/~msergeant/XML-XPath-1.13/XPath.pm
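
As a rough idea of what that merge could look like, here is a sketch.
Note that it uses XML::Parser (the expat wrapper) rather than XML::XPath,
only because its event handlers keep the example short.  The helper
names and the <doc> and <body> output elements are made up for
illustration, and the body text is assumed to have been extracted from
content.xml already.

```perl
use strict;
use warnings;
use XML::Parser;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Map each element in meta.xml that holds text to that text, keyed by
# the element name as written in the file (for example "dc:date").
sub meta_fields {
    my ($meta_xml) = @_;
    my ( %field, @open );

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { push @open, $_[1] },
            End   => sub { pop @open },
            Char  => sub { $field{ $open[-1] } .= $_[1] },
        },
    );
    $parser->parse($meta_xml);

    # Drop elements that only held whitespace between child elements.
    for my $name ( keys %field ) {
        $field{$name} =~ s/^\s+|\s+$//g;
        delete $field{$name} if $field{$name} eq '';
    }
    return %field;
}

# Build one XML document from the meta fields and the extracted text.
# The result is a character string; encode it before writing it out.
sub merged_xml {
    my ( $meta_xml, $body_text ) = @_;
    my %field = meta_fields($meta_xml);

    my $out = qq{<?xml version="1.0" encoding="utf-8"?>\n<doc>\n};
    for my $name ( sort keys %field ) {
        ( my $tag = $name ) =~ s/^.*://;    # "dc:date" becomes "date"
        $out .= "  <$tag>" . xml_escape( $field{$name} ) . "</$tag>\n";
    }
    $out .= "  <body>" . xml_escape($body_text) . "</body>\n</doc>\n";
    return $out;
}
```

Dropping the namespace prefix from the tag names is a shortcut that
would clash if two prefixes shared a local name, so treat it as a
starting point only.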

I think you will want to extract the text content.  Maybe that
OODoc Perl module would help with that (I have never used that
module).  I'm not sure how you would know which <text:> tags to
extract in the general case.  Maybe in the OO format the content of
those tags is always the document content?
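
One way to sidestep the question of which tags to pick is to take all
character data inside <office:body>.  The sketch below does that with
XML::Parser (again, not the XPath module mentioned above).  body_text()
is a made-up name, and treating text:p and text:h as the elements that
end a line is an assumption about the format, not something I have
checked against the spec.

```perl
use strict;
use warnings;
use XML::Parser;

# body_text() is a hypothetical helper.  It returns the character data
# found inside <office:body>, with a newline added after each paragraph
# (text:p) and heading (text:h).
sub body_text {
    my ($content_xml) = @_;
    my $text    = '';
    my $in_body = 0;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { $in_body++ if $_[1] eq 'office:body' },
            End   => sub {
                my ( $expat, $element ) = @_;
                $in_body-- if $element eq 'office:body';
                $text .= "\n" if $in_body && $element =~ /^text:[ph]$/;
            },
            Char => sub { $text .= $_[1] if $in_body },
        },
    );
    $parser->parse($content_xml);
    return $text;
}
```

Element names are matched by the prefix as written in the file, which
works as long as the document uses the usual office: and text: prefixes.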

The same goes for the meta.xml file.  You may also want to convert the
ISO dates to Unix timestamps in some cases.
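
The date conversion needs only the core Time::Local module.  A minimal
sketch follows; iso_to_epoch() is a made-up name.  The dates in meta.xml
carry no time zone, so reading them as UTC here is an assumption.

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

# iso_to_epoch() is a hypothetical helper.  It converts a timestamp such
# as "2005-08-25T20:18:59" to Unix time, treating it as UTC.  Swap in
# timelocal() if the value should be read as local time instead.
# Returns nothing if the string does not look like an ISO timestamp.
sub iso_to_epoch {
    my ($iso) = @_;
    my ( $year, $mon, $mday, $hour, $min, $sec ) =
        $iso =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)/
        or return;

    # timegm() expects a zero-based month.
    return timegm( $sec, $min, $hour, $mday, $mon - 1, $year );
}
```

Usage would be along the lines of:

    my $epoch = iso_to_epoch('2005-08-25T20:18:59');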

Keep in mind that OO files are UTF-8 and swish-e only handles 8-bit
characters, so you won't be able to index all characters.
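
If the filter has to hand over 8-bit text, the core Encode module can do
the conversion.  This sketch assumes ISO-8859-1 is the target encoding,
which is my assumption and not something swish-e requires of a filter.
to_latin1() is a made-up name.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# to_latin1() is a hypothetical helper.  It takes UTF-8 bytes (as read
# from content.xml) and returns ISO-8859-1 bytes.  Characters with no
# Latin-1 equivalent are replaced by "?".
sub to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode( 'UTF-8', $utf8_bytes );
    return encode( 'iso-8859-1', $chars, Encode::FB_DEFAULT );
}
```

Text that has already been through an XML parser is usually a decoded
character string, in which case only the encode() step applies.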

Finally, the "tidy" program might be useful when working on this:

$ tidy -xml meta.xml

line 1 column 1 - Access: [3.2.1.1]: <doctype> missing.
line 1 column 1 - Access: [3.3.1.1]: use style sheets to control presentation.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE 
office:document-meta PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
"office.dtd">
<office:document-meta xmlns:office="http://openoffice.org/2000/office"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:meta="http://openoffice.org/2000/meta" office:version="1.0">
    <office:meta>
        <meta:generator>OpenOffice.org 1.1.4 (Linux)</meta:generator>
        <!--645(Build:8824)-->
        <meta:creation-date>2005-08-25T20:18:59</meta:creation-date>
        <dc:date>2005-08-25T21:51:23</dc:date>
        <meta:print-date>2005-08-25T21:50:24</meta:print-date>
        <dc:language>en-US</dc:language>
        <meta:editing-cycles>3</meta:editing-cycles>
        <meta:editing-duration>PT1H32M27S</meta:editing-duration>
        <meta:user-defined meta:name="Info 1" />
        <meta:user-defined meta:name="Info 2" />
        <meta:user-defined meta:name="Info 3" />
        <meta:user-defined meta:name="Info 4" />
        <meta:document-statistic meta:table-count="0" meta:image-count="0"
        meta:object-count="0" meta:page-count="1"
        meta:paragraph-count="21" meta:word-count="463"
        meta:character-count="2476" />
    </office:meta>
</office:document-meta>




-- 
Bill Moseley
moseley@hank.org

Received on Wed Nov 9 06:43:03 2005