Re: Draft of OpenDocument filter

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Nov 16 2005 - 14:27:22 GMT
On Tue, Nov 15, 2005 at 11:49:59PM -0800, Lars D. Noodén wrote:
> I have made a draft of a Swish-e filter for OpenDocument format:
> 
>  	http://www-personal.umich.edu/~lars/Swish-e/ODF2xml.pm
> 
> Don't laugh or cry at the coding.  It does the all-important function: 
> looking like it works -- mostly.

Looks like a good start.  Here are a few comments.

On my machine I have these mimetypes:

    application/vnd.sun.xml.calc                        sxc
    application/vnd.sun.xml.calc.template               stc
    application/vnd.sun.xml.draw                        sxd
    application/vnd.sun.xml.draw.template               std
    application/vnd.sun.xml.impress                     sxi
    application/vnd.sun.xml.impress.template            sti
    application/vnd.sun.xml.math                        sxm
    application/vnd.sun.xml.writer                      sxw
    application/vnd.sun.xml.writer.global               sxg
    application/vnd.sun.xml.writer.template             stw

Very minor, but technically you might want to use \Q on your regular
expressions:

    qr!\Qapplication/vnd.sun.xml.writer!,

just to say that the dots are really just dots.
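To illustrate the point about \Q, here is a small sketch comparing an unquoted pattern with a quoted one.  The second test string is made up purely to show the difference, and the ^ and \z anchors are my addition.

```perl
use strict;
use warnings;

# Unquoted: each "." matches any single character.
my $loose = qr!^application/vnd.sun.xml.writer\z!;

# Quoted: \Q ... \E turns the dots into literal dots.
my $strict = qr!^\Qapplication/vnd.sun.xml.writer\E\z!;

for my $type (
    'application/vnd.sun.xml.writer',
    'application/vndXsunXxmlXwriter',    # made-up string, not a real type
) {
    printf "%-32s loose=%-3s strict=%s\n", $type,
        ( $type =~ $loose  ? 'yes' : 'no' ),
        ( $type =~ $strict ? 'yes' : 'no' );
}
```

In practice a stray match like the second one is unlikely, which is why this is only a very minor point.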

Maybe you could just do this:

    mimetypes => [
        qr!^application/vnd\.sun\.xml\.!,
        qr!^application/vnd\.oasis\.opendocument\.!,
    ],
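A quick way to check that the two prefix patterns cover what you expect is to run some types through them.  This is only a sketch: the handles() helper is a hypothetical name of mine, not part of SWISH::Filter.

```perl
use strict;
use warnings;

my @mimetypes = (
    qr!^application/vnd\.sun\.xml\.!,
    qr!^application/vnd\.oasis\.opendocument\.!,
);

# Hypothetical helper: true if any pattern matches the given type.
sub handles {
    my ($type) = @_;
    return scalar grep { $type =~ $_ } @mimetypes;
}

for my $type (
    qw(
    application/vnd.sun.xml.writer
    application/vnd.sun.xml.calc.template
    application/vnd.oasis.opendocument.text
    text/html
    )
  )
{
    printf "%-42s %s\n", $type, handles($type) ? 'handled' : 'skipped';
}
```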


Why not use Archive::Zip instead of relying on the unzip program?

    use Archive::Zip qw( :ERROR_CODES );
    my $zip = Archive::Zip->new();
    die 'read error' unless $zip->read( 'test.sxw' ) == AZ_OK;

    my $content = $zip->contents( 'content.xml' ) || die "Failed to find content";
    my $meta = $zip->contents( 'meta.xml' ) || die "Failed to find meta";
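A slightly fuller sketch of the same idea follows.  It assumes Archive::Zip is installed, builds a tiny stand-in archive so it can run without a real .sxw file, and checks that a member exists before fetching it.  The zip_member() helper and the placeholder XML are hypothetical.

```perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );
use File::Temp qw( tempdir );

# Build a stand-in document so the example runs on its own.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/test.sxw";
my $out  = Archive::Zip->new();
$out->addString( '<office:document-content/>', 'content.xml' );
$out->addString( '<office:document-meta/>',    'meta.xml' );
$out->writeToFileNamed($file) == AZ_OK or die "Cannot write $file";

# Hypothetical helper: fetch one member, dying if it is missing.
sub zip_member {
    my ( $zip, $name ) = @_;
    defined $zip->memberNamed($name) or die "No $name in archive";
    return scalar $zip->contents($name);
}

my $zip = Archive::Zip->new();
$zip->read($file) == AZ_OK or die "Cannot read $file";

my $content = zip_member( $zip, 'content.xml' );
my $meta    = zip_member( $zip, 'meta.xml' );
print "content.xml: $content\nmeta.xml: $meta\n";
```

Checking with memberNamed() first means an empty member is not mistaken for a missing one.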

Then you can also avoid dealing with this:

    open (IN, "unzip -cq $file $part |" );

        while ( <IN> ) {
            $xml .= $_;
        }

        close ( IN );

and instead just do:

    use XML::Parser;

    my $p = XML::Parser->new;
    $p->parse( $content );    # $content is the string read from the zip


Then you avoid the fork.  You also avoid interpolating $file and $part
into a shell command line:

    open (IN, "unzip -cq $file $part |" );

That is fine in this case, but in general it's good to avoid passing
data through the shell.
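If an external program is ever needed, the list form of a piped open runs it without a shell.  Here is a sketch; read_command() is a hypothetical helper, and perl itself ($^X) stands in for unzip so the example runs anywhere.  List-form pipe opens are a Unix feature and only work on Windows with newer perls.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command with no shell involved and
# return everything it prints.
sub read_command {
    my @cmd = @_;
    open( my $fh, '-|', @cmd ) or die "Cannot run $cmd[0]: $!";
    local $/;    # slurp mode
    my $out = <$fh>;
    close $fh or die "$cmd[0] exited with status $?";
    return $out;
}

# A file name that would cause trouble if a shell saw it.
my $file = 'my doc; echo oops.sxw';

# perl stands in for unzip here; it just echoes its argument back.
my $out = read_command( $^X, '-e', 'print $ARGV[0]', $file );
print "$out\n";

# With unzip the call would look like:
#   my $xml = read_command( 'unzip', '-cq', $file, $part );
```

The argument arrives as a single string, spaces and semicolon included, because nothing ever parses it as a command line.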


> Also, there is a problem with Filter.pm not knowing what to do with the 
> file so I've made a temp. kludge with the line:
>  	$mimetype = 'text/xml';

That's what you are supposed to do.  You are converting from one mime
type to another mime type, so you should just say:

   $doc->set_content_type( 'text/xml' );

I don't really see why you would need to extract out the mimetype from
the opendoc file.
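Roughly, the filter method could then look like the sketch below.  It is not a tested filter: the package name is guessed from the draft's file name, the choice to return nothing on failure is my assumption, and a real filter would also inherit from SWISH::Filters::Base (left out so the sketch compiles on its own).  It assumes Archive::Zip is installed.

```perl
package SWISH::Filters::ODF2xml;    # name assumed from the draft's file name
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# A real filter would also inherit from SWISH::Filters::Base.

sub new {
    my ($class) = @_;
    return bless {
        mimetypes => [
            qr!^application/vnd\.sun\.xml\.!,
            qr!^application/vnd\.oasis\.opendocument\.!,
        ],
    }, $class;
}

sub filter {
    my ( $self, $doc ) = @_;

    # Assumption: returning nothing tells the caller we did not filter.
    my $zip = Archive::Zip->new();
    return unless $zip->read( $doc->fetch_filename ) == AZ_OK;
    return unless defined $zip->memberNamed('content.xml');

    my $content = $zip->contents('content.xml');

    # The output is XML now, whatever the input type was.
    $doc->set_content_type('text/xml');
    return \$content;
}

1;
```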


> I've run it through swish-filter-test, but when actually using it there 
> are some XML parsing errors which I cannot account for: stuff about 
> embedded nulls, junk, not well-formed.

That's where it might take more work.

I might be tempted to put this back on:

    <?xml version="1.0" encoding="utf-8"?>

I also get this:

    Wide character in print at /usr/local/bin/swish-filter-test line 165.

You might need to use :utf8 on STDOUT, but I'm not so sure how the
rest of the filter (and spider.pl) will deal with it.
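For reference, this is the general Perl mechanism behind that warning, shown on its own rather than inside swish-filter-test.  The sample string is made up; the in-memory handle is only there to show the encoded byte count.

```perl
use strict;
use warnings;

# Character string with one code point above 0xFF (U+2013, en dash).
my $text = "Nood\x{e9}n \x{2013} OpenDocument";

# Without an encoding layer, printing $text warns
# "Wide character in print".  Declaring the layer once avoids that.
binmode( STDOUT, ':encoding(UTF-8)' ) or die "binmode failed: $!";
print $text, "\n";

# Same idea with an in-memory handle, to see the encoded bytes.
my $bytes = '';
open( my $fh, '>:encoding(UTF-8)', \$bytes ) or die "open failed: $!";
print {$fh} $text;
close $fh or die "close failed: $!";

printf "%d characters became %d bytes\n", length($text), length($bytes);
```

Whether the rest of the filter chain expects characters or bytes is a separate question that this sketch does not answer.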

> Also, how do I get the filter to create or earmark a value for 
> 'swishtitle' or other fields?

That's what I was talking about before.  For swishtitle you would have
to format as HTML and then use <title>.  The advantage of formatting
for html is that it makes it easy to index html and open doc files in
the same index (and same config file).

Again, you would need to parse the tree and extract out the content.
You might use something like XML::DOM, XML::TreeBuilder or XML::XPath
to "query" the xml file for the content parts you want to include in
your output document.
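As a rough sketch of that approach, using XML::Parser since it is already in play: pull the title out of meta.xml and the paragraphs out of content.xml, then write a small HTML document.  The sample XML is made up and far simpler than a real file, text_of() and escape_html() are hypothetical helpers, and the element names (dc:title, text:p, text:h) should be checked against real documents.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the text inside each named element.
sub text_of {
    my ( $xml, @names ) = @_;
    my %want = map { $_ => 1 } @names;
    my @found;
    my $depth = 0;    # > 0 while inside a wanted element

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ( $expat, $element ) = @_;
                if    ($depth)           { $depth++ }
                elsif ( $want{$element} ) { push @found, ''; $depth = 1 }
            },
            End  => sub { $depth-- if $depth },
            Char => sub {
                my ( $expat, $string ) = @_;
                $found[-1] .= $string if $depth;
            },
        },
    );
    $parser->parse($xml);
    return @found;
}

# Hypothetical helper: minimal HTML escaping for text nodes.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Made-up stand-ins for meta.xml and content.xml.
my $meta = '<office:document-meta><office:meta>'
  . '<dc:title>Quarterly report</dc:title>'
  . '</office:meta></office:document-meta>';
my $content = '<office:document-content><office:body>'
  . '<text:h>Summary</text:h>'
  . '<text:p>Tea &amp; <text:span>cake</text:span> sales rose.</text:p>'
  . '</office:body></office:document-content>';

my ($title) = text_of( $meta, 'dc:title' );
my @paras = text_of( $content, 'text:p', 'text:h' );

my $html = join '',
  '<html><head><title>', escape_html( $title // '' ),
  "</title></head><body>\n",
  ( map { '<p>' . escape_html($_) . "</p>\n" } @paras ),
  "</body></html>\n";
print $html;
```

With output like this, swish-e's HTML parser can pick up the title the same way it does for ordinary HTML files.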

The other way is to use aliases in the swish config file to map the
tag names to things like "swishtitle" when indexing.

But the plus side of generating an html-like document is that it makes
it easy for someone just indexing html documents to then start
indexing opendoc docs, because the filter makes them look like other
html docs.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Nov 16 06:27:23 2005