Re: Swish-e and OpenDocument and metadata

From: Lars D. Noodén <lars(at)not-real.umich.edu>
Date: Wed Nov 09 2005 - 07:35:47 GMT
I'm looking at Filter.pm and the existing filters, rather than
spider.pl, and seeing if I can't work out one for OpenDocument.  My
Perl is quite rusty though, so I may not get that far.
swish-filter-test (in verbose mode) has been a big help.

I initially thought I could simply get away with a configuration
directive like this:
  FileFilterMatch unzip "-p '%p' content.xml meta.xml" /\.od[tspmhgcif]$/

but only the contents of the first file ever get indexed, not both.
That works fine if I want to index only the metadata, which is quite
likely the case for document templates, but then the MIME type gets
left out.

-Lars
Lars Nooden (lars@umich.edu)
	On the Internet, nobody knows you're a dog ...
	... until you start barking.

On Fri, 4 Nov 2005, Bill Moseley wrote:

> On Fri, Nov 04, 2005 at 04:56:46AM -0800, Lars D. Noodén wrote:
>> Does the filter need to return the real mime type[1]?
>
> All this could be vastly improved.  The filters register regular
> expressions for the MIME types they handle.  If the incoming document
> matches a filter's MIME type, then the filter is passed the incoming
> content.
>
> The filter then converts it into another MIME type (normally that
> means text/{html|xml|txt}) and returns that text.
>
> In the case of the spider (which uses SWISH::Filter for its filtering
> needs), if the returned document is of type text/* then the document
> is passed on to swish.
>
> So, yes, you would need to set the mime type after filtering.
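As a rough illustration of the flow described above: each filter
registers a MIME-type pattern, converts matching content, and sets the
new type.  This is a Python mock-up, not the real SWISH::Filter API,
and all names in it are invented:

```python
import re

class Doc:
    """Tiny stand-in for an incoming document: a MIME type plus content."""
    def __init__(self, content_type, content):
        self.content_type = content_type
        self.content = content

def extract_xml(raw):
    # Placeholder for a real conversion step (e.g. pulling content.xml
    # out of the zip archive and returning it as text).
    return "<converted/>"

# Each filter registers a regex of the MIME types it handles and a
# routine that returns (new_mime_type, new_content).
filters = [
    (re.compile(r"application/vnd\.oasis\.opendocument\."),
     lambda doc: ("text/xml", extract_xml(doc.content))),
]

def run_filters(doc):
    for pattern, convert in filters:
        if pattern.match(doc.content_type):
            # The filter converts the content *and* sets the new MIME
            # type; downstream code then sees text/* and passes the
            # document on to swish.
            doc.content_type, doc.content = convert(doc)
    return doc

doc = run_filters(Doc("application/vnd.oasis.opendocument.text", b"..."))
print(doc.content_type)   # text/xml
```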
>
> Seems like looking at the existing filters might explain it better.
>
>> Can swish-e process two separate XML files (content + metadata) as one
>> if they are concatenated?
>
> Yes, if the resulting document is a valid XML file.  You might want
> to use one of the XML parsers to merge the documents correctly.
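One way to do that merge with a real XML parser is to parse each file
and re-root both trees under a single wrapper element, so the result
has exactly one root.  A Python sketch (the `document` wrapper name is
invented, not part of ODF):

```python
import xml.etree.ElementTree as ET

# Stand-ins for the two extracted files.
content = ET.fromstring("<content><p>body text</p></content>")
meta = ET.fromstring("<meta><title>A title</title></meta>")

# Re-root both trees under a single wrapper element.
merged = ET.Element("document")
merged.append(meta)
merged.append(content)

# One root element, so an XML parser can handle the result.
print(ET.tostring(merged, encoding="unicode"))
```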
>
> One thing the filters are not set up to do is take a single file
> (like a tar or zip file) and index its contents as separate files.
> It should do that, but it doesn't.
>
> That can easily be hacked with spider.pl: because swish is connected
> to stdout, all you have to do is correctly format each document (add
> a few headers), send it to stdout, and it will get indexed.
>
>
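The "few headers" mentioned in the quoted message refer to swish-e's
-S prog input protocol: each document is preceded by Path-Name and
Content-Length headers (optionally Document-Type and others), then a
blank line, then exactly Content-Length bytes of content.  A Python
sketch that emits each archive member as its own document (the paths
and file contents are invented):

```python
import io
import zipfile

def record(path, content, doc_type="XML*"):
    # Headers read before each document in swish-e's -S prog protocol:
    # Path-Name, Content-Length (bytes of content that follow the blank
    # line), and an optional Document-Type naming the parser.
    return ("Path-Name: %s\nContent-Length: %d\nDocument-Type: %s\n\n%s"
            % (path, len(content), doc_type, content))

# Toy archive standing in for an .odt file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("content.xml", "<content><p>body text</p></content>")
    z.writestr("meta.xml", "<meta><title>A title</title></meta>")

# Each member goes to swish as a separate document on stdout.
with zipfile.ZipFile(buf) as z:
    for name in ("content.xml", "meta.xml"):
        print(record("example.odt/" + name,
                     z.read(name).decode("utf-8")), end="")
```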


Received on Tue Nov 8 23:35:51 2005