I'm looking at Filter.pm and the existing filters, rather than spider.pl,
and seeing if I can't work out one for OpenDocument. My Perl is quite
rusty though, so I may not get that far. swish-filter-test (in verbose
mode) has been a big help.
I initially thought I could simply get away with a configuration directive
like this:
FileFilterMatch unzip "-p '%p' content.xml meta.xml" /\.od[tspmhgcif]$/
but only the contents of the first file ever get indexed, not both.
That does work fine if I want to index only the metadata, for
example, which is quite likely the case for document templates, but then
the MIME type gets left out.
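One way around the two-file problem might be to merge the parts into a single well-formed XML document before handing it to swish-e. The following is only a minimal sketch in Python, not a SWISH::Filter module. The function name `merge_odf_xml` and the wrapper element `opendocument` are made up for illustration. It relies on the fact that an OpenDocument package is a zip archive carrying a plain-text `mimetype` member, which is copied into an attribute here so the MIME type is not lost.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def merge_odf_xml(odf_bytes, members=("meta.xml", "content.xml")):
    """Return one well-formed XML document wrapping the given zip members."""
    # "opendocument" is a hypothetical wrapper name, not part of any standard.
    root = ET.Element("opendocument")
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as archive:
        names = set(archive.namelist())
        # OpenDocument packages keep their MIME type in a "mimetype" member.
        if "mimetype" in names:
            root.set("mimetype", archive.read("mimetype").decode("ascii").strip())
        for member in members:
            if member in names:
                # Parsing each part discards its own XML declaration, so the
                # merged output ends up with a single declaration and root.
                root.append(ET.fromstring(archive.read(member)))
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Calling `merge_odf_xml(open("example.odt", "rb").read())` (with a hypothetical file name) would return bytes that could be written to stdout for indexing as XML.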
-Lars
Lars Nooden (lars@umich.edu)
	On the Internet, nobody knows you're a dog ...
	... until you start barking.
On Fri, 4 Nov 2005, Bill Moseley wrote:
> On Fri, Nov 04, 2005 at 04:56:46AM -0800, Lars D. Noodén wrote:
>> Does the filter need to return the real mime type[1]?
>
> All this could be improved vastly. The filters register regular
> expressions of MIME types they handle. If the incoming document
> matches the filter's MIME type then the filter is passed the incoming
> content.
>
> The filter would then convert it into another MIME type (normally that
> means into text/{html|xml|txt} ) and return that text.
>
> In the case of the spider (that uses SWISH::Filter for its filtering
> needs) if the returned document is of type text/* then the document is
> passed onto swish.
>
> So, yes, you would need to set the mime type after filtering.
>
> Seems like looking at the existing filters might explain it better.
>
>> Can swish-e process two separate XML files (content + metadata) as one
>> if they are concatenated?
>
> If the resulting document is a valid xml file. You might want to use
> one of the xml parsers to merge the documents correctly.
>
> One thing the filters are not set up to do is to take a single file
> (like a tar or zip file) and then index its contents as separate files.
> It should do that, but it doesn't.
>
> That can easily be hacked with spider.pl. Because swish is connected to
> stdout, all you have to do is correctly format the document (add a few
> headers) and send it to stdout and it will get indexed.
>
>
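The "add a few headers and send it to stdout" idea can be sketched roughly as below. This is a Python illustration rather than a patch to spider.pl. The header names (`Path-Name`, `Content-Length`, `Document-Type`) and the `XML*` type are as I recall them from the swish-e documentation for external programs and should be checked there. The function names and the `archive/member` path scheme are hypothetical.

```python
import sys


def format_for_swish(path_name, content, document_type=None):
    """Build one record: header lines, a blank line, then the document body."""
    body = content.encode("utf-8") if isinstance(content, str) else content
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(body)}",  # counted in bytes, not characters
    ]
    if document_type:
        headers.append(f"Document-Type: {document_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


def emit_members(archive_name, members, out=None):
    """Write each (member_name, content) pair as its own record."""
    if out is None:
        out = sys.stdout.buffer  # binary stdout, so byte counts stay exact
    for member_name, content in members:
        # The "archive/member" path is an invented naming scheme.
        path = f"{archive_name}/{member_name}"
        out.write(format_for_swish(path, content, "XML*"))
```

Emitting one record per archive member is what would let a single zip or tar file show up in the index as several separate documents.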
Received on Tue Nov 8 23:35:51 2005