On Fri, Jul 25, 2003 at 12:24:37PM -0700, Roubart Capcap wrote:
> I am planning to add xls2csv as another filter to parse MS Excel files besides the XLtoHTML.pm filter. I copied Doc2txt.pm and made it xls2csv.pm with the following changes:
>
> package SWISH::Filters::xls2csv;
> use vars qw/ %FilterInfo $VERSION /;
>
>
> $VERSION = '0.01';
>
> %FilterInfo = (
>     type     => 2,   # normal filter
>     priority => 50,  # normal priority 1-100
> );
>
> sub filter {
>     my $filter = shift;
>
>     # Do we care about this document?
>     return unless $filter->content_type =~ m!application/vnd.ms-excel!;
>
>     # We need a file name to pass to the xls2csv program
>     my $file = $filter->fetch_filename;
>
>     # Grab output from running program
>     my $content = $filter->run_program( 'xls2csv', $file );
>
>     # update the document's content type
>     $filter->set_content_type( 'text/plain' );
>
> How and where do I specify that xls files should be parsed by both
> filters?
Both filters? If you are converting to csv then you wouldn't want the
other Excel filter to process it, would you?
Anyway, the type and priority are what set the sort order of the
filters. If you want other filters to keep processing the document
after yours, instead of the chain finishing there, call
$filter->set_continue. (All this is from looking at the docs, since I
can't remember how it works....)
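As a rough sketch of that ordering, assuming filters run in ascending
order of type and then ascending order of priority (that direction is
an assumption, and the filter name "Decode" and all the numbers other
than xls2csv's type are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical filter records; only the type/priority fields mirror %FilterInfo.
my @filters = (
    { name => 'XLtoHTML', type => 2, priority => 50 },
    { name => 'xls2csv',  type => 2, priority => 40 },
    { name => 'Decode',   type => 1, priority => 50 },
);

# Assumed ordering: lower type first, then lower priority first.
sub sort_filters {
    return sort {
        $a->{type} <=> $b->{type}
            || $a->{priority} <=> $b->{priority}
    } @_;
}

print join( ', ', map { $_->{name} } sort_filters(@filters) ), "\n";
```

Under that assumption, giving xls2csv a lower priority number than
XLtoHTML would make it see the document first.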
> And how do I specify that the output of xls2csv should be
> parsed by the TXT2 parser?
The way swish works normally is by mapping file extensions to the
parser. That's not a very good way to go, of course. Someday I'll add
processing by content-type internal to swish (or that's been the plan
for a while). But if using -S prog you can set the parser in a header.
I see this in spider.pl:
    # Set the parser type if specified by filtering
    if ( my $type = delete $server->{parser_type} ) {
        $headers .= "Document-Type: $type\n";
    } elsif ( $response->content_type =~ m!^text/(html|xml|plain)! ) {
        $type = $1 eq 'plain' ? 'txt' : $1;
        $headers .= "Document-Type: $type*\n";
    }
So it's setting a Document-Type: header to select the parser.
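For a -S prog program of your own, a minimal sketch of the same idea
might look like the following. The helper name format_doc, the path
and the sample content are made up for illustration; the header names
are the ones a -S prog program prints before each document.

```perl
use strict;
use warnings;

# Build one document in the header-plus-body form that a -S prog program prints.
# $type is a parser name such as 'TXT2'; pass undef to leave the header out.
# $content is assumed to be a byte string, so length() gives the byte count.
sub format_doc {
    my ( $path, $content, $type ) = @_;
    my $headers = "Path-Name: $path\n"
        . 'Content-Length: ' . length($content) . "\n";
    $headers .= "Document-Type: $type\n" if defined $type;

    # A blank line separates the headers from the document body.
    return $headers . "\n" . $content;
}

print format_doc( 'sheet.xls', "a,b,c\n1,2,3\n", 'TXT2' );
```

When no type is passed, no Document-Type: header is written, and
swish falls back to its usual file-extension mapping.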
Does that help?
--
Bill Moseley
moseley@hank.org
Received on Fri Jul 25 19:52:29 2003