If I download the Excel file and test it with Filter.pm, I get this:
[Bart]$ perl -I.. Filter.pm test adr03rates.xls
Testing mode for Filter.pm
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
File: adr03rates.xls
Content-type: application/excel
** NOT FILTERED **
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
If I use SWISH::Filter (with Spreadsheet::ParseExcel), it seems to try to parse the file, but with errors:
19796 Warning - http://localhost/2003/adr03rates.xls: substr outside of string at /usr/local/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel.pm line 1253.
19780 Warning - http://localhost/2003/adr03rates.xls: Use of uninitialized value in unpack at /usr/local/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel.pm line 1253.
Summary for: http://localhost/2003/adr03rates.xls
Skipped: 1 (0.0/sec)
Unique URLs: 1 (0.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
I am not sure if the ParseExcel module is causing the problem or not. Please help.
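One way to narrow this down is to run Spreadsheet::ParseExcel on the downloaded file by itself, with swish-e and the spider out of the picture. The following is only a rough sketch: check_xls is a made-up helper name, and the parse() and worksheets() calls are the ones documented for current CPAN releases of the module, so an older installed release may spell them differently.

```perl
use strict;
use warnings;

# Hypothetical helper: try to parse one .xls file and return a one-line status.
sub check_xls {
    my ($file) = @_;

    # Load the module at run time so a missing install is reported cleanly.
    eval { require Spreadsheet::ParseExcel; 1 }
        or return "module missing: Spreadsheet::ParseExcel is not installed";

    my $parser = Spreadsheet::ParseExcel->new;

    # Collect warnings (such as "substr outside of string") instead of
    # letting them scroll past on STDERR.
    my @warnings;
    my $workbook = do {
        local $SIG{__WARN__} = sub { push @warnings, $_[0] };
        eval { $parser->parse($file) };
    };

    # parse() returning nothing, or dying, both count as a failed parse.
    unless ($workbook) {
        my $why = $@ ? $@ : 'parser returned no workbook';
        chomp $why;
        return "parse failed: $file ($why)";
    }

    my @sheets = $workbook->worksheets;
    return sprintf "parsed OK: %d worksheet(s), %d warning(s)",
        scalar @sheets, scalar @warnings;
}

# Check every file named on the command line.
print check_xls($_), "\n" for @ARGV;
```

If this reports the same warnings for adr03rates.xls, the problem would seem to lie in the parser module (or in the file) rather than in the swish-e filter code.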
-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Wednesday, May 28, 2003 11:00 AM
To: Roubart Capcap
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Good Excel parser
On Wed, May 28, 2003 at 09:00:19AM -0700, Roubart Capcap wrote:
> Hello,
>
> Does anybody know of a good Excel parser? I tried the Swish Filters
> with the following code in my spider.pl:
>
> use lib '/swish-e-2.2.3/filters/SWISH/Filters';
> use XLtoHTML;
> sub xl {
> my ( $uri, $server, $response, $content_ref ) = @_;
> return 1 unless $response->content_type eq 'application/vnd.ms-excel';
> # for logging counts
> $server->{counts}{'XLS transformed'}++;
> $$content_ref = ${XLtoHTML( $content_ref )};
> $$content_ref =~ tr/ / /s;
> return 1;
> }
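A side note on the content-type test in the code above: the Filter.pm test run reported the file as application/excel, while this callback only accepts application/vnd.ms-excel, so it returns early whenever the server sends any other type. A minimal sketch of a more tolerant check follows. The list of types and the is_excel_type name are assumptions made for illustration, not anything taken from swish-e.

```perl
use strict;
use warnings;

# MIME types sometimes used for .xls files. This list is an assumption;
# extend or trim it to match what your web server actually sends.
my %EXCEL_TYPE = map { $_ => 1 } qw(
    application/vnd.ms-excel
    application/excel
    application/x-excel
    application/msexcel
    application/x-msexcel
);

# Hypothetical helper: true if the content type looks like Excel.
sub is_excel_type {
    my ($type) = @_;
    return 0 unless defined $type;
    $type =~ s/;.*//s;          # drop parameters such as "; charset=..."
    $type =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $EXCEL_TYPE{ lc $type } ? 1 : 0;
}
```

In the callback, the early return would then read: return 1 unless is_excel_type( $response->content_type );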
I assume you have Spreadsheet::ParseExcel installed? I also don't know
if you can call XLtoHTML() directly. You should call it from
SWISH::Filter. See
http://swish-e.org/dev/docs/Filter.html
There's also a "TESTING" section that shows how to test the filter
outside of swish-e or spider.pl.
It says:
[This module can be run as a program directly. Change directory
to the location of the Filter.pm module and run:
perl -I.. Filter.pm test foo.pdf bar.doc
Replace foo.pdf and bar.doc with real paths on your system. The -I.. is
needed for loading the filter modules.]
You don't really have to change directory to the location of Filter.pm.
You can run it from any directory. For example:
perl -I/home/moseley/swish-e/filters \
/home/moseley/swish-e/filters/Filter.pm \
test test.xls
That should show you if the filtering is working.
BTW -- The new version of Swish-e has a filter (a SWISH::Filter) that
uses the Perl module Spreadsheet::ParseExcel (available from CPAN). The
new spider.pl will automatically use it if you have
Spreadsheet::ParseExcel installed.
--
Bill Moseley
moseley@hank.org
Received on Wed May 28 22:43:28 2003