Trying indexing Excel files with XLtoHTML

From: Bucharow Leonard <Leonard.Bucharow(at)not-real.DLE-M.Bayern.de>
Date: Thu Aug 21 2003 - 08:12:38 GMT
Hi Bill and everyone indexing MS Excel files,

first, thanks Bill for the help with the proxy server; it works fine now with
the test_url callback function (I was too lazy to set up LWP::UserAgent :-)).

Now I'm trying to index .xls files. I've read a few mails on the list, but
I don't really understand it yet.

I'm using SWISH-E 2.4.0-pr1 with -S prog and spider.pl.

I've installed the following Perl modules:
- Spreadsheet::ParseExcel
- OLE::Storage_Lite (required by Spreadsheet::ParseExcel)
- IO::Scalar (IO-stringy-2.108, required by OLE::Storage_Lite)
- HTML::Entities (HTML-Parser-3.31)

As I said, I don't really understand what I'm doing, but I've changed/added
the following code (similar to pdf() and doc()) in SpiderConfig.pl:
"
test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{application/pdf
application/msword application/vnd.ms-excel};

            # This might be used if you only wanted to index PDF files, yet
            # still spider other content.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';

            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content  => [ \&pdf, \&doc, \&xls ], # \&xls doesn't work yet!!!
 },
"
"
use lib '/usr/local/swish-e/lib/swish-e/perl/SWISH/Filters';
use XLtoHTML;
sub xls {
   my ( $uri, $server, $response, $content_ref ) = @_;

   return 1 unless $response->content_type eq 'application/vnd.ms-excel';

   $$content_ref = ${XLtoHTML( $content_ref )};
   $$content_ref =~ tr/ / /s;

   # for logging counts
   $server->{counts}{'XLS transformed'}++;

   return 1;
}
"
During indexing I get the following error:
"
-Skipped http://localhost/test/excel.xls due to 'filter_content' user
supplied function #3 death 'Undefined subroutine &main::XLtoHTML called at
/usr/local/swish-e/conf/SpiderConfig.pl line 205.
"
What does it mean? What's wrong? Can this code actually work?
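
For what it's worth, the message reads as if Perl looked for a subroutine named XLtoHTML in package main and did not find one. The sketch below reproduces the same kind of error in isolation, assuming the cause is that loading a module does not by itself make its subroutines callable by their bare name from the calling package. My::Filter and to_html are hypothetical names, not the real XLtoHTML interface, which I have not checked.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a filter module; NOT the real XLtoHTML API.
package My::Filter;

sub to_html {
    my ($content_ref) = @_;
    my $html = "<pre>$$content_ref</pre>";
    return \$html;
}

package main;

my $content = 'cell A1';

# Unqualified call: Perl looks for main::to_html, which was never
# defined in or imported into package main, so this dies at run time.
my $result = eval { to_html( \$content ) };
my $error  = $@;

# Fully qualified call: resolved in the package that defines the sub.
my $html_ref = My::Filter::to_html( \$content );

print "unqualified call failed: $error";
print "qualified call returned: $$html_ref\n";
```

If that guess is right, the fix would be either to call the subroutine by its full package name or to import it, depending on what the module actually provides.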

Thanks in advance; any help is appreciated.
Leo
Received on Thu Aug 21 08:13:14 2003