Re: Trying indexing Excel files with XLtoHTML

From: <moseley(at)not-real.hank.org>
Date: Thu Aug 21 2003 - 14:10:37 GMT
On Thu, Aug 21, 2003 at 01:12:34AM -0700, Bucharow Leonard wrote:


> Now I'm trying to index .xls files. I've read few mails in the list, but
> don't understand really yet.

Good, an excuse to write a long email...

Yes, it's confusing.  There are three (or more) ways to filter in swish-e.
First there was the FileFilter which just lets you pipe each document
through a filter.  

Then the -S prog method was added to swish-e and we
created a few "filters" that could be used in a program such as
spider.pl.  That's the setup you are trying to use below where you call 
the filters manually inside the spider config file.

I then wrote a set of Perl modules called SWISH::Filter.  It works by 
just passing a document and its content-type to SWISH::Filter and you 
get back the filtered doc and a new content-type.  The idea is that new 
filters can be added to your system (by installing a SWISH::Filter::* 
module that does something like convert Excel files) and no additional 
configuration is needed.

The XLtoHTML filter is part of this SWISH::Filter system.

So you are mixing the last two methods and that's probably why it's not
working.

All that flexibility comes at a cost of confusion!

> -Skipped http://localhost/test/excel.xls due to 'filter_content' user
> supplied function #3 death 'Undefined subroutine &main::XLtoHTML called at
> /usr/local/swish-e/conf/SpiderConfig.pl line 205.
> "
> What does it mean? What's wrong? Can this code actually work?

It simply means you are calling a function called XLtoHTML() and it 
doesn't exist (in the "main" package).
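If you want to see the same failure in isolation, here is a minimal sketch. The sub name is taken from your error message; the rest is only for illustration. Calling a sub that was never defined in package main dies at run time, and eval catches the message:

```perl
use strict;
use warnings;

# No "sub XLtoHTML" is defined anywhere in this file, so the call below
# is looked up in package main at run time and fails.
my $error = '';
eval {
    XLtoHTML('dummy');    # the parens make Perl compile this as a sub call
    1;
} or $error = $@;

# Prints the "Undefined subroutine &main::XLtoHTML called at ..." message.
print "Caught: $error";
```

That's the situation spider.pl is reporting: it caught the die from your callback and skipped the document.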

So you should just use the third method.  If you look at the example 
spider config file "SwishSpiderConfig.pl" you will see a section that 
uses SWISH::Filter.

Note:  I just noticed I had some junk text at the start of the 
SWISH::Filter.pm module in CVS.  You may need to open it with an editor 
and remove the text.  It should just say:

  package SWISH::Filter;

I sometimes forget to move my mouse into the correct window before 
typing!

So in the server config you add a line like this:


    filter_content => \&filter_content,

that says to call the filter_content() subroutine for each document 
after it's been loaded into memory.
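For context, here is a rough sketch of where that line might sit in a spider config file. It assumes spider.pl's usual layout of one hash per server in the @servers array; the base_url and email values are placeholders I made up (the URL is guessed from your error message), so adjust them for your setup:

```perl
# Sketch of a spider config file.  All values are placeholders.
our @servers = (
    {
        base_url       => 'http://localhost/test/',   # assumed starting URL
        email          => 'you@example.com',          # placeholder contact address
        filter_content => \&filter_content,           # callback run for each fetched document
    },
);

# filter_content() must be defined in this same file (package main),
# otherwise you get an "Undefined subroutine" error when it is called.

1;
```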

And here's the filter_content() subroutine.  You can see all this in 
the SwishSpiderConfig.pl file, but I'll add additional comments here, 
marked with "##", if you want to know how it works.

# This is an example of how to use the SWISH::Filter module included
# with the swish-e distribution.  Make sure that SWISH::Filter is
# in the @INC path (e.g. set PERL5LIB before running swish).
#
# Returns:
#      true if content-type is text/* or if the document was filtered
#      false if document was not filtered
#      aborts if module or filter object cannot be created.
#

my $filter;  # cache the object.

sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    my $content_type = $response->content_type;

    # Ignore text/* content type -- no need to filter
    return 1 if !$content_type || $content_type =~ m!^text/!;
    

    # Load the module - returns FALSE if cannot load module.
    unless ( $filter ) {
        eval { require SWISH::Filter };
        if ( $@ ) {
            $server->{abort} = $@;
            return;
        }
        $filter = SWISH::Filter->new;
        unless ( $filter ) {
            $server->{abort} = "Failed to create filter object";
            return;
        }
    }

    # If not filtered return false and doc will be ignored (not indexed)

    ## OK, here's where all the work is done.  You pass in the document
    ## and its content type.  I believe the name is only used for
    ## printing messages.

    ## Returning false indicates that SWISH::Filter was unable to handle
    ## the content type (or failed for some other reason).  That false
    ## value is returned to spider.pl, which will skip the document.

    return unless $filter->filter(
        document => $content_ref,
        name     => $response->base,
        content_type => $content_type,
    );


    ## Since filter() returned true, replace the unfiltered document
    ## with the filtered document.

    # nicer to use **char...
    $$content_ref = ${$filter->fetch_doc};


    ## Finally, SWISH::Filter has a table to assign a parser type
    ## (TXT*|HTML*|XML*) based on the content type of the new
    ## filtered content.  That's added to a header value passed
    ## back to swish-e by spider.pl and tells swish-e what parser
    ## to use for this document.

    # let's see if we can set the parser.
    $server->{parser_type} = $filter->swish_parser_type || '';

    return 1;
}

Now, if you are having problems with SWISH::Filter you can turn on 
debugging.  Set the environment variable FILTER_DEBUG=1 and then 
watch stderr.  

You can also test the filters directly by running the module as a 
program.

NOTE: besides the problem noted above in Filter.pm, I had to add another
content type to the XLtoHTML.pm module.  You may not need to do this, 
depending on what content type your server is returning for xls files.  
I needed to add:

   application/excel

So I changed this:

    return unless $filter->content_type =~ m!application/vnd.ms-excel!;

to:

    return unless $filter->content_type =~ m!(application/vnd.ms-excel|application/excel)!;
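One thing to keep in mind with that pattern: it is unanchored and the dots are unescaped, so it matches a little more than the two types listed. If that matters to you, a stricter check could look something like this sketch. The helper name and the sample types are mine and are not part of XLtoHTML.pm:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XLtoHTML.pm: true only for the two
# Excel content types mentioned above.  Case is ignored, and a trailing
# parameter such as "; charset=..." is allowed.
sub is_excel_type {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m{^application/(?:vnd\.ms-excel|excel)\s*(?:;|$)}i
        ? 1
        : 0;
}

# Try it on a few sample content types.
for my $type (
    'application/vnd.ms-excel', 'application/excel',
    'application/vndXms-excel', 'text/html'
  )
{
    printf "%-28s %s\n", $type, is_excel_type($type) ? 'excel' : 'skip';
}
```

The simpler pattern above is fine for most servers; this is only worth doing if you see documents being picked up by the wrong filter.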




Here are two examples of testing the filters:

First cd to where Filter.pm is and set the @INC path with PERL5LIB.

moseley@bumby:~$ cd /usr/local/lib/swish-e/perl/SWISH/
moseley@bumby:/usr/local/lib/swish-e/perl/SWISH$ export PERL5LIB=`pwd`/..

moseley@bumby:/usr/local/lib/swish-e/perl/SWISH$ FILTER_DEBUG=1 perl Filter.pm test /home/moseley/party.xls 
Testing mode for Filter.pm


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Find path of [catdoc] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/catdoc]
Looking at [/usr/bin/catdoc]
Find path of [pdftotext] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/pdftotext]
Looking at [/usr/bin/pdftotext]
Find path of [pdfinfo] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/pdfinfo]
Looking at [/usr/bin/pdfinfo]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory

trying to load [Spreadsheet::ParseExcel]
 ** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
 ** Loaded HTML::Entities **
File: /home/moseley/party.xls
Content-type: text/html

<html>    
<head>
    <title>Sheet1 - /home/moseley/party.xls v.1536</title>
    <meta name="Filename" content="/home/moseley/party.xls">
    <meta name="Version" content="1536">
    <meta name="Sheetcount" content="3">
</head>
<body>
<h2>Sheet1</h2>
<table>
<tr>

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



Here's another example with a pdf file:

moseley@bumby:/usr/local/lib/swish-e/perl/SWISH$ FILTER_DEBUG=1 perl Filter.pm test /home/moseley/apache/test.pdf
Testing mode for Filter.pm

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Find path of [catdoc] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/catdoc]
Looking at [/usr/bin/catdoc]
Find path of [pdftotext] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/pdftotext]
Looking at [/usr/bin/pdftotext]
Find path of [pdfinfo] in /usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
Looking at [/usr/local/bin/pdfinfo]
Looking at [/usr/bin/pdfinfo]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory

trying to load [Spreadsheet::ParseExcel]
Can not use Filter SWISH::Filters::XLtoHTML -- need to install Spreadsheet::ParseExcel: No such file or directory

File: /home/moseley/apache/test.pdf
Content-type: text/html

<html>    
<head>
<title>Acrobat Distiller 5.0.5 for Macintosh</title>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
<meta name="pages" content="2">
<meta name="pdf_version" content="1.3">
<meta name="producer" content="Acrobat Distiller 5.0.5 for Macintosh">
<meta name="tagged" content="no">
<meta name="title" content="Microsoft Word - LFE02a.doc">
</head>
<body>
<pre>

-- 
Bill Moseley
moseley@hank.org
Received on Thu Aug 21 14:11:29 2003