Re: Indexing XLS Files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 18 2003 - 18:58:59 GMT
On Fri, Jul 18, 2003 at 02:12:48PM -0400, Jeffrey.Grunstein@ny.frb.org wrote:
> 
> We are using the prog method, so that's -S prog.
> And it is now indexing the spreadsheets so we're not
> getting "wrong content type" any more.

Ah, ok, so you have:

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{ text/html text/plain application/pdf application/msword application/vnd.ms-excel application/x-excel };
            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

So it looks like you need to add the other content type there.  Did you
try that?

> We do have Spreadsheet::ParseExcel installed
> but I don't think we're using SWISH::Filter.

Doesn't look like it from the config you sent, but I'm confused about 
this.  You are telling the spider to run these filters:

        filter_content  => [ \&pdf, \&doc, \&xls ],

Now the xls filter in your config looks like:

use XLtoHTML;
sub xls {
   my ( $uri, $server, $response, $content_ref ) = @_;

#   return 1 unless $response->content_type eq 'application/x-excel';
#   return 1 unless $response->content_type eq 'application/vnd.ms-excel';
    return 1 unless ($response->content_type eq 'application/vnd.ms-excel') or
                    ($response->content_type eq 'application/vnd.ms-excel');

   $$content_ref = ${XLtoHTML( $content_ref )};
   $$content_ref =~ tr/ / /s;

   # for logging counts
   $server->{counts}{'XLS transformed'}++;

   return 1;
}

For one thing, you are specifying the same content type twice.
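As a rough sketch only, the guard could check both Excel types through a small lookup. The helper name is_excel_type is made up for illustration and is not part of spider.pl or your config, and I'm assuming application/x-excel is the second type you meant to test:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): true if the content
# type is one of the two Excel types mentioned in this thread.
my %excel_type = map { $_ => 1 } qw(
    application/vnd.ms-excel
    application/x-excel
);

sub is_excel_type {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $excel_type{ lc $content_type } ? 1 : 0;
}

# Inside xls() the guard would then read:
#   return 1 unless is_excel_type( $response->content_type );
```

Keeping the types in one hash means a new type only has to be added in one place.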

Second, I assumed that the XLtoHTML module you are using was part of the
SWISH::Filter::XLtoHTML module.  Where is that module from?

You can always just add some debugging code right in that xls() sub:

   print STDERR "Converted spreadsheet is:\n", $$content_ref,"\n";
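If the converted output is large, a slightly fuller version might report the size and print only the beginning. This is just a sketch: debug_converted and its 300-character default are my own invention, not anything the spider provides.

```perl
use strict;
use warnings;

# Hypothetical helper: report how much text a filter produced and
# show only the first $max characters of it on STDERR.
sub debug_converted {
    my ( $uri, $content_ref, $max ) = @_;
    $max = 300 unless defined $max;

    my $len     = length $$content_ref;
    my $preview = substr( $$content_ref, 0, $max );
    my $report  = "$uri: converted to $len characters\n$preview\n";

    print STDERR $report;
    return $report;    # returned as well, so the caller can log it elsewhere
}
```

You would call it in xls() right after the conversion line, e.g. debug_converted( $uri, $content_ref ); a very small character count there would suggest the conversion itself is losing the text before swish-e ever sees it.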

> It is indexing the XLS files but it's only indexing 2 words
> for each of them.  Why?

You asked that before: you can see which words are being indexed with
the -T indexed_words option.  That might give you some hints about what
is happening.  Knowing what those two words are would help in debugging.


-- 
Bill Moseley
moseley@hank.org
Received on Fri Jul 18 18:59:13 2003