On Fri, Jul 18, 2003 at 02:12:48PM -0400, Jeffrey.Grunstein@ny.frb.org wrote:
>
> We are using the prog method, so that's -S prog.
> And it is now indexing the spreadsheets so we're not
> getting "wrong content type" any more.
Ah, ok, so you have:
    test_response => sub {
        my $content_type = $_[2]->content_type;
        my $ok = grep { $_ eq $content_type } qw{
            text/html text/plain application/pdf application/msword
            application/vnd.ms-excel application/x-excel
        };
        return 1 if $ok;
        print STDERR "$_[0] wrong content type ( $content_type )\n";
        return;
    },
So it looks like you need to add the other content type there. Did you
try that?
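For what it's worth, here is a minimal sketch of the same check written with a
hash lookup, which makes it a bit easier to add one more type. The name
check_content_type and the %allowed_type hash are made up for this example;
they are not part of spider.pl. The list of types is just the one quoted above.

```perl
use strict;
use warnings;

# Content types the spider should accept. This is the list from the
# config quoted in this thread; add whatever type your server really sends.
my %allowed_type = map { $_ => 1 } qw(
    text/html
    text/plain
    application/pdf
    application/msword
    application/vnd.ms-excel
    application/x-excel
);

# Hypothetical callback with the same contract as the test_response sub
# above: return true to keep the document, false to skip it.
sub check_content_type {
    my ( $uri, $server, $response ) = @_;
    my $content_type = $response->content_type;
    return 1 if $allowed_type{$content_type};
    print STDERR "$uri wrong content type ( $content_type )\n";
    return;
}
```

Assuming the rest of the config stays as it is, this would be hooked up with
test_response => \&check_content_type.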
> We do have Spreadsheet::ParseExcel installed
> but I don't think we're using SWISH::Filter.
Doesn't look like it from the config you sent, but I'm confused about
this. You are telling the spider to run these filters:
filter_content => [ \&pdf, \&doc, \&xls ],
Now the xls filter in your config looks like:
    use XLtoHTML;
    sub xls {
        my ( $uri, $server, $response, $content_ref ) = @_;
        # return 1 unless $response->content_type eq 'application/x-excel';
        # return 1 unless $response->content_type eq 'application/vnd.ms-excel';
        return 1 unless ($response->content_type eq 'application/vnd.ms-excel') or
                        ($response->content_type eq 'application/vnd.ms-excel');
        $$content_ref = ${XLtoHTML( $content_ref )};
        $$content_ref =~ tr/ / /s;
        # for logging counts
        $server->{counts}{'XLS transformed'}++;
        return 1;
    }
For one thing, you are specifying the same content type twice.
Second, I assumed that the XLtoHTML module you are using was part of the
SWISH::Filter::XLtoHTML module. Where is that module from?
You can always just add some debugging code right in that xls() sub:
print STDERR "Converted spreadsheet is:\n", $$content_ref,"\n";
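Here is a rough sketch of both points, untested against your setup. The
helpers is_excel and debug_summary are names invented for this example, not
part of the spider or of any SWISH module. It also assumes the second
comparison in your sub was meant to be application/x-excel, going by the
commented-out line.

```perl
use strict;
use warnings;

# Assumption: the two Excel types are the ones that appear in the
# commented-out lines of the xls() sub quoted above.
my %excel_type = map { $_ => 1 } qw(
    application/vnd.ms-excel
    application/x-excel
);

# Hypothetical helper: returns 1 when the response carries either Excel
# content type, 0 otherwise.
sub is_excel {
    my ($response) = @_;
    my $type = $response->content_type;
    return ( defined $type && $excel_type{$type} ) ? 1 : 0;
}

# Hypothetical helper: build a one-line summary showing how much text the
# conversion produced and how it starts (first 200 characters at most).
sub debug_summary {
    my ( $uri, $content_ref ) = @_;
    my $length  = length $$content_ref;
    my $preview = substr $$content_ref, 0, 200;
    return "$uri: converted to $length chars, starts with: $preview\n";
}
```

Inside xls() that would be "return 1 unless is_excel($response);" at the top,
and "print STDERR debug_summary($uri, $content_ref);" right after the
conversion. If the summary shows only a couple of words, the problem is in the
conversion step and not in the indexing.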
> It is indexing the XLS files but it's only indexing 2 words
> for each of them. Why?
You asked that before -- you can see what words are being indexed with
the -T indexed_words option. That might give you some hints about what is
happening. Knowing what those two words are would help in debugging.
--
Bill Moseley
moseley@hank.org
Received on Fri Jul 18 18:59:13 2003