On Tue, Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
>
> I am having a strange problem indexing a combination of MSWord, .txt and PHP
> documents using spider.pl and feeding this into swish-e. If I index the PHP
> urls first, the documents are parsed and loaded as HTML. If I select the
> MSWord and other documents, which are filtered by the spider.pl filter
> routines, the MSWord and other documents are parsed as TXT (correctly), then
> when the subsequent PHP and HTML documents are parsed, they are parsed as
> TXT. The SwishSpiderConfig.pl file contains two entries, the URL with the
> MSWord links, and the URL with only PHP links.
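Ah yep, I see the problem. If you look below you'll notice that
$server->{parser_type} is only set when a document is filtered, and it is
never cleared afterwards. A text/* document returns early from
filter_content, so it still sees the value left by the last filtered
document. It needs to be cleared on every call. Try adding the "delete"
line below.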
I don't know why request-specific data is in that global structure. I've put
it on my todo list...
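
To show the stale value in isolation, here is a minimal sketch. It is not
the real spider.pl code: filter_content_sketch and parser_after are made-up
names, the hash stands in for the per-server structure, and 'TXT' stands in
for whatever $filter->swish_parser_type would return.

```perl
use strict;
use warnings;

# Hypothetical, stripped-down filter_content: it takes only the shared
# server hash, a content type, and a flag saying whether to clear
# parser_type before doing anything else.
sub filter_content_sketch {
    my ( $server, $content_type, $clear_first ) = @_;

    delete $server->{parser_type} if $clear_first;

    # text/* is not filtered, so parser_type is never touched here
    return 1 if !$content_type || $content_type =~ m!^text/!;

    # Pretend the filter converted the document to plain text
    $server->{parser_type} = 'TXT';
    return 1;
}

# Run a list of content types through one shared hash and report which
# parser_type was in effect after each document.
sub parser_after {
    my ( $clear_first, @types ) = @_;
    my $server = {};
    my @seen;
    for my $type (@types) {
        filter_content_sketch( $server, $type, $clear_first );
        push @seen,
            defined $server->{parser_type} ? $server->{parser_type} : '(default)';
    }
    return @seen;
}

my @types = ( 'application/msword', 'text/html' );
print join( ', ', parser_after( 0, @types ) ), "\n";   # prints "TXT, TXT"
print join( ', ', parser_after( 1, @types ) ), "\n";   # prints "TXT, (default)"
```

Without the delete, the HTML document inherits TXT from the Word document
before it; with it, the parser type falls back to the default.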
> The prof1.pl spider.pl config file contains:
[...]
> sub filter_content {
> my ( $uri, $server, $response, $content_ref ) = @_;
delete $server->{parser_type};  # added: clear the value left by the previous document
>
> my $content_type = $response->content_type;
>
> # Ignore text/* content type -- no need to filter
> return 1 if !$content_type || $content_type =~ m!^text/!;
>
> # Load the module - returns FALSE if cannot load module.
> unless ( $filter ) {
> eval { require SWISH::Filter };
> if ( $@ ) {
> $server->{abort} = $@;
> return;
> }
> $filter = SWISH::Filter->new;
> unless ( $filter ) {
> $server->{abort} = "Failed to create filter object";
> return;
> }
> }
>
> # If not filtered return false and doc will be ignored (not indexed)
>
> return unless $filter->filter(
> document => $content_ref,
> name => $response->base,
> content_type => $content_type,
> );
>
> # nicer to use **char...
> $$content_ref = ${$filter->fetch_doc};
>
> # let's see if we can set the parser.
> $server->{parser_type} = $filter->swish_parser_type || '';
>
> return 1;
> }
>
>
>
>
>
> # Must return true...
>
> 1;
>
Received on Wed Apr 30 04:44:00 2003