Re: [swish-e] Limiting content from spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Mar 29 2007 - 21:54:58 GMT
On Thu, Mar 29, 2007 at 02:18:28PM -0500, mitch-swish@claborn.net wrote:
> I want to eliminate some portions of the pages on our site from indexing -
> I've marked them in the HTML with specially formatted HTML comments.  

Do you want to ignore parts of pages or entire pages?  To ignore parts
of pages (e.g. menus, headers, footers) you can wrap them in these
comments:

    <!-- noindex -->
    <!-- index -->


> The way I made it work was to add this code at the very top of
> output_content in spider.pl (V 1.26):
> 
> if ( my $fn = $server->{alter_content} ) {
>     eval {
>         $fn->($server, $content, $uri, $response); 
>     };
>     die "alter_content died for $uri: $@\n" if $@;
> }
> 
> Is this a good way to accomplish it?  I put my actual logic in the config
> file of course.

Sure -- you can hack it to fit your needs.
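
For reference, here is a minimal sketch of what the config-file side of that might look like. It assumes the patched output_content shown above, that the content arrives as a reference to the page text, and a made-up marker pair (begin-skip / end-skip); alter_content is the key added by that patch, not a stock spider.pl option, and strip_marked_regions is a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical marker pair; substitute whatever comments your pages use.
my $start = qr/<!--\s*begin-skip\s*-->/i;
my $end   = qr/<!--\s*end-skip\s*-->/i;

# Remove every marked region from the page in place.
# Assumption: $content_ref is a reference to the page content.
sub strip_marked_regions {
    my ( $server, $content_ref, $uri, $response ) = @_;

    # Non-greedy match so each begin/end pair is removed separately;
    # /s lets a region span multiple lines.
    $$content_ref =~ s/$start.*?$end//gs;
    return;
}

# Server entry for the spider config. Other settings a real config
# needs are omitted here; base_url is only a placeholder.
our @servers = (
    {
        base_url      => 'http://localhost/',
        alter_content => \&strip_marked_regions,
    },
);

1;
```

If the patched hook passes the content by value instead, the substitution would need to act on the caller's variable rather than a dereferenced copy.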

> I could have also used the existing output_function callback, but there is a
> lot of miscellaneous stuff that happens after that call before the output
> that I would have to replicate in my code if I did so.

That's rather late in the process.  What happens after that call that
makes it not useful?

-- 
Bill Moseley
moseley@hank.org

Received on Thu Mar 29 17:54:59 2007