RE: How to ignore a section of a page

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Jun 21 2003 - 14:39:28 GMT

On Fri, Jun 20, 2003 at 10:10:18AM -0700, Antun Karlovac wrote:
> > Now, I just need to figure out how to get it to ignore pages 
> > that are linked that I have within there.
> 
> That's easy - you tell the spider to ignore everything between the
> comments. That's what we did first, but then realized that the side
> effect of this was that pages there weren't indexed (which is what you
> want, but I don't).
> 
> It was something like:
>  filter_content => sub {
>    my $content_ref = $_[3];
>    $$content_ref =~ s/<!-- ignoreThis -->.*<!-- \/ignoreThis -->//gs;
>    return 1;
>  },

Don't do that! Besides the point that links are extracted before 
"filter_content" is called, you need to remember that Perl regular 
expression quantifiers are greedy by default, so ".*" runs to the last 
closing comment, not the nearest one.  See perldoc perlre.

moseley@bumby:~$ cat t.pl
my $content = <<EOF;
keep this
<!-- ignoreThis -->
drop this
<!-- /ignoreThis -->
Hey, where did this go?
<!-- ignoreThis -->
drop this, too
<!-- /ignoreThis -->
all done!
EOF

$content_ref = \$content;

$$content_ref =~ s/<!-- ignoreThis -->.*<!-- \/ignoreThis -->//gs;
print $content;


moseley@bumby:~$ perl t.pl
keep this

all done!
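
For comparison, here is a minimal sketch of the non-greedy form, assuming
the same marker comments as in t.pl above. The helper name strip_ignored
is hypothetical and is not part of the spider; the only change to the
regex is ".*?" in place of ".*", so each match stops at the nearest
closing comment.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: removes each ignoreThis block
# separately instead of everything from the first opening comment to
# the last closing comment.
sub strip_ignored {
    my ($text) = @_;
    # .*? is the non-greedy quantifier; /s lets "." match newlines,
    # /g repeats the substitution for every block.
    $text =~ s/<!-- ignoreThis -->.*?<!-- \/ignoreThis -->//gs;
    return $text;
}

my $content = <<'EOF';
keep this
<!-- ignoreThis -->
drop this
<!-- /ignoreThis -->
Hey, where did this go?
<!-- ignoreThis -->
drop this, too
<!-- /ignoreThis -->
all done!
EOF

print strip_ignored($content);
```

With this input the "Hey, where did this go?" line should survive. This
only fixes the greedy match; it does nothing about links being extracted
before "filter_content" is called.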



-- 
Bill Moseley
moseley@hank.org
Received on Sat Jun 21 14:39:29 2003