On Fri, Jun 20, 2003 at 10:10:18AM -0700, Antun Karlovac wrote:
> > Now, I just need to figure out how to get it to ignore pages
> > that are linked that I have within there.
>
> That's easy - you tell the spider to ignore everything between the
> comments. That's what we did first, but then realized that the side
> effect of this was that pages there weren't indexed (which is what you
> want, but I don't).
>
> It was something like:
>   filter_content => sub {
>       my $content_ref = $_[3];
>       $$content_ref =~ s/<!-- ignoreThis -->.*<!-- \/ignoreThis -->//gs;
>       return 1;
>   },
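Don't do that! Besides the fact that links are extracted before
"filter_content" is called (so the filter does not stop the spider from
seeing them), remember that Perl regular expression quantifiers are
greedy by default: with /s, ".*" runs from the first opening comment to
the last closing comment and swallows everything in between. See
"perldoc perlre".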
moseley@bumby:~$ cat t.pl
my $content = <<EOF;
keep this
<!-- ignoreThis -->
drop this
<!-- /ignoreThis -->
Hey, where did this go?
<!-- ignoreThis -->
drop this, too
<!-- /ignoreThis -->
all done!
EOF
my $content_ref = \$content;
$$content_ref =~ s/<!-- ignoreThis -->.*<!-- \/ignoreThis -->//gs;
print $content;
moseley@bumby:~$ perl t.pl
keep this

all done!
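For comparison, here is a minimal sketch of the non-greedy version, using the same sample text as t.pl. The only change to the regex is ".*?" in place of ".*", so each match stops at the nearest closing comment. The helper name strip_ignored() is made up for this example and is not part of swish-e. This only fixes the greediness; it does nothing about links being extracted before "filter_content" runs.

```perl
use strict;
use warnings;

# Same sample input as t.pl.
my $content = <<'EOF';
keep this
<!-- ignoreThis -->
drop this
<!-- /ignoreThis -->
Hey, where did this go?
<!-- ignoreThis -->
drop this, too
<!-- /ignoreThis -->
all done!
EOF

# strip_ignored() is a hypothetical helper name used only for illustration.
# ".*?" is the non-greedy quantifier: it matches as little as possible,
# so each ignoreThis block is removed on its own.
sub strip_ignored {
    my ($text) = @_;
    $text =~ s/<!-- ignoreThis -->.*?<!-- \/ignoreThis -->//gs;
    return $text;
}

my $result = strip_ignored($content);
print $result;
```

With this change the "Hey, where did this go?" line between the two blocks survives, and both "drop this" sections are removed.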
--
Bill Moseley
moseley@hank.org
Received on Sat Jun 21 14:39:29 2003