Re: avoid indexing php code

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 20 2001 - 14:27:21 GMT
At 12:52 AM 12/20/2001 -0800, Klaus Hollenbach wrote:

>It seems that only parts of the php-script in a html-file are being
>indexed.
>
>content of test.php:
> <html>
> <head>
> <title>Titletext</title>
> </head>
> <body>
> Bodytext
> <?php
> do{
>         ...something...
>         echo ("<option>");
> }while( expression );
> ?>
> </body>
> </html>

Well, the basic problem is that you are trying to index something that's
not HTML, so understandably the HTML parsers (HTML2 included) get confused
by it.  HTML2 gives a warning but continues on, assuming that the next > is
the end of the tag.  HTML2 isn't in the Windows version, I guess.

[David, what's the status of getting libxml2 built into the Windows package?]

I would think that you would actually want to index the text that php
generates in your documents, and thus use php as a filter as David
suggested, or spider your web server.

I suppose you could write a filter that uses a regular expression to remove
everything between <? and ?>, or to place that content inside a comment:

Off the top of my head...

#!/usr/local/bin/perl -w
use strict;

# Slurp the whole document, then wrap each <? ... ?> block in an HTML comment.
my $doc = join '', <>;
$doc =~ s/<\?/<!-- php\n<?/g;
$doc =~ s/\?>/?>\n -->/g;
print $doc;

No, that only works with the HTML2 parser; the internal swish parser isn't
smart enough to find the real end of the comment.  Removing the PHP
entirely is safer:

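#!/usr/local/bin/perl -w
use strict;

# Slurp the document and replace each <? ... ?> block with a placeholder comment.
# /s lets . match newlines; .+? is non-greedy, so each block is matched separately.
my $doc = join '', <>;
$doc =~ s/<\?.+?\?>/<!-- php removed -->\n/gs;
print $doc;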

Then on Unix use:
  FileFilter .html ./filter.pl "'%p'"

On Windows you probably need:

  FileFilter .html ./filter.pl '"%p"'

or even 

  FileFilter .html "perl filter.pl" '"%p"'

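BUT, if you have a lot of docs, the filter will be a lot slower than
indexing with -S prog and DirTree.pl, since FileFilter runs the filter
program once for every file.  In that case just put the above regular
expression in DirTree.pl, before the content length is calculated, so the
length that gets reported matches the modified text.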
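
Here is a rough, untested sketch of what such a -S prog program might look
like.  It is not the real DirTree.pl, just a stand-in written from memory.
The names strip_php, format_record and output_file are made up for
illustration, and I'm assuming the usual prog-mode headers (Path-Name,
Content-Length, Last-Mtime) followed by a blank line and then the document.
The file-extension pattern is also only an assumption about which files you
want to index.

```perl
#!/usr/local/bin/perl -w
use strict;
use File::Find;

# Replace each <? ... ?> block with a placeholder comment
# (same regular expression as the filter above).
sub strip_php {
    my ($doc) = @_;
    $doc =~ s/<\?.+?\?>/<!-- php removed -->\n/gs;
    return $doc;
}

# Build one document record: headers, a blank line, then the content.
# The length is taken AFTER the PHP has been stripped.
sub format_record {
    my ($path, $mtime, $content) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $content;
}

# Read one file as raw bytes, strip the PHP, and print its record.
sub output_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or do { warn "Skipping $path: $!\n"; return };
    binmode $fh;
    local $/;                       # slurp mode
    my $content = <$fh>;
    close $fh;
    $content = '' unless defined $content;
    my $mtime = (stat $path)[9];
    print format_record($path, $mtime, strip_php($content));
}

# Avoid newline translation on output, which would throw off Content-Length.
binmode STDOUT;

# Directories to walk are passed as arguments; do nothing if there are none.
if (@ARGV) {
    find(
        {
            wanted   => sub { output_file($File::Find::name) if -f && /\.(?:html?|php)$/ },
            no_chdir => 1,
        },
        @ARGV
    );
}
```

If I remember the directives right, you would point IndexDir at this
program in the config file, pass the directories with SwishProgParameters,
and run swish-e with -S prog.  The point of doing it this way is that only
one perl process is started for the whole run, instead of one per file.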




Bill Moseley
mailto:moseley@hank.org
Received on Thu Dec 20 14:27:50 2001