At 12:52 AM 12/20/2001 -0800, Klaus Hollenbach wrote:
>It seems that only parts of the php-script in a html-file are being
>indexed.
>
>content of test.php:
><html>
><head>
><title>Titletext</title>
></head>
><body>
>Bodytext
><?php
>do{
>...something...
>echo ("<option>");
>}while( expression );
>?>
></body>
></html>
Well, the basic problem is that you are trying to index something that's
not HTML, so understandably the HTML parsers (the internal one and HTML2)
get confused by it. HTML2 gives a warning but continues on, and assumes the
> ends the tag. HTML2 isn't in the Windows version, I guess.
[David, what's the status of getting libxml2 built into the Windows package?]
I would think that you would actually want to index the text that php
generates in your documents, and thus use php as a filter as David
suggested, or spider your web server.
I suppose you could write a filter that uses a regular expression to remove
everything between <? and ?>, or to place the content in comments.
Off the top of my head...
#!/usr/local/bin/perl -w
use strict;
my $doc = join '', <>;
$doc =~ s/<\?/<!-- php\n<?/g;
$doc =~ s/\?>/?>\n -->/g;
print $doc;
No, that only works with the HTML2 parser; the internal swish parser isn't
smart enough to find the real end of the comment. Better to remove the PHP
block entirely:
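#!/usr/local/bin/perl -w
use strict;
# Slurp the whole document so a PHP block can span several lines.
my $doc = join '', <>;
# Non-greedy match; /s lets . match newlines, /g handles every block.
$doc =~ s/<\?.+?\?>/<!-- php removed -->\n/gs;
print $doc;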
Then on Unix use:
FileFilter .html ./filter.pl "'%p'"
On Windows you probably need to do:
FileFilter .html ./filter.pl '"%p"'
or even
FileFilter .html "perl filter.pl" '"%p"'
BUT, if you have a lot of docs, running the filter once per file will be a
lot slower to index than using -S prog with DirTree.pl and just placing the
above regular expression in the DirTree.pl program before you calculate the
content length.
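
Roughly what I mean, as an untested sketch. This is NOT the DirTree.pl that
ships with swish-e, just a minimal stand-in; strip_php and prog_record are
names I made up for illustration, and the header names are what I remember of
the -S prog format, so check them against the docs. The point is only the
ordering: filter first, then take the length.

```perl
#!/usr/local/bin/perl -w
use strict;
use File::Find;

# Sketch only: a stand-in for DirTree.pl, not the real program.

# Replace each <? ... ?> block with a short comment (same regex as above).
sub strip_php {
    my ($doc) = @_;
    $doc =~ s/<\?.+?\?>/<!-- php removed -->\n/gs;
    return $doc;
}

# Build one -S prog record: headers, a blank line, then the content.
sub prog_record {
    my ($path, $mtime, $doc) = @_;
    my $content = strip_php($doc);
    # Content-Length must describe the filtered text, so compute it
    # only after the PHP has been removed.
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $content;
}

# Avoid newline translation changing the byte count on Windows.
binmode STDOUT;

# Walk the directories named on the command line (none given: do nothing).
find({ no_chdir => 1, wanted => sub {
    return unless -f $_ && /\.(?:html|php)$/;
    open my $fh, '<', $_ or do { warn "skipping $_: $!\n"; return };
    local $/;                      # slurp the whole file
    my $doc = <$fh>;
    close $fh;
    print prog_record($_, (stat $_)[9], $doc);
}}, @ARGV) if @ARGV;
```

Since the filtering happens inside the one program that feeds swish, there is
no extra process started per document.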
Bill Moseley
mailto:moseley@hank.org
Received on Thu Dec 20 14:27:50 2001