
Re: avoid indexing php code

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Thu Dec 20 2001 - 18:51:04 GMT
On Thu, 2001-12-20 at 09:27, Bill Moseley wrote:
> Well, the basic problem is that you are trying to index something that's
> not HTML, so understandably the HTML parsers (and HTML2) get confused about
> that.  HTML2 gives a warning, but continues on, and assumes > is the end
> tag.  HTML2 isn't in the windows version, I guess.
> 
> [David, what's the status of getting libxml2 built into the windows package?]

I end up banging my head on the desk every time I look at it... :-(
Windows NT/2000 is no problem thus far.  I'm thinking perhaps, for now,
I might bundle a separate executable for NT-based systems.  I just need
to work out how to detect the OS properly using NSIS.

> I would think that you would actually want to index the text that php
> generates in your documents, and thus use php as a filter as David
> suggested, or spider your web server.

I think that's the only way to get a good PHP-free index.  The nature of
PHP pretty much ensures that an HTML parser is going to have problems.  An
XML parser might be able to partially sort it out, but you're still going
to miss content (generated by echo, print, printf, etc.).  And there are
bound to be plenty of HTML tag characters embedded within <? ?>, <% %>,
<script language=php> </script>, and such to cause trouble.
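
For what it's worth, here's the sort of filter I had in mind; just a
rough sketch, assuming a command-line "php" binary is on the PATH and
that its "-q" switch suppresses the HTTP headers (the script name and
details here are made up, so adjust to taste):

  #!/usr/bin/env python
  # Sketch of a pre-indexing filter: run a .php file through the PHP
  # binary so the indexer sees the generated HTML, not the raw source.
  # Raw source trips an HTML parser on things like
  #   <? if ($hits > 0) echo "found"; ?>
  # where the ">" inside the PHP block looks like the end of a tag.
  import subprocess
  import sys

  def render_php(path):
      """Return the HTML that PHP generates for the given file."""
      result = subprocess.run(["php", "-q", path],
                              capture_output=True, text=True, check=True)
      return result.stdout

  if __name__ == "__main__":
      # Usage: render_php.py page.php > page.html
      sys.stdout.write(render_php(sys.argv[1]))

Wire something like that in as the filter program the indexer runs on
each .php file (or just spider the live server) and the <? ?> blocks
never reach the parser at all.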

-- 
 David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Augury Net - http://augur.homeip.net/
  ICQ Universal Internet Number - 412039
  E-Mail - dave@webaugur.com

  "I once went to the store to buy a computer but
   the salesman tried to sell me windows instead..."
Received on Thu Dec 20 18:51:10 2001