RE: Filter, 2.0, ishtml

From: Rainer.Scherg
Date: Tue Jul 18 2000 - 07:53:54 GMT

As I remember, the subroutine countwords (index.c) should do all the
indexing of a file. This routine treats all input like HTML, so
text input is an "HTML file with no tags".

But the index process still has some design flaws.

E.g., there are still some "small" bugs:
    - Filesize is wrong (== 0) on filtered files.
    - No Title for filtered files (e.g.: PDF-Subject or Title Fields)
    - Checking for HTML files only by the extensions
       'html, shtml, htm' fails if, as we do, you are using
       Apache's MultiViews feature. In that case files are
       named e.g. foo.htm.en, etc.
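
To illustrate the last point, here is a rough sketch of an extension
check that also accepts MultiViews-style names such as foo.htm.en. It
is not the existing swish-e code; the function names
looks_like_html_name and component_is_html are made up for this
example, and it simply tests every dot-separated part of the file name
instead of only the last one.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from index.c or fs.c): returns 1 if the
 * len characters at s equal one of the known HTML extensions,
 * compared without regard to case. */
static int component_is_html(const char *s, size_t len)
{
    static const char *const html_ext[] = { "html", "shtml", "htm" };
    size_t i, j;

    for (i = 0; i < sizeof html_ext / sizeof html_ext[0]; i++) {
        if (strlen(html_ext[i]) != len)
            continue;
        for (j = 0; j < len; j++)
            if (tolower((unsigned char)s[j]) != html_ext[i][j])
                break;
        if (j == len)
            return 1;
    }
    return 0;
}

/* Hypothetical replacement for a "last extension only" test: returns 1
 * if ANY dot-separated component after the base name is an HTML
 * extension, so "foo.htm.en" is accepted as well as "foo.html". */
int looks_like_html_name(const char *path)
{
    const char *base = strrchr(path, '/');
    const char *dot;

    base = base ? base + 1 : path;      /* ignore directory components */
    dot = strchr(base, '.');
    while (dot != NULL) {
        const char *start = dot + 1;
        const char *next = strchr(start, '.');
        size_t len = next ? (size_t)(next - start) : strlen(start);

        if (component_is_html(start, len))
            return 1;
        dot = next;
    }
    return 0;
}
```

With this, looks_like_html_name("foo.htm.en") and
looks_like_html_name("foo.html") both return 1, while "foo.txt" returns
0. The price is that a name like "notes.html.bak" is also treated as
HTML, which may or may not be what you want.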

So there should be a config feature for MIME types,
and/or, better, a kind of "magic" feature.
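
A "magic" check could look roughly like the sketch below: look at the
first bytes of the (filtered) data instead of the file name. This is
only an assumption about how such a feature might work, not existing
swish-e code; the names doc_kind and sniff_doc_kind are invented here,
and the test is deliberately crude (a "%PDF-" header, or a leading
"<html" / "<!doctype html").

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical document kinds for this sketch. */
enum doc_kind { DOC_TEXT, DOC_HTML, DOC_PDF };

/* Returns 1 if buf (len bytes) starts with prefix, ignoring case. */
static int has_prefix_nocase(const unsigned char *buf, size_t len,
                             const char *prefix)
{
    size_t n = strlen(prefix), i;

    if (len < n)
        return 0;
    for (i = 0; i < n; i++)
        if (tolower(buf[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    return 1;
}

/* Guess the kind of document from its first bytes.
 * Anything not recognized is treated as plain text. */
enum doc_kind sniff_doc_kind(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    /* PDF files begin with the header "%PDF-". */
    if (len >= 5 && memcmp(buf, "%PDF-", 5) == 0)
        return DOC_PDF;

    /* Skip leading whitespace, then look for an HTML start. */
    while (i < len && isspace(buf[i]))
        i++;
    if (has_prefix_nocase(buf + i, len - i, "<!doctype html") ||
        has_prefix_nocase(buf + i, len - i, "<html"))
        return DOC_HTML;

    return DOC_TEXT;
}
```

A real implementation would need more patterns (HTML fragments without
an <html> tag, for instance), so this should be read as a starting
point only.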

But the only effect I have seen so far is:
 ... the HTML/PDF/DOC title tag is not retrieved...

Perhaps we should do a little redesign here, maybe in V2.1.

cu - rainer

-----Original Message-----
From: David Norris
Sent: Tuesday, July 18, 2000 6:10 AM
To: Multiple recipients of list
Subject: Filter, 2.0, ishtml

Is there some way to indicate that a filter returns HTML instead of
text?  If not, perhaps we should come up with some way to specify that a
particular file extension or filter returns HTML.  Hard coding the
(is)HTML file extensions into the binary just doesn't make sense to me.
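
One way to picture this is a per-filter flag instead of a hard-coded
extension list. The sketch below is purely hypothetical: the struct
filter_rule, the function find_filter, and the filter paths are
placeholders invented for illustration, not anything that exists in
the current source or config format.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical filter rule: which files it applies to, what to run,
 * and whether the filter's output should be parsed as HTML. */
struct filter_rule {
    const char *suffix;     /* file suffix, e.g. ".php3" */
    const char *command;    /* filter program (placeholder path) */
    int output_is_html;     /* nonzero: treat filter output as HTML */
};

/* In a real program this table would be built from the config file;
 * it is hard-coded here only to keep the sketch short. */
static const struct filter_rule rules[] = {
    { ".php3", "/path/to/php-filter", 1 },
    { ".pdf",  "/path/to/pdf-filter", 0 },
};

/* Returns the first rule whose suffix matches the end of path,
 * or NULL if no filter applies. */
const struct filter_rule *find_filter(const char *path)
{
    size_t plen = strlen(path), i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t slen = strlen(rules[i].suffix);

        if (plen >= slen &&
            strcmp(path + plen - slen, rules[i].suffix) == 0)
            return &rules[i];
    }
    return NULL;
}
```

The indexer could then ask find_filter(path) and check output_is_html
rather than looking at the file extension a second time.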

As an example, I wrote a filter to pass my PHP documents through the PHP
CGI so I don't have to use the Robot to index all of my meta data and
such.  I can do it in the filesystem mode with the filtering.  The
result is HTML, of course.  For the moment, I hacked up fs.c to treat
.php3 files as HTML.  (BTW, it works marvelously! :-)

In fact, I was thinking that it might be possible, perhaps with some
modifications, to combine the filtering and HTTP mode with WGet to
create an enormously more powerful robot than the simple PERL script

