Re: RE: Filter, 2.0, ishtml

From: David Norris <dave(at)not-real.webaugur.com>
Date: Tue Jul 18 2000 - 15:13:28 GMT
Rainer.Scherg@rexroth.de wrote:
> Of course, this could be done by the CGI script. But some features
> should be done by the index process (e.g. storing a short description of
> a document - a meta tag or the first xx words of the doc).

I agree.  SWISH-E should do that correctly.  My only reason for checking
in the CGI was to reflect changes which occurred after indexing, i.e.
when the file has changed since the site was last indexed.
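
The check itself is small; here is a hedged sketch of the idea (this is
not the actual SWS code, and the function and argument names are
invented), comparing a document's mtime against the time the index was
built:

/* Hypothetical sketch (not SWS code): decide at query time whether a
 * document has changed since the index was built.  Only stat(2) is
 * assumed; names are made up for illustration. */
#include <sys/stat.h>
#include <time.h>

static int changed_since_index(const char *docpath, time_t index_time)
{
    struct stat st;

    if (stat(docpath, &st) != 0)
        return 1;               /* missing/unreadable: treat as changed */
    return st.st_mtime > index_time;
}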

I've decided to rewrite my SWS PHP interface for SWISH-E 2.0.  I
"released" SWS 1.0 last night since it had been in beta for 8 months or
more without any real bug reports.  (Windows users still have trouble
with directory separators even with the examples in the config file.  It
is quite confusing, especially on Win9x.)

> This would only be a temporary solution,

It's certainly not a permanent solution.  But it does get past the
immediate problem.  It's a drastic improvement over assuming almost
everything is text.  It may be a bit more wasteful of CPU time.  With
Jose's modifications I can index my entire site twice, filtering every
file, faster than it updates my terminal :-)  I don't mind wasting a
little CPU time, anyway.

> because HTML stores/handles titles differently than e.g. PDF files
> (in what way should a filter script return the document title?)
> - has to be discussed...

I think it could be done several ways.  Either a SWISH-E XML DTD could
be created or non-HTML constructs could be converted to doc-properties
metadata.  I think the second option could be implemented with 2.0 as it
is now (by hacking up ishtml).  It may require special filter scripts,
though.
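
As a rough illustration of the second option: a filter could wrap
whatever text and title it extracts from a non-HTML document in a
minimal HTML shell, so the existing HTML title/meta handling indexes it
unchanged.  This is only a sketch of the idea, not an existing SWISH-E
filter; a real one would get the title and body from a converter such
as pdftotext or catdoc, and would escape &, < and >:

/* Sketch only: wrap extracted text in a minimal HTML shell so the
 * current HTML code path can pick up the title and description of a
 * non-HTML document. */
#include <stdio.h>

static void emit_as_html(FILE *out, const char *title, const char *body)
{
    fprintf(out, "<html><head><title>%s</title></head><body>\n", title);
    fprintf(out, "%s\n</body></html>\n", body);
}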

> Even if Apache has (IMO) a braindead way of handling content negotiation
> ("Error 406" - but that's another story...), more and more sites are using
> content negotiation.  Swish should be able to take care of this.

406...  I think I know where that's going ;-)  AltaVista, Snap,
Excite... I could go on :-(  Googlebot works ;-)
 
> A quick bugfix would be checking for ".html" at the end of a filename, etc.,
> and also for ".html.", etc. within a filename.  But as you described, this
> would not fix the PHP problem.

I thought of doing a "string contains" instead of a "string compare",
but that's really no better.  (I can only wonder why the HTM and htm
extensions are both listed in ishtml when a single case-insensitive
string compare would do.  And I really have to wonder why .html files
aren't considered HTML.)
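
Something along these lines is all it would take (a sketch, not a patch
against the real ishtml; strcasecmp is POSIX rather than ANSI C, and
Windows compilers spell it stricmp):

/* Sketch: one case-insensitive comparison instead of listing .htm,
 * .HTM, .html, ... separately. */
#include <string.h>
#include <strings.h>

static int has_html_extension(const char *filename)
{
    const char *ext = strrchr(filename, '.');

    if (ext == NULL)
        return 0;
    return strcasecmp(ext, ".html")  == 0 ||
           strcasecmp(ext, ".htm")   == 0 ||
           strcasecmp(ext, ".shtml") == 0;
}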

>   NoContents      .avi .mpeg .wav .some-junk    # only path will be
>   IndexContents   HTML  .html .htm .shtml   .htm.  .html. .shtml.   #index
>   IndexContents   XML   .xml
>   IndexContents   WAP   .wap
>   IndexContents   TXT   .txt .txt.
>   FileFilter      .doc  doc-filter.sh

I like this idea.  It has definite potential.
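
Internally, the directives above look like they would boil down to a
suffix-to-handler table.  A very rough sketch of that lookup follows;
the types and names are invented here for illustration, not SWISH-E
internals:

/* Rough sketch of the suffix -> handler lookup the proposed
 * IndexContents/FileFilter directives imply. */
#include <string.h>
#include <strings.h>
#include <stddef.h>

enum doc_type { DOC_HTML, DOC_XML, DOC_TXT, DOC_FILTER };

struct handler {
    const char   *suffix;
    enum doc_type type;
    const char   *filter_cmd;   /* only used when type == DOC_FILTER */
};

static const struct handler handlers[] = {
    { ".html", DOC_HTML,   NULL },
    { ".htm",  DOC_HTML,   NULL },
    { ".xml",  DOC_XML,    NULL },
    { ".txt",  DOC_TXT,    NULL },
    { ".doc",  DOC_FILTER, "doc-filter.sh" },
};

static const struct handler *lookup_handler(const char *filename)
{
    const char *ext = strrchr(filename, '.');
    size_t i;

    if (ext == NULL)
        return NULL;
    for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
        if (strcasecmp(ext, handlers[i].suffix) == 0)
            return &handlers[i];
    return NULL;                /* unknown suffix: NoContents-style fallback */
}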

> This would make "IndexOnly" obsolete and would result in a redesign of the
> index/parser engine... (would be a major change...). 

I think many of us would agree rewriting the indexer is not a bad idea
;-)  Definitely a bit daunting, but not a bad idea.  The HTML parser
needs to be replaced with a correct implementation (one of the Free HTML
parser libraries might be a good option).  I still see problems caused by
improper support for HTML.  The current HTML parser might have been fine
in 1994, but it fails quite often today.
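
For example (and this is just one candidate, not a claim about what the
project would actually use), libxml ships an error-tolerant HTML parser;
pulling the title out of a parsed document is a short tree walk:

/* Sketch: extracting a document title with libxml's HTML parser
 * instead of the hand-rolled parser.  One possible library among
 * several; error handling kept minimal.
 * Build roughly with:  cc title.c `xml2-config --cflags --libs` */
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

static xmlNodePtr find_element(xmlNodePtr node, const char *name)
{
    xmlNodePtr cur, hit;

    for (cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE &&
            xmlStrcasecmp(cur->name, (const xmlChar *) name) == 0)
            return cur;
        hit = find_element(cur->children, name);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    htmlDocPtr doc;
    xmlNodePtr title;
    xmlChar   *text;

    if (argc != 2)
        return 1;
    doc = htmlParseFile(argv[1], NULL);   /* tolerant of broken HTML */
    if (doc == NULL)
        return 1;
    title = find_element(xmlDocGetRootElement(doc), "title");
    if (title != NULL) {
        text = xmlNodeGetContent(title);
        if (text != NULL) {
            printf("title: %s\n", (char *) text);
            xmlFree(text);
        }
    }
    xmlFreeDoc(doc);
    return 0;
}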

>  (hey someone with a correct footer line ;-)

Few people even notice ;-)

-- 
,David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Dave's Weather - http://www.webaugur.com/dave/wx
  ICQ Universal Internet Number - 412039
  E-Mail - dave@webaugur.com
Received on Tue Jul 18 08:11:38 2000