Hi!
>> E.g.: there are still "small" bugs:
>> - Filesize is wrong (== 0) on filtered files.
>That could be a minor problem. I actually check the file size, date,
>etc in my results script.
Of course, this could be done by the CGI script. But some features
should be handled by the index process (e.g. storing a short description
of a document: a meta tag or the first xx words of the doc).
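
To make the "first xx words" idea concrete, here is a rough sketch in C.
The function name first_words and its signature are my own invention for
illustration, not anything that exists in Swish; it just copies the
leading whitespace-separated words of a document into a fixed buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Swish): copy at most max_words
 * whitespace-separated words from text into out, joined by single
 * spaces. Stops early if the next word would not fit. out is always
 * NUL-terminated when out_size > 0. Returns the number of words copied. */
size_t first_words(const char *text, size_t max_words,
                   char *out, size_t out_size)
{
    size_t words = 0, used = 0;

    if (out_size == 0)
        return 0;

    while (*text && words < max_words) {
        const char *start;
        size_t len;

        /* skip leading whitespace */
        while (*text && isspace((unsigned char)*text))
            text++;
        if (!*text)
            break;

        /* find the end of the current word */
        start = text;
        while (*text && !isspace((unsigned char)*text))
            text++;
        len = (size_t)(text - start);

        /* room needed: separator (if not first) + word + NUL */
        if (used + (words ? 1 : 0) + len + 1 > out_size)
            break;

        if (words)
            out[used++] = ' ';
        memcpy(out + used, start, len);
        used += len;
        words++;
    }

    out[used] = '\0';
    return words;
}
```

The indexer could call this once per document and store the result next
to the path and title; a meta description, if present, would presumably
take precedence.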
>> - No Title for filtered files (e.g.: PDF-Subject or Title Fields)
>>
>This seems to be because of the ishtml file extension check.
>How is this for a temporary hack? Make ishtml into a stub:
>
>int ishtml(char *filename)
>{
>    (void)filename;  /* temporary hack: treat every file as HTML */
>    return 1;
>}
>That handles the annoyance until we can fix it properly ;-) I don't
>know if we (collectively) would want that in 2.0, however, it may be
>good to list as a "known bug." I know some folks are anxious to get a
>2.0 release out (can't blame them :-)
This would only be a temporary solution, because HTML stores/handles titles
differently than e.g. PDF files (in what way should a filter script return
the document title?). This has to be discussed...
>> - Checking only for HTML file on the extension
>> 'html, shtml, htm' e.g. fails, if - as we do - you
>> are using apache multiviews features. In this case filenames
>> are named: foo.htm.de, foo.htm.en, foo.htm.es, etc.
>
>That's another problem I have. I use content negotiation everywhere.
>Listing dozens of file extensions is a problem. I really want it to
>index files of type text/*, image/png, application/pdf, and others with
>filters converting the content into HTML or text as needed. That's
>surely something for a future version.
Even if Apache's handling of content negotiation is (IMO) braindead
("Error 406", but that's another story...), more and more people are using
content negotiation. Swish should be able to take care of this.
A quick bugfix would be checking for ".html", etc. at the end of a filename
and also for ".html.", etc. within a filename. But as you described, this
would not fix the PHP problem.
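
As a rough sketch of that quick bugfix (the names looks_like_html and
has_ext_or_infix are made up for this mail, and the comparison is
case-sensitive, which the real check may not want):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name ends with ext, or contains
 * ext immediately followed by a dot (e.g. "foo.htm.de"), else 0. */
static int has_ext_or_infix(const char *name, const char *ext)
{
    size_t ext_len = strlen(ext);
    const char *p = name;

    while ((p = strstr(p, ext)) != NULL) {
        char next = p[ext_len];
        if (next == '\0' || next == '.')
            return 1;
        p++;  /* keep scanning after this occurrence */
    }
    return 0;
}

/* Sketch of an extension check that also accepts MultiViews-style
 * names such as foo.htm.de or foo.html.en. */
int looks_like_html(const char *filename)
{
    static const char *exts[] = { ".html", ".htm", ".shtml" };
    size_t i;

    for (i = 0; i < sizeof exts / sizeof exts[0]; i++)
        if (has_ext_or_infix(filename, exts[i]))
            return 1;
    return 0;
}
```

Note that this also accepts names like index.html.bak, and it still says
no to foo.php3, so it really is only a stopgap.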
IMO we need a Conf directive, like:
ContentType .php3$ HTML
ContentType .html$ HTML
ContentType .html. HTML
ContentType .txt$ TEXT
ContentType .pdf$ TEXT (returned by filter)
ContentType .xml XML
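
A minimal sketch of how such rules could be evaluated, assuming (my
reading, nothing agreed on) that a trailing "$" anchors the pattern at the
end of the filename, that a pattern without "$" matches anywhere, that "."
is a literal dot rather than a regex wildcard, and that the first matching
rule wins. All type and function names here are hypothetical:

```c
#include <stddef.h>
#include <string.h>

typedef enum { CT_UNKNOWN, CT_HTML, CT_TEXT, CT_XML } content_type;

struct ct_rule {
    const char  *pattern;  /* literal text, optional trailing '$' */
    content_type type;
};

/* The example directives from above, hard-coded for illustration;
 * a real implementation would fill this table from the config file. */
static const struct ct_rule rules[] = {
    { ".php3$", CT_HTML },
    { ".html$", CT_HTML },
    { ".html.", CT_HTML },
    { ".txt$",  CT_TEXT },
    { ".pdf$",  CT_TEXT },  /* text returned by a filter */
    { ".xml",   CT_XML  },
};

/* Returns 1 if name matches pattern under the assumptions above. */
static int rule_matches(const char *pattern, const char *name)
{
    size_t plen = strlen(pattern);
    size_t nlen = strlen(name);

    if (plen > 0 && pattern[plen - 1] == '$') {
        plen--;  /* drop the anchor, then compare against the tail */
        return plen > 0 && nlen >= plen &&
               memcmp(name + nlen - plen, pattern, plen) == 0;
    }
    return strstr(name, pattern) != NULL;
}

/* First matching rule wins; unmatched files get CT_UNKNOWN. */
content_type content_type_for(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rule_matches(rules[i].pattern, name))
            return rules[i].type;
    return CT_UNKNOWN;
}
```

With this table, index.php3 and foo.html.de both come out as HTML, while
movie.avi stays unknown and could be skipped or stored by path only.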
The reverse style of config would also be possible (maybe better):
NoContents .avi .mpeg .wav .some-junk # only path will be stored...
IndexContents HTML .html .htm .shtml .htm. .html. .shtml. # index as HTML
IndexContents XML .xml
IndexContents WAP .wap
IndexContents TXT .txt .txt.
IndexContents TXT .pdf .doc .dot .xls # (filters are returning TXT)
FileFilter .doc doc-filter.sh
FileFilter .dot doc-filter.sh
FileFilter .pdf pdf-filter.sh
FileFilter .xls xls-filter.sh
This would make "IndexOnly" obsolete and would require a redesign of the
index/parser engine (a major change). But if this is done in a modular
design, new parser engines could be installed in the future. It would then
be easy to decide whether to add a new parser engine (e.g. for WAP files)
or to handle the format via external filters.
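
To illustrate what I mean by modular, here is a small sketch with a table
of parser engines looked up by the name used in the config. Everything in
it (struct parser_engine, find_engine, the two toy parsers) is
hypothetical; the parsers only count words as a stand-in for real
indexing.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A parser engine takes document text and returns the number of words
 * it would hand to the indexer (placeholder for real indexing work). */
typedef size_t (*parse_fn)(const char *text);

struct parser_engine {
    const char *name;   /* name used in the config, e.g. "HTML" */
    parse_fn    parse;
};

/* Plain text: count whitespace-separated words. */
static size_t parse_txt(const char *text)
{
    size_t words = 0;
    int in_word = 0;

    for (; *text; text++) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* HTML: same, but ignore everything between '<' and '>'. */
static size_t parse_html(const char *text)
{
    size_t words = 0;
    int in_word = 0, in_tag = 0;

    for (; *text; text++) {
        if (*text == '<') {
            in_tag = 1;
            in_word = 0;
        } else if (*text == '>') {
            in_tag = 0;
            in_word = 0;
        } else if (in_tag || isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* Adding a new engine (e.g. for WAP) means adding one line here. */
static const struct parser_engine engines[] = {
    { "HTML", parse_html },
    { "TXT",  parse_txt  },
};

/* Look up an engine by config name; NULL if none is installed. */
const struct parser_engine *find_engine(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof engines / sizeof engines[0]; i++)
        if (strcmp(engines[i].name, name) == 0)
            return &engines[i];
    return NULL;
}
```

A format with no engine of its own (find_engine returns NULL) would then
go through an external filter that emits TXT or HTML, which is the
trade-off described above.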
(just some thoughts)
cu Rainer
(hey someone with a correct footer line ;-)
--
,David Norris
Dave's Web - http://www.webaugur.com/dave/
Dave's Weather - http://www.webaugur.com/dave/wx
ICQ Universal Internet Number - 412039
E-Mail - dave@webaugur.com
Received on Tue Jul 18 09:48:57 2000