RE: title and non html

From: <Rainer.Scherg(at)>
Date: Tue Nov 21 2000 - 12:49:09 GMT
Hi Jose!

You have implemented the "read_stream" routine.

We could use this feature to "scan" the content and
include a content type "MAGIC" (which should be the default)
in the config.

MAGIC could decide, based on the content itself, which type of doc
has to be indexed...
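
A rough sketch of what such a MAGIC check might look like on the first
bytes returned by a stream read. The names doc_type_t and sniff_doc_type
are made up for illustration and are not taken from the swish-e sources,
and the rules (leading "<?xml", "<html" or "<!doctype html", no NUL bytes
for plain text) are only assumptions about what could work:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical document types; not taken from the swish-e sources. */
typedef enum { DOC_UNKNOWN, DOC_TXT, DOC_HTML, DOC_XML } doc_type_t;

/* Case-insensitive test whether buf[0..len) starts with prefix. */
static int has_prefix_ci(const char *buf, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    size_t i;

    if (len < n)
        return 0;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)buf[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    }
    return 1;
}

/* Guess the document type from the first len bytes of the content. */
doc_type_t sniff_doc_type(const char *buf, size_t len)
{
    size_t i = 0;

    /* Skip leading whitespace before looking for markup. */
    while (i < len && isspace((unsigned char)buf[i]))
        i++;
    if (i == len)
        return DOC_UNKNOWN;

    if (has_prefix_ci(buf + i, len - i, "<?xml"))
        return DOC_XML;
    if (has_prefix_ci(buf + i, len - i, "<!doctype html") ||
        has_prefix_ci(buf + i, len - i, "<html"))
        return DOC_HTML;

    /* Assumption: data without NUL bytes is treated as plain text. */
    if (memchr(buf, '\0', len) == NULL)
        return DOC_TXT;
    return DOC_UNKNOWN;
}
```

A config entry for MAGIC could then fall back to a check of this kind
whenever no extension rule matches.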

On HTTP, we could parse the response header to determine the
content type...

> - Looking now at function http_indexpath (http.c): if we have set the
> document type based on its extension (e.g. IndexContents HTML
> .htm .html), it makes no sense to check for text ("text/") in the http
> header and, as you state, we are using an html-like approach to get
> the words instead of a plain txt one. At least we have to check for
> "text/html" or "text/plain". Anyway, I think that this piece
> of code can be removed.

Mhh, I removed the "text/" check because IMO it makes no sense:
e.g. PDF will be sent as "application/...".
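
To illustrate why a plain "text/" prefix test is too coarse, here is a
small sketch that maps the value of a Content-Type header to a document
type. The function name doc_type_for_mime, the table and the type names
on the right-hand side are assumptions for illustration only, not code
from http.c:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping table; MIME types are stored in lower case. */
static const struct {
    const char *mime;
    const char *doc_type;
} mime_map[] = {
    { "text/html",       "HTML" },
    { "text/plain",      "TXT"  },
    { "text/xml",        "XML"  },
    { "application/xml", "XML"  },
    { "application/pdf", "PDF"  },
};

/*
 * Return the document type for a Content-Type header value, or NULL if
 * the media type is not in the table. Parameters such as "; charset=..."
 * are ignored and the comparison is case-insensitive.
 */
const char *doc_type_for_mime(const char *value)
{
    size_t len, i, j;

    /* Skip optional whitespace after the header name's colon. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* The media type ends at the first ';', whitespace or line end. */
    len = strcspn(value, "; \t\r\n");

    for (i = 0; i < sizeof mime_map / sizeof mime_map[0]; i++) {
        const char *m = mime_map[i].mime;

        if (strlen(m) != len)
            continue;
        for (j = 0; j < len; j++) {
            if (tolower((unsigned char)value[j]) != m[j])
                break;
        }
        if (j == len)
            return mime_map[i].doc_type;
    }
    return NULL;
}
```

With a table like this, "application/pdf" can be routed to a filter
while an unmapped type such as "image/png" is simply skipped.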

> - do_index_file (index.c) does not need title if we get it inside the 
> countwords_XXX routine. Making things this way, probably 
> DOCENTRY does not need title and this structure can be removed.
> - xml.c, txt.c. Totally agree. They need some code for 
> title/summary/description and the title parameter in the call to the 
> routine is not needed.

> IMO summary/description means "title" for html documents. Other
> documents can have their own summary. So, any reference to title 
> should be removed outside the countwords_HTML routine.

I don't think so. IMO the definition could look as follows:

 HTML:
   title       = <Title>-Tag  (or path, see below...)
   Description = <META  http-equiv="Description"> | first xx chars of <BODY>

 TXT:
   title       = empty
   Description = first xx chars of file

 XML:
   similar to HTML (has to be defined)
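
For the "first xx chars" fallback, something along these lines might do.
This is only a sketch: make_description is a made-up name, "xx" becomes
the max_chars parameter, and collapsing whitespace runs into one space is
my own assumption about what a readable description needs:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Copy at most max_chars characters of text into out, skipping leading
 * whitespace and collapsing each whitespace run into a single space.
 * out must have room for max_chars + 1 bytes. Returns the number of
 * characters written (not counting the terminating NUL).
 */
size_t make_description(const char *text, char *out, size_t max_chars)
{
    size_t n = 0;
    int pending_space = 0;

    for (; *text != '\0' && n < max_chars; text++) {
        if (isspace((unsigned char)*text)) {
            /* Remember the gap, but never start the output with it. */
            pending_space = (n > 0);
            continue;
        }
        if (pending_space) {
            /* Stop if the space plus the next character do not fit. */
            if (n + 2 > max_chars)
                break;
            out[n++] = ' ';
            pending_space = 0;
        }
        out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}
```

For HTML the same routine could be fed the text of <BODY> when there is
no description META tag.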

IMO we should store an empty title field if there is no title
(which means: don't store the filepath twice). This will save space
in the database.

On retrieval, an empty title field should be returned as
"real_path" (URL, or filepath).

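A minimal sketch of the store/retrieve idea, assuming a record with a
path and a title field. The struct doc_entry and both function names are
hypothetical and only meant to show the fallback, they are not the
existing DOCENTRY code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical index record; the field names are illustrative only. */
struct doc_entry {
    const char *real_path;  /* URL or file path, always stored */
    const char *title;      /* "" when the document has no title */
};

/* Indexing side: never store the path a second time as the title. */
const char *title_to_store(const char *title, const char *real_path)
{
    if (title == NULL || strcmp(title, real_path) == 0)
        return "";
    return title;
}

/* Retrieval side: an empty title is reported as the real path. */
const char *display_title(const struct doc_entry *e)
{
    if (e->title == NULL || e->title[0] == '\0')
        return e->real_path;
    return e->title;
}
```

This way the path is stored once per document, and the search output
still shows something useful for files without a title.
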
cu - rainer

Received on Tue Nov 21 12:51:09 2000