Hi Jose!
BTW:
I forgot to say: I made some remarks in the source marked with
"$$" (e.g. /* $$ ... */).
Please read these comments. IMO these things have to be discussed.
The comments can be removed when the problem is solved...
You can find them by searching for "$$", e.g. with grep.
See for example the remarks concerning the routines isoktitle() and isokhtml().
> Hi Rainer,
>
> On 19 Nov 2000, at 18:45, Rainer.Scherg@rexroth.de wrote:
>
>
> > But here are some things which are still to be discussed & to be
> > done...
> >
> >
> > Swish-e:
> >
> > ToDo and some questions for my understanding:
> >
> > - there is a "title" passed from "outside" to the index routines...
> > this seems to be for historic reasons, when swish did only HTML.
> > mostly the "title" contains the filepath.
> >
> > At this point, I let this still be untouched...
> > but we should get rid of these relics.
> >
> > the title should be retrieved within the indexing routine for a
> > doctype. (XML may be different to HTML or other types...)
> >
> >
>
> IMO, we can use this field to store the summary of the document.
Same point. The "Description" (IMO first words of a document
- not a summary or abstract) can only be retrieved by the
indexing routine for a document itself.
So: IMO we don't need this.
"Description" storage can be done as follows (standard way):
index_string is saving (somehow) n bytes of the first words of a
document. This string can be stored as a description...
For each document type there may be also an alternative method (e.g.
HTML Meta-Tag "Description" can override this string).
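
To make the idea concrete, here is a minimal sketch of such a default description. It is not existing swish-e code: the name make_description() and its parameters are my assumptions. It copies the first bytes of the document text, cuts at a space so that no word is split, and lets a type-specific value (e.g. the HTML meta "Description") override the default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of swish-e: build a description from the
 * first bytes of a document. meta_desc is an optional type-specific
 * override (e.g. HTML meta "Description"); pass NULL if there is none.
 * Returns the length of the string stored in dst. */
size_t make_description(const char *text, const char *meta_desc,
                        char *dst, size_t dst_size)
{
    const char *src;
    size_t len;

    assert(dst != NULL && dst_size > 0);

    /* A type-specific value wins over the generic "first words". */
    src = (meta_desc != NULL && meta_desc[0] != '\0') ? meta_desc : text;
    if (src == NULL)
        src = "";

    len = strlen(src);
    if (len >= dst_size) {
        /* Too long: back up to the previous space so no word is cut.
         * If there is no space at all, fall back to a hard cut. */
        size_t cut = dst_size - 1;
        while (cut > 0 && src[cut] != ' ')
            cut--;
        len = (cut > 0) ? cut : dst_size - 1;
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a 12 byte buffer, the text "hello world again" would be stored as "hello world". An indexing routine for HTML could pass the meta tag content as the second argument, a TXT routine would simply pass NULL.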
>
> > - "indextitleonly" (now: fprop->index_no_content) is not honoured in
> > each index-routine (only the original one: countwords).
> > Should be done.
> >
>
> Which is the title for non-HTML docs? Perhaps for non-HTML docs this
> field should be interpreted as "indexsummaryonly"
The "indextitleonly" variable was triggered by the "NoContents" config
directive.
So IMO the name for the variable was wrong in the first place.
Therefore I renamed it in the FileProp structure to fprop->index_no_content.
The docs read as follows:
NoContents *.suffix1 .suffix2 .suffix3 ...*
This variable lets you control what files will have their
contents indexed. If a file with a suffix in this list is
indexed, only its file name (and not any words in the file)
will be indexed. This is useful because normally swish-e
will try to index the contents of every file, even files
without words (such as images or movies). Suffix checking is
case-insensitive.
We can enhance this by saving the "doc title" instead of the filename.
But this has to be decided by the indexing routine for a document
(e.g. not possible for a TXT doc).
But the point is, e.g. countwords_txt doesn't check this variable at the moment
and ignores the "NoContents" setting.
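
Roughly, the check I have in mind for each countwords_xxx routine looks like the sketch below. This is not the real code: struct fileprop_sketch, its path field, count_words() and countwords_txt_sketch() are stand-ins I made up for illustration; only the flag name index_no_content comes from the source.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for FileProp; only index_no_content is a real name. */
struct fileprop_sketch {
    const char *path;        /* assumed: path of the file being indexed */
    int index_no_content;    /* non-zero when a NoContents suffix matched */
};

/* Hypothetical stand-in for the real word indexing: counts runs of
 * alphanumeric characters in s. */
static int count_words(const char *s)
{
    int n = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        if (isalnum((unsigned char)*s)) {
            if (!in_word) {
                n++;
                in_word = 1;
            }
        } else {
            in_word = 0;
        }
    }
    return n;
}

/* Sketch of a text indexing routine that honours NoContents:
 * with the flag set only the file name is indexed, otherwise the content. */
int countwords_txt_sketch(const struct fileprop_sketch *fprop,
                          const char *buffer)
{
    assert(fprop != NULL && fprop->path != NULL && buffer != NULL);

    if (fprop->index_no_content)
        return count_words(fprop->path);   /* file name only */

    return count_words(buffer);            /* normal case: the content */
}
```

The same few lines would have to go into every doctype routine, or better into one common place before the routine is dispatched.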
> >
> > - in routine "indexafile": DOCENTRY *e only contains the filename...
> > (and the misplaced "title")
> > What do we need this structure for?
> >
>
> You are right, it is nonsense if title is equal to filepath. I need to
> look at the code because there are other functions affected by
> DOCENTRY like indexadir and addsortentry.
Yep, right.
indexadir has the same problem.
> I do not remember at this moment if there can be a situation with
> different filepath and title.
Yes there is (historically)...
When an HTML doc is indexed, the title var could contain the content of the
<Title> tag. But this was done on the assumption that each doc is an HTML doc.
IMO this behavior has to be placed into countword_xxx.
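
As a rough sketch of what "placed into the doctype routine" could mean: each type decides on its own title, and falls back to the file path when it has none. The function names title_for_html(), title_for_txt() and find_nocase() are assumptions of mine, not existing swish-e routines, and the HTML part only handles a plain <title> tag without attributes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive substring search, written out because strcasestr
 * is not ISO C. Returns a pointer to the match or NULL. */
static const char *find_nocase(const char *hay, const char *needle)
{
    size_t n = strlen(needle);

    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return hay;
    }
    return NULL;
}

/* Copy at most dst_size - 1 bytes of src into dst and terminate it. */
static void copy_bounded(char *dst, size_t dst_size,
                         const char *src, size_t len)
{
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Hypothetical HTML handler: use the <title> text, else the file path. */
void title_for_html(const char *path, const char *buffer,
                    char *dst, size_t dst_size)
{
    const char *start;
    const char *end;

    assert(path != NULL && buffer != NULL && dst != NULL && dst_size > 0);

    start = find_nocase(buffer, "<title>");
    if (start != NULL) {
        start += strlen("<title>");
        end = find_nocase(start, "</title>");
        if (end != NULL && end > start) {
            copy_bounded(dst, dst_size, start, (size_t)(end - start));
            return;
        }
    }
    copy_bounded(dst, dst_size, path, strlen(path));
}

/* Hypothetical TXT handler: a text file has no title, so use the path. */
void title_for_txt(const char *path, char *dst, size_t dst_size)
{
    assert(path != NULL && dst != NULL && dst_size > 0);
    copy_bounded(dst, dst_size, path, strlen(path));
}
```

This way indexafile and indexadir would not need to pass a "title" from outside at all.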
cu - rainer
Received on Mon Nov 20 11:07:04 2000