Re: [swish-e] How can I adjust the META names before an HTML document is indexed?

From: Peter Karman <peter(at)>
Date: Mon Sep 10 2007 - 13:20:37 GMT
On 09/10/2007 12:03 AM, wrote:
> On 8 Sep 2007 at 19:36, Peter Karman wrote:
>> If you're using the or with -S prog, then yes, you
>> can filter the content with a regex and output additional <meta> tags 
>> with the content.
> I'm planning to do a -prog thing that would do its own xml-parsing 
> and pass just plain text for swish to index. Is it possible to 
> produce meta-fields in this scenario? The text would not have any 
> tags.. no "<" or ">" .. well, of course I could write them, but seems 
> like a waste to have swish parse it for xml a second time,
> Something like outputting:
> Path-Name: MYPATH
> Content-Lines: NUMBER_OF_LINES
> Last-Mtime: $mtime
> Document-Type: TEXT
> Meta: Subject=MYSUBJECT

If you want to add meta information, you must parse documents either as HTML or
XML. So you'd need to do something like:
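
A minimal sketch of that idea in Python, not the example originally posted. It assumes the config defines `MetaNames subject`, that `Document-Type: XML*` selects the XML parser in your build, and that the element names `doc`, `subject` and `body` (all made up here) suit your index. The function `build_record` is hypothetical.

```python
import sys
from xml.sax.saxutils import escape


def build_record(path, mtime, subject, text):
    """Return one -S prog record (headers, blank line, XML body) as bytes."""
    # Wrap the plain text in minimal XML so the parser can see the meta name.
    # Escaping keeps any "<", ">" or "&" in the text from being read as markup.
    body = (
        "<doc>"
        f"<subject>{escape(subject)}</subject>"
        f"<body>{escape(text)}</body>"
        "</doc>\n"
    ).encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        # Content-Length counts the bytes of the body, not characters or lines.
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        # Assumption: "XML*" is accepted as a parser name by your swish-e build.
        "Document-Type: XML*\n"
        "\n"
    ).encode("utf-8")
    return headers + body


if __name__ == "__main__":
    record = build_record("MYPATH", 1189430437, "MYSUBJECT", "plain text & more")
    # Write bytes directly so no newline translation changes the byte count.
    sys.stdout.buffer.write(record)
```

The XML wrapper is small, so the second parse costs little, and escaping the text means the extracted content needs no other changes.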


It's necessary for the content to be XML or HTML -- swish-e has no other way of
parsing MetaNames or PropertyNames.

> (I wishfully changed the Content-Length header to Content-Lines,
> as calculating the number of bytes swish thinks the file contains can be a
> bit tedious if I have lines ending in CRLF, and others with just CR or LF.
> Number of lines would be much easier. Also for swish, I think, if it reads
> the input line by line. But this is not so important.)
>  .Timo

Number of lines is something swish-e knows nothing about -- it just reads N
bytes into a buffer, parses them, and then reads another N bytes.
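
Because only the byte count matters, mixed line endings need no special handling: encode the text once and measure the result. A small sketch, where the helper name `content_length` and the UTF-8 default are assumptions:

```python
def content_length(text, encoding="utf-8"):
    """Return the number of bytes the encoded text occupies."""
    return len(text.encode(encoding))


# CRLF, bare CR and bare LF are each counted as the bytes they really are.
mixed = "one\r\ntwo\rthree\n"
print(content_length(mixed))  # prints 15
```

The count stays correct only if the same bytes are written unchanged, for example through a binary stream such as `sys.stdout.buffer`, so that no newline translation happens on output.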

Peter Karman

Received on Mon Sep 10 09:20:38 2007