Re: [swish-e] How can I adjust the META names before an HTML document is indexed?

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Sep 10 2007 - 13:20:37 GMT

On 09/10/2007 12:03 AM, harmo@valt.helsinki.fi wrote:
> On 8 Sep 2007 at 19:36, Peter Karman wrote:
>> If you're using the spider.pl or DirTree.pl with -S prog, then yes, you
>> can filter the content with a regex and output additional <meta> tags 
>> with the content.
> 
> I'm planning to do a -prog thing that would do its own xml-parsing 
> and pass just plain text for swish to index. Is it possible to 
> produce meta-fields in this scenario? The text would not have any 
> tags.. no "<" or ">" .. well, of course I could write them, but seems 
> like a waste to have swish parse it for xml a second time,
> 
> Something like outputting:
> Path-Name: MYPATH
> Content-Lines: NUMBER_OF_LINES
> Last-Mtime: $mtime
> Document-Type: TEXT
> Meta: Subject=MYSUBJECT
> Meta: AUTHOR=MYAUTHOR
> 
> DOCUMENT-CONTENT-TEXT
> 


If you want to add meta information, you must parse documents either as HTML or
XML. So you'd need to do something like:

<doc>
<subject>MYSUBJECT</subject>
<author>MYAUTHOR</author>
<text>
 DOCUMENT_CONTENT_TEXT_HERE
</text>
</doc>


It's necessary for the content to be XML or HTML -- swish-e has no other way of
parsing MetaNames or PropertyNames.
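
As a rough illustration of the above, here is a minimal Python sketch of what a -S prog program could emit for one document: the usual headers, a blank line, then the XML wrapper. The function name build_record and its parameters are made up for this example. The header names follow the ones quoted earlier in this thread, and the "XML*" Document-Type value is an assumption that should be checked against the SWISH-RUN documentation for your version. The tag names would presumably also need to be listed under MetaNames in the swish-e config.

```python
import time
from xml.sax.saxutils import escape


def build_record(path, subject, author, text, mtime=None):
    """Return one document record as bytes: headers, blank line, XML body.

    Hypothetical helper for illustration; not part of swish-e.
    """
    # Escape &, < and > so the plain text cannot break the XML wrapper.
    body = (
        "<doc>\n"
        f"<subject>{escape(subject)}</subject>\n"
        f"<author>{escape(author)}</author>\n"
        f"<text>\n{escape(text)}\n</text>\n"
        "</doc>\n"
    ).encode("utf-8")

    if mtime is None:
        mtime = int(time.time())

    # Content-Length is the size of the encoded body in bytes.
    # "XML*" is assumed here; verify the valid Document-Type values.
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {mtime}\n"
        "Document-Type: XML*\n"
        "\n"
    ).encode("utf-8")

    return headers + body
```

The program would then write the returned bytes to standard output in binary mode (for example with sys.stdout.buffer.write), once per document.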

> 
> 
> (I changed the content-length -header wishfully to content-lines,
> as calculating the number of bytes swish thinks the file contains can be a
> bit tedious if I have lines ending in crlf, and others with just cr or lf..
> number of lines would be much easier. Also for swish, I think, if it reads
> the input line-by-line. But this is not so important)
>  .Timo
> _______________________________________________

Number of lines is something swish-e knows nothing about -- it just reads N
bytes into a buffer, parses them, and then reads another N bytes.
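
Since only the byte count matters, mixed line endings need no special handling if the body is encoded once and the resulting buffer is measured. A small Python sketch, where encode_body is a hypothetical helper and UTF-8 is an assumed encoding:

```python
def encode_body(text, encoding="utf-8"):
    """Encode the document body once and return (data, length_in_bytes).

    The length is taken from the encoded bytes, so CRLF, CR and LF line
    endings are simply counted as the bytes they occupy.
    """
    data = text.encode(encoding)
    return data, len(data)
```

The same bytes that were measured should be the ones written out, in binary mode, so that no newline translation changes the size after it was counted.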


-- 
Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/

Received on Mon Sep 10 09:20:38 2007