On 09/10/2007 12:03 AM, harmo@valt.helsinki.fi wrote:
> On 8 Sep 2007 at 19:36, Peter Karman wrote:
>> If you're using the spider.pl or DirTree.pl with -S prog, then yes, you
>> can filter the content with a regex and output additional <meta> tags
>> with the content.
>
> I'm planning to write a -S prog program that does its own XML parsing
> and passes just plain text for swish to index. Is it possible to
> produce meta-fields in this scenario? The text would not have any
> tags (no "<" or ">"). Of course I could write them, but it seems
> like a waste to have swish parse it for XML a second time.
>
> Something like outputting:
> Path-Name: MYPATH
> Content-Lines: NUMBER_OF_LINES
> Last-Mtime: $mtime
> Document-Type: TEXT
> Meta: Subject=MYSUBJECT
> Meta: AUTHOR=MYAUTHOR
>
> DOCUMENT-CONTENT-TEXT
>
If you want to add meta information, you must parse documents either as HTML or
XML. So you'd need to do something like:
<doc>
<subject>MYSUBJECT</subject>
<author>MYAUTHOR</author>
<text>
DOCUMENT_CONTENT_TEXT_HERE
</text>
</doc>
It's necessary for the content to be XML or HTML -- swish-e has no other way of
parsing MetaNames or PropertyNames.
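
A minimal sketch of that wrapping step in Python, assuming your -S prog
program already has the subject, author and plain text in hand. The function
name build_doc is hypothetical, and the element names only become meta fields
if your config lists them (for example under MetaNames). Since the text goes
through an XML parser, any "&", "<" or ">" in it has to be escaped:

```python
from xml.sax.saxutils import escape


def build_doc(subject, author, text):
    """Wrap plain text in the <doc> layout shown above.

    escape() replaces &, < and > so that stray characters in the
    plain text cannot be mistaken for markup by the XML parser.
    """
    return (
        "<doc>\n"
        f"<subject>{escape(subject)}</subject>\n"
        f"<author>{escape(author)}</author>\n"
        f"<text>\n{escape(text)}\n</text>\n"
        "</doc>\n"
    )


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(build_doc("MYSUBJECT", "MYAUTHOR", "a < b & c"), end="")
```
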
>
>
> (I wishfully changed the Content-Length header to Content-Lines,
> as calculating the number of bytes swish thinks the file contains can be a
> bit tedious if I have lines ending in CRLF, and others with just CR or LF.
> The number of lines would be much easier, also for swish, I think, if it reads
> the input line by line. But this is not so important.)
> .Timo
Number of lines is something swish-e knows nothing about -- it just reads N
bytes into a buffer, parses them, and then reads another N bytes.
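
Getting the byte count does not have to be tedious if the program encodes the
content once and measures the encoded bytes, whatever mix of line endings it
contains. A rough sketch in Python, under a few assumptions: format_record is
a hypothetical helper, the header names are the ones discussed in this thread,
and the "XML*" Document-Type value is a guess that should be checked against
your swish-e version and config:

```python
import sys


def format_record(path, mtime, body, doc_type="XML*"):
    """Return one record as bytes: headers, a blank line, then the content.

    doc_type defaults to "XML*", which is an assumption; adjust it to
    whatever parser name your swish-e setup expects.
    """
    # Encode first, then count: Content-Length is a number of bytes,
    # not characters or lines, so CRLF counts as 2 and a lone CR or LF as 1.
    payload = body.encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(payload)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    )
    return headers.encode("utf-8") + payload


if __name__ == "__main__":
    # Hypothetical sample document with mixed line endings.
    sample = "<doc><text>one\r\ntwo\rthree\n</text></doc>\n"
    # Write through the binary buffer so the platform does not translate
    # newlines and change the byte count after it was measured.
    sys.stdout.buffer.write(format_record("MYPATH", 1189382400, sample))
```
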
--
Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Mon Sep 10 09:20:38 2007