On 9/7/07 4:30 PM, Ben Ostrowsky wrote:
> I'd like to glean metadata from the documents I'm indexing. The
> documents have a predictable format:
>
> ...
> <BODY BGCOLOR="#ffffff">
> <H1>[list-name] title of message</H1>
> <B>name of message author</B>
> <A HREF="..."
> TITLE="...">username at email.host
> </A><BR>
> ...
>
> I'd like to be able to search these documents with "swish-e -w
> authorname=foo" or "swish-e -w authoremail=bar".
>
> At what point during the process of indexing would it be possible to
> manipulate things so that I can do this? Can I, for example, add a
> directive somewhere saying:
>
> @metanames{qw( msgtitle authorname )}
> =~ /<H1>[list-name] (.*)</H1>\w+<B>(.*)</B>/g;
>
> or something like that?
>
If you're using spider.pl or DirTree.pl with -S prog, then yes: you can
filter the content with a regex and output additional <meta> tags
along with the content.
See the filter_content callback in spider.pl and (IIRC) there's
something similar in DirTree.pl.
See also SWISH::Prog on CPAN for building your own -S prog programs.
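
As a rough, untested-against-your-archive sketch of the idea: a -S prog program only has to print each document to stdout, preceded by Path-Name and Content-Length headers and a blank line. Any language will do, so this one is in Python rather than Perl. The regexes are guesses based on the layout you quoted, and the function names and the metaname msgtitle are made up for illustration; they are not part of swish-e.

```python
import html
import re
import sys

# Patterns are assumptions based on the message layout quoted above.
TITLE_RE = re.compile(r"<H1>\s*(?:\[[^\]]*\]\s*)?(.*?)\s*</H1>", re.I | re.S)
AUTHOR_RE = re.compile(
    r"<B>\s*(.*?)\s*</B>\s*<A\b[^>]*>\s*(.*?)\s*</A>", re.I | re.S
)
HEAD_END_RE = re.compile(r"</head\s*>", re.I)


def extract_meta(doc):
    """Return a dict of metaname -> value found in one archived message."""
    meta = {}
    m = TITLE_RE.search(doc)
    if m:
        meta["msgtitle"] = m.group(1)
    m = AUTHOR_RE.search(doc)
    if m:
        meta["authorname"] = m.group(1)
        meta["authoremail"] = m.group(2)
    return meta


def add_meta_tags(doc):
    """Insert one <meta> tag per extracted value; leave the doc alone if none."""
    tags = "".join(
        '<meta name="%s" content="%s">\n' % (name, html.escape(value, quote=True))
        for name, value in extract_meta(doc).items()
    )
    if not tags:
        return doc
    m = HEAD_END_RE.search(doc)
    if m:
        # Put the tags just before the existing </head>.
        return doc[: m.start()] + tags + doc[m.start():]
    # No head section found: prepend a small one.
    return "<head>\n" + tags + "</head>\n" + doc


def prog_record(path, doc):
    """Format one document the way swish-e -S prog expects it on stdin."""
    # latin-1 round-trips every byte, so the original encoding is preserved.
    body = add_meta_tags(doc).encode("latin-1")
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    return header.encode("latin-1") + body


if __name__ == "__main__":
    # File names arrive as arguments (e.g. via SwishProgParameters).
    for path in sys.argv[1:]:
        with open(path, encoding="latin-1") as fh:
            sys.stdout.buffer.write(prog_record(path, fh.read()))
```

Saved under a name of your choosing (say add_meta.py, a hypothetical name), it would be run as "swish-e -S prog -i ./add_meta.py -c swish.conf". The names also have to be declared in the config with "MetaNames msgtitle authorname authoremail", otherwise "swish-e -w authorname=foo" has nothing to search.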
--
Peter Karman . peter(at)not-real.peknet.com . http://www.peknet.com/
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Sat Sep 8 20:36:22 2007