Re: Parsing a hypermail archive to exclude headers and footers

From: David L Norris <dave(at)>
Date: Thu Oct 09 2003 - 19:12:32 GMT

On Thu, 2003-10-09 at 13:32, Kissman, Paul (BLC) wrote:
> I have a newbie question.
> I have started to create hypermail archives of our majordomo lists in
> order to be able to search them via Swish-E.  (swish-e 2.2.3)

> I can't figure out if there is a way to have swish-e just index this
> part of the document or not.

You might want to look at script included with

Also, below I've included the SWISH-E config I use to index my hypermail
archives with SWISH-E 2.4.  Maybe you can adapt it to your needs.

# Rewrite the files to play nice with our meta data
FileFilter .html /usr/bin/perl "-p -e 's@<!-- body=\"start\" -->@<!-- body=\"start\" --><div>@g;s@<!-- body=\"end\" -->@</div><!-- body=\"end\" -->@g;s@<pre>@<pre><div>@g;s@</pre>@</div></pre>@g' '%p'"
FileRules filename regex /author\.html/
FileRules filename regex /index\.html/
FileRules filename regex /thread\.html/
FileRules filename regex /subject\.html/
DefaultContents HTML2
IndexOnly .html
IndexContents HTML2 .html
PropertyNames author subject date serial
PropertyNamesDate epoch
#PropertyNamesNumeric serial
MetaNames swishtitle swishdescription author subject date epoch serial
PresortedIndex serial epoch
StoreDescription HTML2 <div> 128
IgnoreMetaTags <dl> <dt> <dd> <ul> <li> <strong>
MetaNamesRank 1 author
MetaNamesRank 1 epoch
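
In case the FileFilter one-liner is hard to read: as far as I can tell, all it does is wrap the message body (the part between hypermail's body="start" and body="end" comments, plus any <pre> block) in a <div>, so that the StoreDescription HTML2 <div> 128 line has a <div> to take the description from. Below is a rough Python rendering of the same four substitutions, for illustration only. The function name wrap_body and the sample message are made up for this sketch; the real filter is the perl command above.

```python
# Illustrative re-statement of the perl -p -e filter above.
# wrap_body() and the sample text are hypothetical, not part of Swish-e or hypermail.

# The four literal substitutions, in the same order as the perl command.
SUBSTITUTIONS = [
    ('<!-- body="start" -->', '<!-- body="start" --><div>'),
    ('<!-- body="end" -->', '</div><!-- body="end" -->'),
    ('<pre>', '<pre><div>'),
    ('</pre>', '</div></pre>'),
]


def wrap_body(html: str) -> str:
    """Wrap the hypermail message body and any <pre> block in <div> tags."""
    for old, new in SUBSTITUTIONS:
        # The patterns contain no regex metacharacters, so a plain
        # string replace behaves the same as the s@...@...@g forms.
        html = html.replace(old, new)
    return html


if __name__ == "__main__":
    sample = (
        '<!-- body="start" -->\n'
        '<pre>\n'
        'Hello list\n'
        '</pre>\n'
        '<!-- body="end" -->\n'
    )
    print(wrap_body(sample))
```

Running something like this on one archived message is a quick way to check that the markers in your own hypermail output match the ones the filter expects before you reindex.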

 David Norris
  ICQ - 412039
Received on Thu Oct 9 19:17:30 2003