Hi!
I had to create some special config files for swish-e (a fileserver
indexer and a MoinMoin wiki indexer), and a spider for Request Tracker (RT).
Here they are.

Walking through a fileserver, with OpenDocument + MS Office + PDF support:
# replace /path/to/files with local mount point
ReplaceRules regex #/path/to/files/#/mnt/local_mount_point/#ig
IndexOnly .txt .htm .html .pdf .ods .odt .odp .doc .xls .ppt .pps .sxw .sxc .sxg .xml
IndexDir /path/to/files/
FileRules filename contains /.\$/
IndexFile /path/to/index/fileserver.swish-e
MinWordLimit 3
# XML files and associated
FileFilterMatch "/usr/bin/unzip" "-p %p content.xml" /\.(sxw|sxc|sxg|ods|odt|odp)$/i
IndexContents XML* .sxw .sxc .sxg .ods .odt .xml .odp
StoreDescription XML* <office:body> 20000
IndexContents TXT* .xls .doc .pps .ppt .txt .pdf
StoreDescription TXT* 20000
IndexContents HTML* .html .htm
StoreDescription HTML* <body> 20000
# DOC files
FileFilterMatch "/usr/bin/catdoc" "-b %p | recode -p -q -f ..latin1" /\.(doc)$/i
# XLS files
FileFilterMatch "/usr/bin/xls2csv" "-x %p | recode -p -q -f ..latin1" /\.(xls)$/i
# PPT/PPS
FileFilterMatch "/usr/bin/catppt" " %p | recode -p -q -f ..latin1" /\.(ppt|pps)$/i
# PDF files
FileFilterMatch "/usr/bin/pdftotext" " -q %p - | recode -p -q -f ..latin1" /\.(pdf)$/i
___________________________________________________________________________________
RT spider:

#!/usr/bin/perl -w
# Spider for Request Tracker: prints one HTML document per text attachment
# in the format read by swish-e's "-S prog" input method.
use strict;
use DBI;
use Time::Local;
use Locale::Recode;

my $dbh = DBI->connect( "dbi:Pg:dbname=rtdb;host=HOST", "USER", "PASSWORD",
    { RaiseError => 1 } );

# Text attachments of ticket creations and comments, skipping deleted tickets.
my $sth = $dbh->prepare(
    "select ti.id, ti.subject, at.content, at.created "
  . "from tickets ti, transactions tr, attachments at "
  . "where ti.status <> 'deleted' "
  . "and tr.objectid = ti.id "
  . "and at.transactionid = tr.id "
  . "and at.contenttype like 'text/%' "
  . "and (tr.type = 'Comment' or tr.type = 'CommentEmailRecord' "
  . "or tr.type = 'Create')" );
$sth->execute();

# The converter is the same for every row, so build it once.
my $cd = Locale::Recode->new( from => 'UTF-8', to => 'ISO-8859-15' );

while ( my ( $id, $title, $ticket, $date ) = $sth->fetchrow_array ) {
    my $unix_date = unixtime($date);
    $cd->recode($ticket);
    my $content = <<EOF;
<html>
<head>
<title>
RT - $title
</title>
<meta http-equiv="content-type" content="text/html;charset=iso-8859-15" />
</head>
<body>
$ticket
</body>
</html>
EOF
    my $length = length $content;
    # The blank line before EOF ends the header block.
    print <<EOF;
Content-Length: $length
Last-Mtime: $unix_date
Path-Name: http://mydomain.wxt/Ticket/Display.html?id=$id
Document-Type: HTML

EOF
    print $content;
}

# Convert a "YYYY-MM-DD HH:MM:SS" timestamp to Unix time.
sub unixtime {
    my ( $y, $m, $dh ) = split /-/, shift;
    my ( $d, $hms ) = split / /, $dh;
    my ( $h, $i, $s ) = split /:/, $hms;
    return timelocal( $s, $i, $h, $d, $m - 1, $y - 1900 );
}
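
A note on unixtime(): it expects the timestamp to look exactly like "YYYY-MM-DD HH:MM:SS". Below is a minimal sketch of a more tolerant variant, assuming the database might return a fractional part or a zone suffix after the seconds. The name unixtime_safe is hypothetical and not part of the spider.

```perl
use strict;
use warnings;
use Time::Local;

# Hypothetical variant of unixtime(): accepts "YYYY-MM-DD HH:MM:SS" followed
# by anything (fraction, zone suffix) and returns undef when the input does
# not match, instead of passing bad values to timelocal().
sub unixtime_safe {
    my ($stamp) = @_;
    return undef unless defined $stamp
        && $stamp =~ /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/;
    # timelocal() wants a zero-based month; a four-digit year is taken as is.
    return timelocal( $6, $5, $4, $3, $2 - 1, $1 );
}
```

Rows for which it returns undef could simply be printed without a Last-Mtime header.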

swish.conf:
IndexFile /path/to/rt.swish-e
DefaultContents HTML
StoreDescription HTML <body> 200000
MetaNames swishdocpath swishtitle
MinWordLimit 3

Command line to run this:
swish-e -c /path/to/config/file/swish.conf -S prog -i /path/to/rt_spider.pl
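
For reference, each document passed to swish-e through -S prog is a set of headers, then a blank line, then exactly Content-Length bytes of content. Here is a minimal sketch of a helper that builds one such record. The name prog_record and the sample values are only illustrations and are not part of the spider.

```perl
use strict;
use warnings;

# Hypothetical helper: build one document record for swish-e's "-S prog"
# input method. $content should already be encoded, because Content-Length
# has to count bytes.
sub prog_record {
    my ( $path, $mtime, $type, $content ) = @_;
    my $length  = length $content;
    my $headers = "Path-Name: $path\n"
                . "Content-Length: $length\n"
                . "Last-Mtime: $mtime\n";
    $headers .= "Document-Type: $type\n" if defined $type;
    # A blank line separates the headers from the content.
    return $headers . "\n" . $content;
}

print prog_record( 'http://mydomain.wxt/Ticket/Display.html?id=1',
    1208502521, 'HTML', '<html><body>example ticket</body></html>' );
```

If the blank line is missing, swish-e has no way to tell where the headers stop and the document starts.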
_______________________________________________________________________
MoinMoin wiki indexer (through the filesystem):

#!/usr/bin/perl
# Walks the wiki tree on disk and prints every cached text_html page in the
# format read by swish-e's "-S prog" input method.
use strict;
use warnings;
use File::Find;
use Locale::Recode;

my $cd = Locale::Recode->new( from => 'UTF-8', to => 'ISO-8859-15' );

sub wanted {
    return if -d;
    return unless /text_html$/;
    my $mtime = (stat)[9];
    open( my $fh, '<', $_ ) or die "Cannot open $_: $!";
    my $content = '';
    while ( my $l = <$fh> ) {
        chomp($l);
        # Join lines with a space so words on adjacent lines do not merge.
        $content .= "$l ";
    }
    close $fh;
    $cd->recode($content);
    $content = "<body>$content</body>";
    my $size = length $content;
    # The blank line before EOF ends the header block.
    print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $_

EOF
    print $content;
}

find( { wanted => \&wanted, no_chdir => 1 }, '.' );

swish config file:
IndexFile /path/to/my/indexes/all_wikis.swish-e
DefaultContents HTML*
StoreDescription HTML* <body> 200000
ConvertHTMLEntities yes
MinWordLimit 2
ReplaceRules regex !^.*/doc/wikis/!!
ReplaceRules remove data/
ReplaceRules remove cache/
ReplaceRules remove pages/
ReplaceRules remove /text_html
ReplaceRules remove /pagelinks
ReplaceRules replace \(2f\) \/
ReplaceRules replace \(2e\) \.
ReplaceRules replace \(2d\) \-
ReplaceRules regex /\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi
ReplaceRules prepend 'http://my.domain.org/'
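
To see what these ReplaceRules do to a path, here is a rough Perl emulation of the chain. It is only an approximation and not swish-e's own code: it assumes the rules are applied in order, and that "remove" and "replace" act on every literal occurrence. The name wiki_url and the sample path are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical emulation of the ReplaceRules above, applied in order.
sub wiki_url {
    my ($path) = @_;
    $path =~ s!^.*/doc/wikis/!!;                     # first regex rule
    for my $part ( 'data/', 'cache/', 'pages/', '/text_html', '/pagelinks' ) {
        $path =~ s/\Q$part\E//g;                     # remove rules
    }
    my %literal = ( '(2f)' => '/', '(2e)' => '.', '(2d)' => '-' );
    for my $from ( sort keys %literal ) {
        $path =~ s/\Q$from\E/$literal{$from}/g;      # replace rules
    }
    # Remaining two-byte escapes such as "(c3a9)" become "%c3%a9".
    $path =~ s/\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi;
    return 'http://my.domain.org/' . $path;          # prepend rule
}

print wiki_url('/srv/doc/wikis/mywiki/data/pages/Caf(c3a9)(2f)Menu/cache/text_html'), "\n";
# prints http://my.domain.org/mywiki/Caf%c3%a9/Menu
```

The point of the chain is to turn the on-disk cache path of a page into the URL under which the wiki serves that page.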

Command line:
/path/to/swish_filter/filter.pl | swish-e -c /path/to/swish-wiki.config -i stdin -S prog

For information: I found a lot of hints on the mailing list, so if you
think you have already seen some of these features, well, that's normal ;)
Hope this helps!
Regards,
C. Jeanneret
--
Jeanneret Internux cjeanneret@internux.ch
Av. des Alpes 123 +41 78 748 03 02
1814 La Tour-de-Peilz +41 21 550 02 09
Received on Fri Apr 18 07:08:41 2008