[swish-e] Differents config + script support

From: Cedric Jeanneret <cjeanneret(at)not-real.internux.ch>
Date: Fri Apr 18 2008 - 11:08:47 GMT
Hi!
I had to write some special config files for swish-e (a fileserver
indexer and a MoinMoin wiki indexer), plus a spider for Request Tracker (RT).
Here they are :

Walk through a fileserver, with OpenDocument, MS Office and PDF support :

# replace /path/to/files with local mount point
ReplaceRules regex #/path/to/files/#/mnt/local_mount_point/#ig

IndexOnly .txt .htm .html .pdf .ods .odt .odp .doc .xls .ppt .pps .sxw .sxc .sxg .xml
IndexDir /path/to/files/

FileRules filename contains /.\$/
IndexFile /path/to/index/fileserver.swish-e
MinWordLimit 3


# OpenOffice.org and OpenDocument files: index the embedded content.xml
FileFilterMatch "/usr/bin/unzip" "-p %p content.xml" /\.(sxw|sxc|sxg|ods|odt|odp)$/i

IndexContents XML* .sxw .sxc .sxg .ods .odt .xml .odp
StoreDescription XML* <office:body> 20000

IndexContents TXT* .xls .doc .pps .ppt .txt .pdf
StoreDescription TXT* 20000

IndexContents HTML* .html .htm
StoreDescription HTML* <body> 20000

# DOC files
FileFilterMatch "/usr/bin/catdoc" "-b %p | recode -p -q -f ..latin1" /\.(doc)$/i
# XLS files
FileFilterMatch "/usr/bin/xls2csv" "-x %p | recode -p -q -f ..latin1" /\.(xls)$/i
# PPT/PPS
FileFilterMatch "/usr/bin/catppt" " %p | recode -p -q -f ..latin1" /\.(ppt|pps)$/i
# PDF files
FileFilterMatch "/usr/bin/pdftotext" " -q %p - | recode -p -q -f ..latin1" /\.(pdf)$/i
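
As a reading aid, the extension handling in the config above can be summarised as a lookup table. This is only a minimal Perl sketch, not something swish-e itself runs: the names %handling and handling_for are hypothetical, and folding the extension to lower case for the parser lookup is an assumption (only the FileFilterMatch patterns carry the i flag).

```perl
use strict;
use warnings;

# Extension => [ filter program (undef if none), parser ], copied from the
# FileFilterMatch and IndexContents lines of the config.
my %handling = (
    ( map { $_ => [ '/usr/bin/unzip', 'XML*' ] } qw(sxw sxc sxg ods odt odp) ),
    xml  => [ undef,                'XML*'  ],
    doc  => [ '/usr/bin/catdoc',    'TXT*'  ],
    xls  => [ '/usr/bin/xls2csv',   'TXT*'  ],
    ppt  => [ '/usr/bin/catppt',    'TXT*'  ],
    pps  => [ '/usr/bin/catppt',    'TXT*'  ],
    pdf  => [ '/usr/bin/pdftotext', 'TXT*'  ],
    txt  => [ undef,                'TXT*'  ],
    html => [ undef,                'HTML*' ],
    htm  => [ undef,                'HTML*' ],
);

# Hypothetical helper: return (filter, parser) for a file name, or an
# empty list when the extension is not listed in the config.
sub handling_for {
    my ($file) = @_;
    my ($ext) = $file =~ /\.([^.\/]+)\z/ or return;
    my $entry = $handling{ lc $ext } or return;
    return @$entry;
}
```

For example, handling_for('/path/to/files/report.odt') returns the unzip filter together with the XML* parser, while a .txt file gets no filter and the TXT* parser.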

___________________________________________________________________________________

RT Spider :

#!/usr/bin/perl -w
use strict;

use DBI;
use Compress::Zlib;
use Time::Local;
use Locale::Recode;


my $dbh = DBI->connect( "dbi:Pg:dbname=rtdb;host=HOST","USER","PASSWORD", { RaiseError => 1 } );

my $sth = $dbh->prepare("
    select ti.id, ti.subject, at.content, at.created
    from tickets ti, transactions tr, attachments at
    where ti.status <> 'deleted'
      and tr.objectid = ti.id
      and at.transactionid = tr.id
      and at.contenttype like 'text/%'
      and (tr.type = 'Comment' or tr.type = 'CommentEmailRecord' or tr.type = 'Create')
");

$sth->execute();

while ( my( $id, $title,$ticket,$date) = $sth->fetchrow_array ) {

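  # uncompress() returns undef when the content is not zlib-compressed,
  # so only use its result when it succeeded.
  my $uncompressed = uncompress( $ticket );
  $ticket = $uncompressed if defined $uncompressed;
  my $unix_date = unixtime( $date );

  my $cd = Locale::Recode->new (from => 'UTF-8', to => 'ISO-8859-15');
  $cd->recode($ticket);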

  my $content = <<EOF;
<html>
<head>
<title>
RT - $title
</title>
<meta http-equiv="content-type" content="text/html;charset=iso-8859-15" />
</head>
<body>
$ticket
</body>
</html>
EOF


  my $length = length $content;

  print <<EOF;
Content-Length: $length
Last-Mtime: $unix_date
Path-Name: http://mydomain.wxt/Ticket/Display.html?id=$id
Document-Type: HTML

EOF
  print $content;

}

sub unixtime {
  my ( $y, $m, $dh ) = split /-/, shift;
  my ($d, $hms) = split / /, $dh;
  my ($h,$i,$s) = split /:/,$hms;
  return timelocal($s,$i,$h,$d,$m-1,$y-1900);
};
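
The unixtime helper above splits the timestamp by hand and expects exactly 'YYYY-MM-DD HH:MM:SS'. A slightly more defensive variant might look like the sketch below. It assumes the database may also return fractional seconds or a zone suffix, which it simply ignores; the name pg_timestamp_to_epoch is hypothetical, and like the original it interprets the time in the local time zone.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical helper: parse 'YYYY-MM-DD HH:MM:SS' (optionally followed by
# a fraction or zone suffix, both ignored) and return epoch seconds in the
# local time zone. Returns undef when the string does not match.
sub pg_timestamp_to_epoch {
    my ($ts) = @_;
    return unless defined $ts;
    my ( $y, $m, $d, $h, $i, $s ) =
        $ts =~ /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/
        or return;
    return timelocal( $s, $i, $h, $d, $m - 1, $y );
}
```

A row whose timestamp cannot be parsed then yields undef, so the caller can decide whether to skip the Last-Mtime header instead of dying in the middle of a run.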

swish.conf :

IndexFile /path/to/rt.swish-e

DefaultContents HTML
StoreDescription HTML <body> 200000
MetaNames swishdocpath swishtitle

MinWordLimit 3


Command line to run this :

swish-e -c /path/to/config/file/swish.conf -S prog -i /path/to/rt_spider.pl
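
With -S prog, swish-e reads documents from the program's output: each one is a small block of headers, an empty line, then the content. The sketch below wraps that in one function using only the headers that appear in the scripts in this mail. The name swish_prog_document and its argument names are hypothetical, and it assumes the content is already a byte string (for example after recoding to ISO-8859-15), so that length() counts bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: build one document record for "swish-e -S prog".
# Expects path and content; mtime and type are optional.
# Assumes $doc{content} is a byte string, so length() gives the byte count.
sub swish_prog_document {
    my (%doc) = @_;
    my $out = 'Content-Length: ' . length( $doc{content} ) . "\n";
    $out .= "Last-Mtime: $doc{mtime}\n"    if defined $doc{mtime};
    $out .= "Path-Name: $doc{path}\n";
    $out .= "Document-Type: $doc{type}\n"  if defined $doc{type};
    return $out . "\n" . $doc{content};
}
```

A spider would then just print the return value for every document, e.g. print swish_prog_document( path => $url, mtime => $unix_date, type => 'HTML', content => $content ).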

_______________________________________________________________________

MoinMoin wiki indexer (through the filesystem)

#!/usr/bin/perl
use File::Find;
use Locale::Recode;
use strict;

sub wanted {
    return if -d;
    return unless /text_html$/;

    my $mtime  = (stat)[9];

    open( my $fh, '<', $_ ) or die "Cannot open $_: $!";

    # Join lines with a space so words at line ends are not glued together.
    my $content = '';
    while ( my $l = <$fh> ) {
        chomp($l);
        $content .= "$l ";
    }
    close $fh;

    my $cd = Locale::Recode->new(from => 'UTF-8', to => 'ISO-8859-15');
    $cd->recode($content);
    $content = "<body>$content</body>";
    
    my $size = length $content;

    print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $_

EOF
    print $content;
}

find({ wanted => \&wanted, no_chdir => 1, },'.', );


swish config file :

IndexFile /path/to/my/indexes/all_wikis.swish-e

DefaultContents HTML*
StoreDescription HTML* <body> 200000
ConvertHTMLEntities yes

MinWordLimit 2

ReplaceRules regex !^.*/doc/wikis/!!
ReplaceRules remove data/
ReplaceRules remove cache/
ReplaceRules remove pages/
ReplaceRules remove /text_html
ReplaceRules remove /pagelinks

ReplaceRules replace \(2f\) \/
ReplaceRules replace \(2e\) \.
ReplaceRules replace \(2d\) \-

ReplaceRules regex /\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi
ReplaceRules prepend 'http://my.domain.org/'
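
To see what these ReplaceRules are meant to achieve, here is a rough Perl equivalent that turns a cache file path into a wiki URL. It is only a sketch under assumptions: that the rules are applied in the order listed and that remove and replace act on every occurrence. The function name wiki_path_to_url and the sample paths are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the ReplaceRules above, in the same order.
sub wiki_path_to_url {
    my ($path) = @_;

    # ReplaceRules regex !^.*/doc/wikis/!!
    $path =~ s!^.*/doc/wikis/!!;

    # ReplaceRules remove ...
    for my $part ( 'data/', 'cache/', 'pages/', '/text_html', '/pagelinks' ) {
        $path =~ s/\Q$part\E//g;
    }

    # ReplaceRules replace: decode the escapes for "/", "." and "-"
    $path =~ s{\(2f\)}{/}g;
    $path =~ s{\(2e\)}{.}g;
    $path =~ s{\(2d\)}{-}g;

    # Two-byte escapes such as (c3a9) become %c3%a9
    $path =~ s/\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi;

    # ReplaceRules prepend
    return 'http://my.domain.org/' . $path;
}
```

With these rules a path such as /srv/doc/wikis/mywiki/data/pages/Foo(2f)Bar/cache/text_html maps to http://my.domain.org/mywiki/Foo/Bar.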


Command line :

/path/to/swish_filter/filter.pl | swish-e -c /path/to/swish-wiki.config -i stdin -S prog


For the record, I found a lot of hints on this mailing list, so if some
of this looks familiar, that is to be expected ;)

Hope this helps!

Regards

C. Jeanneret


-- 
Jeanneret Internux      cjeanneret@internux.ch
Av. des Alpes 123       +41 78 748 03 02
1814 La Tour-de-Peilz   +41 21 550 02 09

Received on Fri Apr 18 07:08:41 2008