[swish-e] Some stats for index.

From: Cedric Jeanneret <cjeanneret(at)not-real.internux.ch>
Date: Fri Apr 18 2008 - 15:11:37 GMT
Hi again!

As discussed on IRC, here are some indexing stats from my swish-e setups.
The swish configs are the ones I posted a few hours ago; you can find them
at the end of this mail.


Wikis (MoinMoin, read through the filesystem; no HTTP access is made for
this indexing):

Sorting 53,352 words alphabetically
53,352 unique words indexed.
Sorting property: swishdocpath                            
Sorting property: swishtitle                              
Sorting property: swishdocsize                            
Sorting property: swishlastmodified                       
Sorting property: swishdescription                        
5 properties sorted.                                              
2,091 files indexed.  16,440,593 total bytes.  926,703 total words.
Elapsed time: 00:00:08 CPU time: 00:00:08

hardware/software spec:
Debian 3.1, kernel 2.6.16 on a dual Intel(R) Xeon(TM) CPU 2.80GHz, 2025 MB RAM


Filesystem:

Sorting 2,618,998 words alphabetically
2,618,998 unique words indexed.
5 properties sorted.                                              
26,361 files indexed.  3,559,089,581 total bytes.  53,755,796 total words.
Elapsed time: 00:35:18 CPU time: 00:08:17

hardware/software spec:
Debian 4.0, kernel 2.6.18 on 8x Intel(R) Xeon(R) CPU E5420  @ 2.50GHz, 7964 MB RAM
!! It's a Virtual Environment (VE) which shares system resources with 9 other VEs.


Request Tracker (pgsql through LAN):

Sorting 59,507 words alphabetically
59,507 unique words indexed.
Sorting property: swishdocpath                            
Sorting property: swishtitle                              
Sorting property: swishdocsize                            
Sorting property: swishlastmodified                       
Sorting property: swishdescription                        
5 properties sorted.                                              
13,437 files indexed.  32,635,567 total bytes.  1,714,068 total words.
Elapsed time: 00:00:34 CPU time: 00:00:26

hardware/software spec:
Debian 4.0, kernel 2.6.18 OpenVZ, on a dual Intel(R) Pentium(R) 4 CPU 2.80GHz, 2023 MB RAM
!! It's a Virtual Environment (VE) which shares system resources with 8 other VEs.
The database is PostgreSQL, with about 7,000 tickets.


I don't know whether these numbers are really significant, but maybe they are useful ;)
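
To make the three runs easier to compare, here is a small sketch that turns the figures above into files per second, MB per second and the CPU/elapsed ratio. It is written in Python rather than Perl, and the names Run, seconds and summarize are made up for this mail; only the numbers come from the swish-e output.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    files: int
    total_bytes: int
    elapsed: str  # "HH:MM:SS" as printed by swish-e
    cpu: str      # "HH:MM:SS" as printed by swish-e


def seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# Figures copied from the three index runs reported above.
RUNS = [
    Run("wikis", 2_091, 16_440_593, "00:00:08", "00:00:08"),
    Run("filesystem", 26_361, 3_559_089_581, "00:35:18", "00:08:17"),
    Run("request tracker", 13_437, 32_635_567, "00:00:34", "00:00:26"),
]


def summarize(run: Run) -> dict:
    """Derive throughput and the CPU share of elapsed time for one run."""
    elapsed = seconds(run.elapsed)
    return {
        "files_per_s": run.files / elapsed,
        "mb_per_s": run.total_bytes / elapsed / 1e6,
        "cpu_share": seconds(run.cpu) / elapsed,
    }


if __name__ == "__main__":
    for run in RUNS:
        stats = summarize(run)
        print(f"{run.name:16s} {stats['files_per_s']:8.1f} files/s "
              f"{stats['mb_per_s']:6.2f} MB/s  cpu/elapsed {stats['cpu_share']:.2f}")
```

On these numbers the filesystem run spends only about a quarter of its elapsed time on CPU, which would fit a job that mostly waits on disk or on the external filters, though with a shared VE that is hard to say for sure.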

Regards

C. Jeanneret


Jeanneret Internux      cjeanneret@internux.ch
Av. des Alpes 123       +41 78 748 03 02
1814 La Tour-de-Peilz   +41 21 550 02 09

>
> Walk through a fileserver, with opendocument + ms-office + pdf support :
>
> # replace /path/to/files with local mount point
> ReplaceRules regex #/path/to/files/#/mnt/local_mount_point/#ig
>
> IndexOnly .txt .htm .html .pdf .ods .odt .odp .doc .xls .ppt .pps .sxw .sxc .sxg .xml
> IndexDir /path/to/files/
>
> FileRules filename contains /.\$/
> IndexFile /path/to/index/fileserver.swish-e
> MinWordLimit 3
>
>
> # XML files and associated
> FileFilterMatch "/usr/bin/unzip" "-p %p content.xml" /\.(sxw|sxc|sxg|ods|odt|odp)$/i
>
> IndexContents XML* .sxw .sxc .sxg .ods .odt .xml .odp
> StoreDescription XML* <office:body> 20000
>
> IndexContents TXT* .xls .doc .pps .ppt .txt .pdf
> StoreDescription TXT* 20000
>
> IndexContents HTML* .html .htm
> StoreDescription HTML* <body> 20000
>
> # DOC files
> FileFilterMatch "/usr/bin/catdoc" "-b %p | recode -p -q -f ..latin1" /\.(doc)$/i
> # XLS files
> FileFilterMatch "/usr/bin/xls2csv" "-x %p | recode -p -q -f ..latin1" /\.(xls)$/i
> # PPT/PPS
> FileFilterMatch "/usr/bin/catppt" " %p | recode -p -q -f ..latin1" /\.(ppt|pps)$/i
> # PDF files
> FileFilterMatch "/usr/bin/pdftotext" " -q %p - | recode -p -q -f ..latin1" /\.(pdf)$/i
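
A quick way to check which filter a given file name would hit is to replay the regexes outside swish-e. The sketch below is Python, not swish-e code; FILTERS and filter_for are names I made up, and it only reproduces the case-insensitive extension matching, not how swish-e actually launches the filter.

```python
import re

# (pattern, filter program) pairs copied from the FileFilterMatch lines above.
FILTERS = [
    (r"\.(sxw|sxc|sxg|ods|odt|odp)$", "/usr/bin/unzip"),
    (r"\.(doc)$", "/usr/bin/catdoc"),
    (r"\.(xls)$", "/usr/bin/xls2csv"),
    (r"\.(ppt|pps)$", "/usr/bin/catppt"),
    (r"\.(pdf)$", "/usr/bin/pdftotext"),
]


def filter_for(filename):
    """Return the filter program whose pattern matches, or None."""
    for pattern, program in FILTERS:
        # The /i flag in the config corresponds to IGNORECASE here.
        if re.search(pattern, filename, flags=re.IGNORECASE):
            return program
    return None


if __name__ == "__main__":
    for name in ("Budget.XLS", "slides.odp", "notes.txt"):
        print(name, "->", filter_for(name))
```

Files that match no pattern (.txt, .html, .xml here) are indexed without a filter.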
>
> ___________________________________________________________________________________
>
> RT Spider :
>
> #!/usr/bin/perl -w
> use strict;
>
> use DBI;
> use Compress::Zlib;
> use Time::Local;
> use Locale::Recode;
>
>
> my $dbh = DBI->connect( "dbi:Pg:dbname=rtdb;host=HOST","USER","PASSWORD", { RaiseError => 1 } );
>
> my $sth = $dbh->prepare(
>   "select ti.id, ti.subject, at.content, at.created "
>   . "from tickets ti, transactions tr, attachments at "
>   . "where ti.status <> 'deleted' and tr.objectid = ti.id "
>   . "and at.transactionid = tr.id and at.contenttype like 'text/%' "
>   . "and (tr.type = 'Comment' or tr.type = 'CommentEmailRecord' or tr.type = 'Create')"
> );
>
> $sth->execute();
>
> while ( my( $id, $title,$ticket,$date) = $sth->fetchrow_array ) {
>
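>   # uncompress() returns undef when the content is not zlib-compressed,
>   # so keep the raw attachment text in that case.
>   my $uncompressed = uncompress( $ticket );
>   $ticket = $uncompressed if defined $uncompressed;
>   my $unix_date = unixtime( $date );
>
>   # Recode in place; the HTML below declares iso-8859-15.
>   my $cd = Locale::Recode->new (from => 'UTF-8', to => 'ISO-8859-15');
>   $cd->recode($ticket);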
>
>   my $content = <<EOF;
> <html>
> <head>
> <title>
> RT - $title
> </title>
> <meta http-equiv="content-type" content="text/html;charset=iso-8859-15" />
> </head>
> <body>
> $ticket
> </body>
> </html>
> EOF
>
>
>   my $length = length $content;
>
>   print <<EOF;
> Content-Length: $length
> Last-Mtime: $unix_date
> Path-Name: http://mydomain.wxt/Ticket/Display.html?id=$id
> Document-Type: HTML
>
> EOF
>   print $content;
>
> }
>
> sub unixtime {
>   my ( $y, $m, $dh ) = split /-/, shift;
>   my ($d, $hms) = split / /, $dh;
>   my ($h,$i,$s) = split /:/,$hms;
>   return timelocal($s,$i,$h,$d,$m-1,$y-1900);
> };
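
For comparison, the same date conversion as the unixtime sub, sketched in Python. It assumes the created column comes back as 'YYYY-MM-DD HH:MM:SS', optionally with fractional seconds (the Perl version above does not strip those), and it treats the value as local time the way timelocal() does.

```python
from datetime import datetime


def unixtime(stamp: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' (local time) to epoch seconds."""
    # Drop fractional seconds if the database returns any (an assumption).
    stamp = stamp.split(".", 1)[0]
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    # A naive datetime is interpreted in the local timezone, like timelocal().
    return int(parsed.timestamp())


if __name__ == "__main__":
    print(unixtime("2008-04-18 15:11:37"))
```

The printed value depends on the timezone of the machine running it, which is also true of the Perl version.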
>
> swish.conf :
>
> IndexFile /path/to/rt.swish-e
>
> DefaultContents HTML
> StoreDescription HTML <body> 200000
> MetaNames swishdocpath swishtitle
>
> MinWordLimit 3
>
>
> Command line to run this :
>
> swish-e -c /path/to/config/file/swish.conf -S prog -i /path/to/rt_spider.pl
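
For anyone adapting the spider: with -S prog, the program prints each document as a few header lines, a blank line, and then the content, with Content-Length giving the size of that content. Here is a minimal sketch of the framing in Python rather than Perl. The helper name format_document and the example URL are made up for illustration; the header names are the ones the spider above prints.

```python
import sys


def format_document(path_name: str, html: str, mtime: int,
                    encoding: str = "iso-8859-15") -> bytes:
    """Frame one document the way the spider above does for `swish-e -S prog`."""
    # Content-Length has to count bytes, so encode the body first.
    body = html.encode(encoding, errors="replace")
    headers = (
        f"Content-Length: {len(body)}\n"
        f"Last-Mtime: {mtime}\n"
        f"Path-Name: {path_name}\n"
        "Document-Type: HTML\n"
        "\n"  # a blank line ends the headers
    )
    return headers.encode("ascii") + body


if __name__ == "__main__":
    doc = format_document(
        "http://example.invalid/Ticket/Display.html?id=1",  # hypothetical URL
        "<html><head><title>RT - demo</title></head>"
        "<body>caf\u00e9</body></html>",
        1208531497,
    )
    sys.stdout.buffer.write(doc)
```

Encoding before measuring matters as soon as the text contains non-ASCII characters, because the character count and the byte count can then differ.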
>
> _______________________________________________________________________
>
> Moin-moin wiki indexer (through filesystem)
>
> #!/usr/bin/perl
> use File::Find;
> use Locale::Recode;
> use strict;
>
> sub wanted {
>     return if -d;
>     return unless /text_html$/;
>
>     my $mtime  = (stat)[9];
>
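>     open( my $fh, '<', $_ ) or die "Cannot open $_: $!";
>
>     # Join lines with a space so that the last word of one line
>     # does not run into the first word of the next.
>     my $content = '';
>     while ( my $l = <$fh> ) {
>         chomp($l);
>         $content .= "$l ";
>     }
>     close $fh;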
>
>     my $cd = Locale::Recode->new(from => 'UTF-8', to => 'ISO-8859-15');
>     $cd->recode($content);
>     $content = "<body>$content</body>";
>     
>     my $size = length $content;
>
>     print <<EOF;
> Content-Length: $size
> Last-Mtime: $mtime
> Path-Name: $_
>
> EOF
>     print "$content";
> }
>
> find({ wanted => \&wanted, no_chdir => 1, },'.', );
>
>
> swish config file :
>
> IndexFile /path/to/my/indexes/all_wikis.swish-e
>
> DefaultContents HTML*
> StoreDescription HTML* <body> 200000
> ConvertHTMLEntities yes
>
> MinWordLimit 2
>
> ReplaceRules regex !^.*/doc/wikis/!!
> ReplaceRules remove data/
> ReplaceRules remove cache/
> ReplaceRules remove pages/
> ReplaceRules remove /text_html
> ReplaceRules remove /pagelinks
>
> ReplaceRules replace \(2f\) \/
> ReplaceRules replace \(2e\) \.
> ReplaceRules replace \(2d\) \-
>
> ReplaceRules regex /\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi
> ReplaceRules prepend 'http://my.domain.org/'
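
To show what the rule chain above is meant to do to one path, here is a walk-through in Python. This is only a sketch of the intended effect: I am assuming that "remove" and "replace" behave like plain substring operations, which I have not checked against the swish-e source, and the sample path is invented. The function name rewrite_path is mine.

```python
import re


def rewrite_path(path: str) -> str:
    """Mimic the intended effect of the ReplaceRules above on one path."""
    # ReplaceRules regex !^.*/doc/wikis/!!
    path = re.sub(r"^.*/doc/wikis/", "", path)
    # ReplaceRules remove ... (assumed to be plain substring removal)
    for fragment in ("data/", "cache/", "pages/", "/text_html", "/pagelinks"):
        path = path.replace(fragment, "")
    # ReplaceRules replace \(2f\) \/  and friends
    for quoted, char in (("(2f)", "/"), ("(2e)", "."), ("(2d)", "-")):
        path = path.replace(quoted, char)
    # ReplaceRules regex /\(([a-z0-9]{2})([a-z0-9]{2})\)/%$1%$2/gi
    path = re.sub(r"\(([a-z0-9]{2})([a-z0-9]{2})\)", r"%\1%\2",
                  path, flags=re.IGNORECASE)
    # ReplaceRules prepend 'http://my.domain.org/'
    return "http://my.domain.org/" + path


if __name__ == "__main__":
    # Hypothetical cache file for a page called "Cafés/Menu" in "mywiki".
    sample = "/srv/doc/wikis/mywiki/data/pages/Caf(c3a9)s(2f)Menu/cache/text_html"
    print(rewrite_path(sample))
    # prints: http://my.domain.org/mywiki/Caf%c3%a9s/Menu
```

The last regex only covers groups of exactly two bytes, such as a two-byte UTF-8 character; longer groups would stay as they are.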
>
>
> Command line :
>
> /path/to/swish_filter/filter.pl | swish-e -c /path/to/swish-wiki.config -i stdin -S prog
>   
Received on Fri Apr 18 11:11:32 2008