I think I posted a Perl script a while back that did this, albeit in a crude way.
If all you want is the non-noise words, however, I'd suggest you play with the
IgnoreWords config option. Set it for something like 50%, so that you ignore
words that appear in over half the collection.
Otherwise, the raw word data can be dumped with -T INDEX_WORDS and you can parse
it as you like -- here's one example (DISCLAIMER -- I haven't tested this in
a while...)
#!/usr/bin/perl -w
#
# count instances of words in a swish-e index
# and report on NUM number of top instances
#
# usage: countwords [NUM [INDEX]]

use strict;
use Text::FormatTable;

my ($num, $index) = @ARGV;

# defaults
$index ||= 'index.swish-e';
$num   ||= 50;

my $count;
my $cmd = "swish-e -f $index -T INDEX_WORDS";

open(SWISH, "$cmd |")
    or die "can't exec '$cmd': $!\n";

while (<SWISH>) {
    chomp;
    my ($word, @insts) = split /\[\d+ /, $_;
    INST: for my $i (@insts) {
        next INST if ! $i;
        my ($doc, $cnt) = split(/\s+/, $i);
        $count->{$word}->[0] += $cnt;
        $count->{$word}->[1]++;
    }
}
close(SWISH);

# print results, stopping at $num
# use FormatTable for pretty ASCII
my $tbl = Text::FormatTable->new('r l l');
$tbl->head('word', 'count', 'unique docs');
$tbl->rule('=');

my $seen = 0;
for my $word (sort {
        $count->{$b}->[0] <=> $count->{$a}->[0]
    } keys %$count) {
    my ($cnt, $docs) = @{ $count->{$word} };
    $tbl->row($word, $cnt, $docs);
    last if ++$seen == $num;
}
print $tbl->render(60);
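
If Text::FormatTable isn't installed, the reporting half could probably be done with plain printf instead. Here's a rough, untested-against-a-real-index sketch of just that step. The top_words name and the sample numbers are made up for illustration; the hash has the same shape as $count in the script above (word => [total count, unique docs]). Ties are broken alphabetically, which the script above doesn't do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return up to $num rows of [word, total, docs], highest total first.
# $count is assumed to look like the hash the script builds:
#   word => [ total count, unique docs ]
sub top_words {
    my ($count, $num) = @_;
    my @words = sort {
        $count->{$b}[0] <=> $count->{$a}[0]   # most frequent first
            or $a cmp $b                      # ties: alphabetical
    } keys %$count;
    splice(@words, $num) if @words > $num;    # keep only the first $num
    return map { [ $_, @{ $count->{$_} } ] } @words;
}

# Hypothetical sample data, NOT taken from a real index.
my $sample = {
    search => [ 42, 7 ],
    index  => [ 42, 9 ],
    perl   => [ 5,  2 ],
};

printf "%-20s %8s %12s\n", 'word', 'count', 'unique docs';
printf "%-20s %8d %12d\n", @$_ for top_words($sample, 2);
```

In the real script you'd pass $count and $num in place of the sample hash and the literal 2.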
Tac Tacelosky scribbled on 2/16/05 11:21 AM:
> Is there a way to get a list of the words in an index, and their frequency?
> (Ideally, ordered by frequency). We're looking at a very rough way of
> saying "What are the non-noise words that show up in this collection?", and
> thought it might be possible to use swish-e. Maybe something like -k, but with
> frequency and order by most frequent words first.
>
> Probably not, but thought I'd ask anyway.
>
> Thx,
>
> Tac
>
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Wed Feb 16 09:37:21 2005