Re: Word Frequency List

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Feb 16 2005 - 17:37:20 GMT

I think I posted a Perl script a while back that did this, albeit in a crude way.

If all you want is the non-noise words, however, I'd suggest you play with the 
IgnoreWords config option. Set it to something like 50%, so that you ignore 
words that appear in over half the collection.
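
A config sketch for that, with a caveat: as far as I recall from the
SWISH-CONFIG docs, the directive that takes a percentage is IgnoreLimit
(IgnoreWords takes an explicit word list), so double-check the name against
your version. The second number is a minimum file count, and the value here
is only an assumption for illustration:

```
# swish-e config fragment (sketch, untested)
# skip words found in more than 50% of the files AND in more than 100 files
IgnoreLimit 50 100
```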

Otherwise, the raw word data can be dumped with -T INDEX_WORDS and you can parse 
it as you like -- here's one example (DISCLAIMER -- I haven't tested this in 
a while...)

#!/usr/bin/perl -w
#
# count instances of words in a swish-e index
# and report on NUM number of top instances
#
# usage: countwords [NUM [INDEX]]

use strict;
use Text::FormatTable;

my ($num,$index) = @ARGV;

#defaults
$index ||= 'index.swish-e';
$num ||= 50;

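# $count->{word} = [ total occurrences, number of docs ]
my $count = {};
my @cmd = ('swish-e', '-f', $index, '-T', 'INDEX_WORDS');

# list-form pipe open: $index never passes through the shell
open(my $swish, '-|', @cmd)
         or die "can't exec '@cmd': $!\n";

while(<$swish>) {
         chomp;
         # first chunk is the word, the rest are one chunk per "[N " group
         my ($word,@insts) = split /\[\d+ /, $_ ;
         $word =~ s/\s+$//;     # drop whitespace left before the first "["
         INST: for my $i (@insts) {
                 next INST if ! $i;
                 my ($doc,$cnt) = split(/\s+/,$i);
                 next INST unless defined $cnt;
                 $count->{$word}->[0] += $cnt;
                 $count->{$word}->[1]++;
         }
}

close($swish)
         or warn "'@cmd' exited with status $?\n";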

# print results, stopping at $num
# use FormatTable for pretty ASCII

my $tbl = Text::FormatTable->new('r  l  l');
$tbl->head('word','count','unique docs');
$tbl->rule('=');
my $seen = 0;

for my $word (sort {
         $count->{$b}->[0] <=> $count->{$a}->[0]
         } keys %$count) {
         my ($cnt,$docs) = @{ $count->{$word} };
         $tbl->row($word, $cnt, $docs);
         last if ++$seen == $num;
}

print $tbl->render(60);
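

The counting and sorting step can also be tried on its own, without an index.
This is only a sketch: the (word, doc, count) tuples below are made up for
illustration and are not real swish-e output, and it prints with plain printf
rather than Text::FormatTable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical (word, doc id, occurrences) tuples standing in for
# whatever gets parsed out of the -T INDEX_WORDS dump.
my @postings = (
    [ 'swish', 1, 3 ],
    [ 'swish', 2, 1 ],
    [ 'index', 1, 2 ],
    [ 'the',   1, 9 ],
    [ 'the',   2, 7 ],
    [ 'the',   3, 4 ],
);

# $count{word} = [ total occurrences, number of docs ]
my %count;
for my $p (@postings) {
    my ($word, $doc, $cnt) = @$p;
    $count{$word}[0] += $cnt;
    $count{$word}[1]++;
}

# most frequent first, ties broken alphabetically
my @ranked = sort {
    $count{$b}[0] <=> $count{$a}[0] or $a cmp $b
} keys %count;

# one row per word: word, total count, unique docs
printf "%-8s %5d %5d\n", $_, @{ $count{$_} } for @ranked;
```

Words that top this list and also appear in most of the docs are the likely
noise words; the ones with a high count in few docs are the interesting ones.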


Tac Tacelosky scribbled on 2/16/05 11:21 AM:
> Is there a way to get a list of the words in an index, and their frequency?
> (Ideally, ordered by frequency).  We're looking at a very rough way of
> saying "What are the non-noise words that show up in this collection?", and
> thought it might be possible to use swish-e.  Maybe something like -k, but with
> frequency and order by most frequent words first.
> 
> Probably not, but thought I'd ask anyway.
> 
> Thx,
> 
> Tac
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Wed Feb 16 09:37:21 2005