Re: SWISH text analysis package and frequency analysis for bibliographic records

From: <moseley(at)not-real.hank.org>
Date: Tue Jul 15 2003 - 15:02:18 GMT
> On Tuesday, July 15, 2003, at 04:19 AM, michelle jenkins wrote:

> > 1. The software must be able to count the occurrence
> > of each word in each record in a number of fields:

Yes, the index stores word frequency (and word position) based on 
filename and metaname (field).
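
To make that concrete, here is a minimal Python sketch of a positional index keyed by record and field. It is not swish's actual on-disk format or API; the names `tokenize`, `build_index`, `frequency`, and the sample records are all made up for illustration, and the tokenizer is a simple stand-in for real word rules.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, then keep runs of letters/digits (illustrative word rules only).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    # index[word][(record_id, field)] -> list of word positions in that field
    index = defaultdict(lambda: defaultdict(list))
    for record_id, fields in records.items():
        for field, text in fields.items():
            for position, word in enumerate(tokenize(text)):
                index[word][(record_id, field)].append(position)
    return index

def frequency(index, word, record_id, field):
    # The frequency is the number of stored positions for this record and field.
    return len(index.get(word, {}).get((record_id, field), []))

# Hypothetical sample records, one dict of fields per record.
records = {
    "rec1": {"title": "Asthma in children",
             "abstract": "Asthma therapy and asthma outcomes"},
    "rec2": {"title": "Diabetes therapy",
             "abstract": "Insulin therapy in children"},
}
index = build_index(records)
print(frequency(index, "asthma", "rec1", "abstract"))  # 2
print(frequency(index, "asthma", "rec1", "title"))     # 1
```

Storing positions rather than bare counts costs more space, but the count falls out of the list length and the positions are what make phrase matching possible later.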

> > 2. The software must be able to count the record
> > occurrence (the total number of unique records that
> > contain each word).

Well, yes, that's what swish does.  You search for a word and it returns 
the list of files that contain that word, so the length of that list is 
the record occurrence for the word.
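
If you want that count for every word at once rather than one search at a time, the idea can be sketched in a few lines of Python. This is only an illustration of the counting, not something swish provides in this form; `tokenize`, `record_occurrence`, and the sample records are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, then keep runs of letters/digits (illustrative word rules only).
    return re.findall(r"[a-z0-9]+", text.lower())

def record_occurrence(records):
    # Map each word to the set of records containing it, then count the sets.
    containing = defaultdict(set)
    for record_id, fields in records.items():
        for text in fields.values():
            for word in tokenize(text):
                containing[word].add(record_id)
    return {word: len(ids) for word, ids in containing.items()}

# Hypothetical sample records.
records = {
    "rec1": {"title": "Asthma in children",
             "abstract": "Asthma therapy and asthma outcomes"},
    "rec2": {"title": "Diabetes therapy",
             "abstract": "Insulin therapy in children"},
}
counts = record_occurrence(records)
print(counts["asthma"])   # 1: three occurrences, but all in rec1
print(counts["therapy"])  # 2: present in both records
```

Using a set per word means repeated occurrences within one record, or across several fields of the same record, are counted once.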

> > 3. The software must be able to identify frequently
> > occurring phrases (ideally including hyphenated words)
> > or word co-occurrence within records and fields

You can search for phrases, but it does that just by matching words 
based on their word position.  No pre-processing of phrases is done at 
indexing time.
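
Roughly, position-based phrase matching looks like the following Python sketch. It shows the general technique only and is not swish's code; `tokenize`, `phrase_starts`, and the sample text are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, then keep runs of letters/digits (illustrative word rules only).
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_starts(text, phrase):
    # Record the positions of every word in the text.
    positions = defaultdict(set)
    for i, word in enumerate(tokenize(text)):
        positions[word].add(i)

    words = tokenize(phrase)
    if not words:
        return []

    # A phrase matches at `start` when word k of the phrase sits at start + k.
    starts = []
    for start in sorted(positions.get(words[0], ())):
        if all(start + k in positions.get(w, ()) for k, w in enumerate(words)):
            starts.append(start)
    return starts

text = "Heart failure and congestive heart failure"
print(phrase_starts(text, "heart failure"))  # [0, 4]
print(phrase_starts(text, "failure heart"))  # []
```

Note that nothing here knows which phrases are frequent. Finding those would need a separate pass over the records, which is the part that is not done at indexing time.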

> > 4. The software must be able to allow the import of
> > MEDLINE records consisting of title, abstract, journal
> > and MeSH

No problem.


> > 5. The software must be able to remove stop words
> > at the user's discretion

Yes.

> > Obviously I'm hoping to evaluate the packages myself
> > before deciding. Previous research has used WordStat,
> > the bibliographic software Idealist and SWISH in a
> > hypertext/fulltext environment. One of the major
> > limitations of these packages was their inability to
> > analyse phrases (multi-term controlled vocabulary).

Swish has a "buzzword" feature that might work in some cases, although I 
don't think you can use it for phrases that contain white space.  It's 
really more useful for words that contain characters that wouldn't 
normally be indexed (e.g. C++).
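
The idea behind buzzwords can be sketched in Python like this. It is an assumption about the general approach, not swish's implementation: the `tokenize` function, its `buzzwords` parameter, and the punctuation it strips are all hypothetical.

```python
import re

def tokenize(text, buzzwords=()):
    buzz = {b.lower() for b in buzzwords}
    tokens = []
    # Work on whitespace-delimited chunks, so a buzzword cannot contain spaces.
    for chunk in text.lower().split():
        candidate = chunk.strip(".,;:!?()\"'")
        if candidate in buzz:
            # Keep the buzzword whole, including its unusual characters.
            tokens.append(candidate)
        else:
            # Otherwise fall back to the normal rules: letters and digits only.
            tokens.extend(re.findall(r"[a-z0-9]+", chunk))
    return tokens

print(tokenize("Indexing C++ code, not C."))
# ['indexing', 'c', 'code', 'not', 'c']
print(tokenize("Indexing C++ code, not C.", buzzwords=["C++"]))
# ['indexing', 'c++', 'code', 'not', 'c']
```

Because this sketch splits on white space before it looks for buzzwords, a multi-word term can never match, which is the same kind of limitation described above.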


-- 
Bill Moseley
moseley@hank.org
Received on Tue Jul 15 15:02:31 2003