New alpha version swish-e-2.1.4

From: <jmruiz(at)not-real.boe.es>
Date: Fri Oct 13 2000 - 09:19:41 GMT
Hi all, I'm back

There is a new alpha (unstable) version of swish-e:

http://www.boe.es/swish-e/alpha/swish-e-2.1.4.tar.gz

It fixes several bugs in the previous 2.1.x releases, plus:

- More info with the extended header (option -x): the output now includes the 
header of each index file, and two new fields (indexfile and offset) are added 
to each result line. For compatibility with previous versions (1.3x and 
2.0.x), the offset is removed from the standard (without -x) results line.

- Better economic mode (option -e) to save more memory in the indexing process.

- Index files are smaller. I have compressed part of the word info using 
lookup-table techniques (frequency, metaname and structure
are small, repetitive data), so less I/O is required when searching. The tests 
I have made show a reduction of about 8-10% in the index file for non-HTML 
files. For this reason, old 2.1.x index files need to be reindexed.

- These are the new options in config file:
   DefaultContentType  [HTML|TXT|XML]    (*)
   Indexcontents [HTML|TXT|XML] .fileext1 .fileext2 (*)
   BumpPositionCounterCharacters string

(*) Only for FS at this moment
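
To illustrate the lookup-table idea from the smaller-index item above, here is a minimal Python sketch. It is not the swish-e code: the function names and the sample (frequency, metaname, structure) triples are made up for illustration. Each repeated triple is stored once in a table, and every word entry keeps only a small index into that table.

```python
def build_lookup(entries):
    """Replace repeated (frequency, metaname, structure) triples by table indexes."""
    table = []      # unique triples, in first-seen order
    index_of = {}   # triple -> position in table
    encoded = []    # one small integer per original entry
    for triple in entries:
        if triple not in index_of:
            index_of[triple] = len(table)
            table.append(triple)
        encoded.append(index_of[triple])
    return table, encoded


def decode(table, encoded):
    """Rebuild the original list of triples from the table and the indexes."""
    return [table[i] for i in encoded]


# Hypothetical sample data: five entries, only two distinct triples.
entries = [(1, 1, 9), (2, 1, 9), (1, 1, 9), (1, 1, 9), (2, 1, 9)]
table, encoded = build_lookup(entries)
restored = decode(table, encoded)
```

The saving comes from storing one small integer per entry instead of three values, which only pays off when the triples really do repeat often.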

TODO:
- First of all, fix the bugs you have reported

- Let the words "and", "or" and "not" be in a phrase (reported by Bill). (This 
also needs to be applied to 2.0.x).

- Add the ability to return the header info for each index file to the C library 
(equivalent to option -x).

- A built-in HTML spider written in C, to make things much easier. This would 
avoid the need for Perl.

- Make stemmer.c thread safe

- Make the index file smaller. Properties can make your index files really big. 
A deflate compression scheme would make it smaller, but the deflate format of 
the well-known zlib library does not allow direct access... I have very 
big index files because the properties use a lot of space (55% of the total 
size of the file). Consider that if you put all the info from your documents in 
the index file as properties, you do not need to access the files to get the 
data. This can be a good way to distribute all your data on, for example, a 
CDROM, without including the files themselves.

- What about the new soundex.c posted earlier? Should I add it?

- Option -k (will return all words of an index file starting with...). E.g.: 
"swish-e -k ac -f index.file" will return all indexed words starting with ac.
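
On the index-size item in the TODO list above: one possible way around the direct-access problem is to deflate each property record separately and keep a table of offsets, so that a single record can be read without decompressing everything before it. This is only a sketch under that assumption, not a decision about what swish-e will do. The function names and sample records are hypothetical.

```python
import zlib


def pack_properties(records):
    """Compress each record on its own and remember where it starts."""
    blob = bytearray()
    offsets = []  # (start, length) of each compressed record inside blob
    for text in records:
        chunk = zlib.compress(text.encode("utf-8"))
        offsets.append((len(blob), len(chunk)))
        blob.extend(chunk)
    return bytes(blob), offsets


def read_property(blob, offsets, n):
    """Decompress only record n, using its stored offset and length."""
    start, length = offsets[n]
    return zlib.decompress(blob[start:start + length]).decode("utf-8")


# Hypothetical property records, one per document.
records = [
    "Title of document one",
    "Summary text for document two",
    "Document three",
]
blob, offsets = pack_properties(records)
second = read_property(blob, offsets, 1)
```

The trade-off is that very short records compress poorly on their own, so the gain depends on how large each property is.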
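
The proposed -k lookup can be sketched as a prefix search over a sorted word list. This is a minimal Python illustration, assuming the indexed words are available in sorted order. It says nothing about how swish-e stores its words, and the function name and word list are invented for the example.

```python
import bisect


def words_with_prefix(sorted_words, prefix):
    """Return every word in sorted_words that starts with prefix."""
    # Binary search for the first word that is >= prefix.
    start = bisect.bisect_left(sorted_words, prefix)
    result = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        result.append(word)
    return result


# Hypothetical list of indexed words.
words = sorted(["access", "account", "action", "abc", "add", "ace"])
matches = words_with_prefix(words, "ac")
```

Because the list is sorted, all matching words sit next to each other, so the scan stops at the first word that does not match.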

If I missed something, let me know.

BTW, I will be at APACHECON Europe (http://apachecon.com/), in London, during the next week.

cu
Jose
Received on Fri Oct 13 09:24:21 2000