Hi all, I'm back
There is a new alpha (unstable) version of swish-e:
http://www.boe.es/swish-e/alpha/swish-e-2.1.4.tar.gz
It fixes several bugs in the previous 2.1.X releases, plus:
- More info with the extended header (option -x): the output now includes the
header of each index file, and two new parameters (indexfile and offset) are
added to each result line. For compatibility with previous versions (1.3x and
2.0.x), the offset is removed from the standard (without -x) results line.
- Better economic mode (option -e) to save more memory during the indexing process.
- Index files are smaller. I have compressed part of the word info using
lookup-table techniques (frequency, metaname and structure
are small, repetitive data), so less I/O is required when searching. My tests
show a reduction of about 8-10% in index file size for non-HTML files. For
this reason, old 2.1.X index files need to be reindexed.
- These are the new options in the config file:
DefaultContentType [HTML|TXT|XML] (*)
Indexcontents [HTML|TXT|XML] .fileext1 .fileext2 (*)
BumpPositionCounterCharacters string
(*) Only for FS at the moment
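As a rough illustration of the lookup-table idea mentioned above for the word info, here is a minimal sketch in C. It is not the actual swish-e code: the WordInfo and LookupTable types, the lookup_or_add function and the 256-entry limit are all hypothetical names and values chosen for the example. The point is only that each distinct (frequency, metaname, structure) combination is stored once, and every word entry then stores a small table index instead of the three values.

```c
/* Hypothetical word-info triple; field names are illustrative only
   and do not come from the swish-e sources. */
typedef struct {
    int frequency;
    int metaname;
    int structure;
} WordInfo;

/* Assumed limit so that an index fits in a single byte. */
#define MAX_ENTRIES 256

typedef struct {
    WordInfo entries[MAX_ENTRIES];
    int count;              /* number of distinct triples stored so far */
} LookupTable;

/* Return the table index of info, adding it if it is not yet present.
   Returns -1 when the table is full and info is new. */
static int lookup_or_add(LookupTable *t, WordInfo info)
{
    int i;

    /* Linear scan: the table is small because the data is repetitive. */
    for (i = 0; i < t->count; i++) {
        if (t->entries[i].frequency == info.frequency &&
            t->entries[i].metaname == info.metaname &&
            t->entries[i].structure == info.structure)
            return i;
    }
    if (t->count == MAX_ENTRIES)
        return -1;
    t->entries[t->count] = info;
    return t->count++;
}
```

With a table like this, a repeated triple costs one byte per occurrence in the index instead of three integers; a caller would set count to 0 before the first call and fall back to storing the raw values when -1 is returned.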
TODO:
- First of all, fix the bugs you have reported.
- Let the words "and", "or" and "not" be in a phrase (reported by Bill). (This
also needs to be applied to 2.0.x).
- Add the ability to return the header info for each index file to the C library
(equivalent to option -x).
- A built-in HTML spider written in C to make things much easier. This would avoid the need for Perl.
- Make stemmer.c thread safe
- Make index files smaller. Properties can make your index files really big. A
deflate compression scheme would make them smaller, but the well-known zlib
deflate compression format does not allow direct access... I have very
big index files because the properties use a lot of space (55% of the total
size of the file). Consider that if you put all the info from your documents in
the index file as properties, you do not need to access the files to get the data.
This can be a good way to distribute all your data on, for example, a CD-ROM,
without including the files themselves.
- What about the new soundex.c posted earlier? Should I add it?
- Option -k (will return all words of an index file starting with...). E.g.: "swish-e -k
ac -f index.file" will return all indexed words starting with ac.
If I missed something, let me know.
BTW, I will be at ApacheCon (http://apachecon.com/) Europe, in London, next week.
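To make the proposed -k behaviour concrete, here is a minimal sketch in C of a prefix lookup. It assumes the indexed words are available as an in-memory array sorted in strcmp order, which is an assumption for the example and not a description of the real index file layout; the function names lower_bound and words_with_prefix are hypothetical too.

```c
#include <stddef.h>
#include <string.h>

/* Binary search: position of the first word that is >= prefix
   in a list sorted in strcmp order. */
static size_t lower_bound(const char *const *words, size_t n,
                          const char *prefix)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(words[mid], prefix) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Return how many words start with prefix and store the index of the
   first one in *first. In a sorted list the matches are contiguous,
   so a forward scan from the lower bound finds them all. */
static size_t words_with_prefix(const char *const *words, size_t n,
                                const char *prefix, size_t *first)
{
    size_t len = strlen(prefix);
    size_t start = lower_bound(words, n, prefix);
    size_t end = start;

    while (end < n && strncmp(words[end], prefix, len) == 0)
        end++;
    *first = start;
    return end - start;
}
```

For the "ac" example, a caller would print words[first] through words[first + count - 1], where count is the value returned by words_with_prefix.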
cu
Jose
Received on Fri Oct 13 09:24:21 2000