Features planned for 3.0
Swish-e 3.0 (abbreviated Swish3) will be a complete overhaul of the code. You can track development progress here. Major feature improvements will include:
- Unicode support
- Unicode is the international standard for character encoding. Swish3 will support the UTF-8 encoding, which should handle all major languages in the world (as originally designed, UTF-8 can encode up to 2,147,483,648 code points; Unicode itself defines a code space of 1,114,112 code points).
The Swish-e developers need input from non-English language experts. Please contribute to the discussion on the Swish-e mailing list.
Some significant known issues include:
- lowercase vs. UPPERCASE
- Version 2.x uses tolower() to lowercase all characters before searching and indexing. Should the same approach be used for UTF-8? Will this have a significant impact on usability for non-English languages?
- Version 2.x uses an internal table to support wildcard searching with *. The table assumes 8-bit (non-Unicode) character encoding. That approach will likely need to be re-thought for multibyte encodings like UTF-8.
- Version 2.x uses five different configuration options to control how a 'word' (token) is defined. The basic assumption is that a word is defined by which characters it includes. That assumption rests on a manageable character set of 256 characters. The sheer size of the Unicode character set makes that system unworkable. Instead, some kind of regular expression library will likely be used.
- The stemmers used will need full international support.
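
The case, wildcard, and word-definition questions above can be illustrated with a small sketch. It is written in Python purely for illustration and is not Swish3 code. All function names are hypothetical. `ascii_lower` only approximates what a byte-wise tolower() does in the C locale, and the `\w+` pattern is just one possible regular-expression definition of a word.

```python
import re
import unicodedata
from typing import List

def ascii_lower(data: bytes) -> bytes:
    """Lowercase only ASCII A-Z, byte by byte (an approximation of
    tolower() in the C locale; bytes >= 0x80 pass through unchanged)."""
    return bytes(b + 32 if 65 <= b <= 90 else b for b in data)

def unicode_fold(text: str) -> str:
    """Normalize to NFC, then apply Unicode case folding."""
    return unicodedata.normalize("NFC", text).casefold()

# For str patterns, \w matches Unicode letters, digits, and underscore.
WORD_RE = re.compile(r"\w+")

def tokenize(text: str) -> List[str]:
    """Split text into case-folded word tokens using a regular expression."""
    return WORD_RE.findall(unicode_fold(text))

def wildcard_prefix(pattern: str, words: List[str]) -> List[str]:
    """Match a trailing-* wildcard against already folded words."""
    prefix = unicode_fold(pattern.rstrip("*"))
    return [w for w in words if w.startswith(prefix)]

if __name__ == "__main__":
    sample = "Ärger STRASSE Straße"
    print(ascii_lower(sample.encode("utf-8")).decode("utf-8"))
    print(tokenize(sample))
    print(wildcard_prefix("Ä*", tokenize(sample)))
```

On the sample text, the byte-wise version leaves "Ä" and "ß" untouched, while case folding maps "Ä" to "ä" and folds "ß" to "ss", so "STRASSE" and "Straße" become the same token. Whether that behavior is desirable for every language is exactly the kind of question that needs input from native speakers.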
- Configuration format
- Since Swish-e depends on a configuration file for StopWords, character definitions, etc., the parsing of the configuration file must support UTF-8 as well. The current idea is to switch to XML-style configuration files and use libxml2 to parse them.
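
As a rough sketch of what a UTF-8 XML configuration could look like, the example below reads a stopword list from an XML document. The element layout (`swish`, `StopWords`, `word`) is an assumption made for illustration, not a published Swish3 format, and Python's standard `xml.etree.ElementTree` stands in for libxml2 here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout; only the StopWords name comes from Swish-e.
CONFIG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<swish>
  <StopWords>
    <word>the</word>
    <word>für</word>
    <word>это</word>
  </StopWords>
</swish>
"""

def load_stopwords(xml_bytes: bytes) -> set:
    """Parse the XML bytes and return the stopwords as a set of strings."""
    root = ET.fromstring(xml_bytes)
    return {w.text.strip() for w in root.findall("./StopWords/word") if w.text}

if __name__ == "__main__":
    # The parser honors the encoding declared in the XML prolog.
    print(sorted(load_stopwords(CONFIG_XML.encode("utf-8"))))
```

Because the XML parser handles the declared encoding, the application receives decoded text and does not need its own byte-level parsing of non-ASCII stopwords.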
- Incremental indexing
- Swish3 will support true incremental indexing. This will allow document records to be added, modified, and deleted in an existing index. This feature may or may not build on the version 2.x experimental btree/incremental feature.
- Swish3 will reliably scale to larger (multimillion) document collections.
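
To make "incremental" concrete, here is a toy in-memory inverted index that supports adding, replacing, and deleting individual documents without rebuilding everything. It is only a sketch of the general idea. The class and method names are hypothetical, and it says nothing about how Swish3 will store its index on disk or scale to large collections.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index with per-document add, replace, and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> set of its terms

    def add(self, doc_id, text):
        """Add a document; if the id already exists, replace the old record."""
        if doc_id in self.doc_terms:
            self.delete(doc_id)
        terms = set(text.casefold().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and drop any posting lists left empty."""
        for term in self.doc_terms.pop(doc_id, set()):
            docs = self.postings[term]
            docs.discard(doc_id)
            if not docs:
                del self.postings[term]

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.casefold(), set()))

if __name__ == "__main__":
    idx = TinyIndex()
    idx.add("a.html", "Swish search engine")
    idx.add("b.html", "search index")
    idx.add("a.html", "indexing only")  # modifies the existing record
    print(idx.search("search"))
```

Keeping a forward map from each document to its terms is what makes deletion cheap in this sketch: the index knows which posting lists to touch without scanning all of them.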
- Indexing API
- Swish3 will include an indexing API in addition to the current searching API.
- Streamlined feature set
- Swish3 will drop several features found in the current version:
- Expat parsers
- -S http indexing method and related configuration options
- Older stemmers
- Current native index format
- Alternate index backends
- Swish3 will offer alternate index backends using available open source libraries, such as Xapian, HyperEstraier, Lucene, or Lemur.
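
One way to picture alternate backends is a small interface that the indexer calls without knowing which library sits behind it. The sketch below is an assumption about the general shape of such a design, not the Swish3 API. `IndexBackend`, `MemoryBackend`, and `index_and_search` are hypothetical names, and the in-memory backend merely stands in for a real library such as Xapian.

```python
from abc import ABC, abstractmethod

class IndexBackend(ABC):
    """Hypothetical minimal contract that every backend would implement."""

    @abstractmethod
    def add_document(self, doc_id: str, terms: list) -> None:
        ...

    @abstractmethod
    def search(self, term: str) -> list:
        ...

class MemoryBackend(IndexBackend):
    """Stand-in backend that keeps posting lists in a dictionary."""

    def __init__(self):
        self._postings = {}

    def add_document(self, doc_id, terms):
        for term in terms:
            self._postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term, ()))

def index_and_search(backend: IndexBackend, docs: dict, term: str) -> list:
    """Index every document through the interface, then run one query."""
    for doc_id, terms in docs.items():
        backend.add_document(doc_id, terms)
    return backend.search(term)

if __name__ == "__main__":
    docs = {"1": ["swish", "index"], "2": ["index"]}
    print(index_and_search(MemoryBackend(), docs, "index"))
```

The calling code depends only on the abstract interface, so swapping in a different backend would not require changes to the indexing logic.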