Re: [swish-e] Aid with Swish3 Unicode feature

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Jan 07 2008 - 15:51:16 GMT
On 01/06/2008 04:37 PM, Itamar Syn-Hershko wrote:
> Hi all,
>  
> I'm a C++ developer, and found Swish-e not too long ago while
> researching the net for an indexing service (or algorithm) I could use
> for a private project. With this project, I'm aiming to provide a good
> tool for indexing content and making the index files portable and
> searchable by accompanying software. This application of mine should
> take into account that it is going to be run on possibly weaker systems
> and from a CD-ROM drive (occasionally).
>  

Hi Itamar,

It sounds like Xapian might be more up your alley: http://www.xapian.org/

Swish3 will use it as one possible backend.

>  
> I was wondering whether someone could explain in simple terms what the
> index file looks like in detail - what data structure the words and
> their related info are stored in, and the reading process in
> short. I have it half-figured by now, but the whole thing of COMPRESS
> and DECOMPRESS got me lost... (which I would also appreciate if someone
> would explain in short). After I see how Swish-e does it, I will
> either claim mine is better and share, or use that approach myself and
> perhaps tweak it...

The Swish 2.4.x index version comes in 2 flavors: the 'native' (default) format and the
'btree' format. Neither of them is well documented (as you have discovered), and the
btree format is still labeled experimental.

The 2.6 branch (http://svn.swish-e.org/swish-e/branches/2.6/) uses Berkeley DB as a
backend. I think that code would be easier to grok, and it supports the incremental
features that 2.4 native does not.

I would guess that the compress/decompress stuff you are seeing is for the properties
file, which functions semi-independently of the index proper. The properties file just
stores parsed textual content (often compressed) from the original document collection for
later retrieval. It is not used in searches at all; just for reporting results.
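
A rough sketch of that split between index and properties, in Python for brevity. This is
only an illustration of the idea, not Swish-e's on-disk format: the names (index,
properties, add_document, search, report) are hypothetical, and zlib stands in for
whatever compression the properties file actually uses.

```python
import zlib

# Hypothetical in-memory stand-ins; NOT Swish-e's actual file layout.
index = {}        # word -> set of document ids (consulted when searching)
properties = {}   # document id -> compressed text (consulted only when reporting)

def add_document(doc_id, text):
    # Record every word in the index, then store the text compressed.
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)
    properties[doc_id] = zlib.compress(text.encode("utf-8"))

def search(word):
    # Only the index is touched here; nothing is decompressed.
    return sorted(index.get(word.lower(), set()))

def report(doc_id):
    # Decompress the stored text only when a result is displayed.
    return zlib.decompress(properties[doc_id]).decode("utf-8")

add_document(1, "Swish-e indexes documents")
add_document(2, "Xapian indexes documents too")
hits = search("xapian")
```

The point is that search() never needs the compressed data, so the decompress step only
costs anything for the handful of results that are actually shown.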

> Last but not least, how does the Hashing function used in Swish-e (at
> least with 2.x) work, and would it work properly for both English and
> Hebrew words with no hash collision?
>  

2.4 only supports single-byte encodings, so that version seems like a non-starter for you,
but in any case, I don't know enough about the hashing functions in 2.4 to answer.
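
To make the encoding point concrete, here is a small Python sketch. The hash below is a
generic djb2-style function over UTF-8 bytes, chosen purely for illustration; it is an
assumption on my part and not the function 2.4 uses. The names byte_hash and table are
hypothetical too. Any hash that maps words to a fixed number of buckets can collide,
whatever the language, so the table keeps a list per bucket rather than assuming
collisions never happen.

```python
def byte_hash(word, buckets=1009):
    # djb2-style hash over the UTF-8 bytes; NOT the function Swish-e uses.
    h = 5381
    for b in word.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFF
    return h % buckets

english = "peace"
hebrew = "שלום"

# Each Hebrew letter takes two bytes in UTF-8, so byte count != letter count.
print(len(english), len(english.encode("utf-8")))
print(len(hebrew), len(hebrew.encode("utf-8")))

# Chained hash table: colliding words share a bucket instead of overwriting.
table = {}
for w in (english, hebrew):
    table.setdefault(byte_hash(w), []).append(w)
```

Because the hash only sees bytes, it does not care which script the word is written in.
Code that assumes one byte per character is what breaks on multi-byte text.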


> BTW, have you tried ICU yet? (it has C libraries afaik, and also a regex
> library): http://www.icu-project.org/.

I did look at ICU for Swish3 but rejected it because it seemed too large. But I may
re-visit that decision eventually, and there is currently support in libswish3 for
alternate tokenizers.

> Also, as far as HTML/XML tokenizers for the indexing process, you should
> have a look at this one:
> http://www.codeproject.com/cpp/HTML_XML_Scanner.asp.
>  

That's cool.

Swish3 is using libxml2 for more than just parsing; it has buffer, hashing, iconv and i/o
features that are helpful too.

-- 
Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/

Received on Mon Jan 7 10:51:16 2008