Re: Unexpected index file size reduction

From: <jmruiz(at)not-real.boe.es>
Date: Fri Sep 27 2002 - 15:27:03 GMT
Hi Lauren,

On 27 Sep 2002, at 5:34, Lauren wrote:

> I could use some simplifying translation here.  What is puzzling me is
> not that the size dropped _when_ we moved to swish-e 2.2, but that it
> dropped at a subsequent time, when we hadn't made any updates to our
> version.  We put 2.2 in place and used it for several months.  The
> index file grew gradually to 35.9 meg; then the next week it was much
> smaller with no evident change on _our_ end. I am convinced that no
> .html files are being skipped.
> 
> So my question is this: Jose: Is there something in your compression
> routines that could result in a decrease that large just by my
> _adding_ some files to be indexed?  I'm hoping for some illumination
> in the form of ideas about what could trigger such a fortuitous and
> dramatic result. 
>   (For example: One simplistic theory is that you've got a compression
>   

So, all your files are HTML files, right?

Well, that is correct. I quite recently added a feature that compresses
the structure information (IN_BODY, IN_FILE, etc.) more efficiently. The
technique uses a bit flag to indicate that the word occurs only
in the body (this seems to be a very common case). When
that is the case, 1 byte per position is saved. So, if the word
occurs 7 times in the file and only in the body, then we have
saved 7 bytes.
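
To make the arithmetic concrete, here is a rough sketch of the idea in Python. It is not the actual swish-e index format: the constant values, the one-byte header layout, and the function names are all assumptions made for illustration, and it only handles small counts and positions that fit in a byte.

```python
# Hypothetical structure values; swish-e's real constants may differ.
IN_FILE = 0x01
IN_BODY = 0x08

# Assumed layout: the high bit of a per-word header byte marks "body only".
BODY_ONLY_FLAG = 0x80


def encode_naive(postings):
    """Store one position byte plus one structure byte per occurrence."""
    out = bytearray([len(postings)])
    for position, structure in postings:
        out += bytes([position, structure])
    return bytes(out)


def encode_flagged(postings):
    """Omit the structure byte when every occurrence is in the body only."""
    body_only = all(structure == IN_BODY for _, structure in postings)
    header = len(postings) | (BODY_ONLY_FLAG if body_only else 0)
    out = bytearray([header])
    for position, structure in postings:
        out.append(position)
        if not body_only:
            # Mixed structures: keep the per-position structure byte.
            out.append(structure)
    return bytes(out)


# A word that occurs 7 times in one file, always in the body only.
postings = [(p, IN_BODY) for p in (3, 10, 24, 31, 47, 52, 60)]
saved = len(encode_naive(postings)) - len(encode_flagged(postings))
print(saved)  # 7: one structure byte saved per position
```

Across many words and many files, a per-position saving like this could plausibly add up to a noticeable drop in index size.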

cu
Jose
Received on Fri Sep 27 15:30:42 2002