Bill> Which index file? index.swish-e or index.swish-e.prop? If the .prop file
> is the one that is smaller, that may be due to zlib compression.
They both suddenly got smaller, but the one I reported as dropping from
35.9 to 25.5 meg was index.swish-e.
Bill> Jose is constantly working to compress the index file, so although I can't
> remember a specific change, it's possible you are seeing the results of his
> efforts.
>
Jose> I have spent a lot of effort on this issue in the most recent versions.
> How small it should be depends on the type of docs. Also, Bill
> added zlib support for properties.
I could use some simplifying translation here. What is puzzling me is
not that the size dropped _when_ we moved to swish-e 2.2, but that it
dropped at a subsequent time, when we hadn't made any updates to our
version. We put 2.2 in place and used it for several months. The index
file grew gradually to 35.9 meg; then the next week it was much smaller
with no evident change on _our_ end. I am convinced that no .html files
are being skipped.
So my question is this: Jose: Is there something in your compression
routines that could result in a decrease that large just by my _adding_
some files to be indexed? I'm hoping for some illumination in the form
of ideas about what could trigger such a fortuitous and dramatic result.
(For example: One simplistic theory is that you've got a compression
routine that kicks in at 36 meg. Another is that compression that dramatic
could result from certain words being eliminated from the index because
they've crossed some threshold. [That latter theory seems unlikely to
me: all I did was add about 20 pages to the website.])
Bill> Two hours seems like a long time to fetch 4000 files. I suppose you have a
> delay to keep from hitting your server too hard.
>
> If you use the spider.pl and the keep_alive feature then you should be able
> to spider much faster without much load on the server (depending on your
> available bandwidth, of course).
Thanks! I'll check into these hints.
The estimate of 4000 files is low because I haven't quite been able to
get the robot to quit indexing a lot of files that are generated
on-the-fly by some .pl programs I've got in place, even when I use
no-index and no-follow. I'm not sure what I'm doing wrong, but the two
hours is only a small annoyance in the scheme of things, so I never
really worked on this matter. There are some .pl files I _do_ want to
index, and the other ones I filter out of the Results page seen by the
user.
Although I believe I didn't change anything along these lines in the
last round, I'm thinking that one explanation for the smaller index file
size could be related to these .pl files. Maybe something I did
_did_ cause the .pl files to no longer be indexed. I can certainly
believe that that might cause a precipitous drop in the index file size.
If Jose has no immediate ideas, I'll do another run and look at the
log file to see if this is the answer!
Lauren
Received on Fri Sep 27 12:38:09 2002