Different number of indexed words when indexing a large amount of data

From: Rodolfo Martinez <macr111080(at)not-real.yahoo.com.mx>
Date: Thu Apr 20 2006 - 23:23:46 GMT
Hi list,

I am seeing strange behavior when indexing a large amount of data.
The tree holds about 22 GB, including images, PDF files and MS Word
files, but only the .htm, .html and .txt files are indexed
(~157,000 files).

The problem is that I get two different indexed word counts from the
same data. For example, here are some output lines from running:
swish-e -c swish.config
(you can see the config file at the end of this email):

==========Output1 begin==========
Parsing config file 'swish.conf'
Indexing Data Source: "File-System"
Indexing "../disk2/Info"
..
In dir "../disk2/Info/ebsp/apac/cn":
  benefits.htm - Using HTML2 parser -  (43 words)
..
617,504 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.                                              
157,686 files indexed.  2,965,919,331 total bytes.  121,765,998 total words.
Elapsed time: 01:21:49 CPU time: 00:13:25
Indexing done!
==========Output1 end==========

Another output from the same command:

==========Output2 begin==========
Parsing config file 'swish.conf'
Indexing Data Source: "File-System"
Indexing "../disk2/Info"
..
In dir "../disk2/Info/ebsp/apac/cn":
  benefits.htm - Using HTML2 parser -  (40 words)
..
617,584 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.                                              
157,686 files indexed.  2,965,919,331 total bytes.  121,765,813 total words.
Elapsed time: 01:26:31 CPU time: 00:12:59
Indexing done!
==========Output2 end==========

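As you can see, the number of unique indexed words and the number of
total words differ between the two runs (617,504 vs. 617,584 unique
words; 121,765,998 vs. 121,765,813 total words), even though the file
count and the total bytes are identical.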
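To quantify the gap, the summary lines of the two runs can be parsed and
subtracted. The following is a minimal Python sketch, not part of SWISH-E:
the function names are hypothetical, and the regular expressions assume the
summary lines look exactly like the ones quoted in the two outputs.

```python
import re

# Patterns modelled on the summary lines quoted in the two outputs
# (assumption: other SWISH-E versions may word these lines differently).
UNIQUE_RE = re.compile(r"^([\d,]+) unique words indexed\.")
TOTALS_RE = re.compile(
    r"^([\d,]+) files indexed\.\s+([\d,]+) total bytes\.\s+([\d,]+) total words\."
)


def to_int(text):
    """Convert a number with thousands separators, e.g. '617,504', to int."""
    return int(text.replace(",", ""))


def summarize(lines):
    """Collect the summary counters found in one indexing log."""
    summary = {}
    for line in lines:
        match = UNIQUE_RE.match(line)
        if match:
            summary["unique_words"] = to_int(match.group(1))
            continue
        match = TOTALS_RE.match(line)
        if match:
            summary["files"] = to_int(match.group(1))
            summary["total_bytes"] = to_int(match.group(2))
            summary["total_words"] = to_int(match.group(3))
    return summary


def summary_delta(first, second):
    """Return second minus first for every counter present in both runs."""
    return {key: second[key] - first[key] for key in first if key in second}


if __name__ == "__main__":
    run1 = [
        "617,504 unique words indexed.",
        "157,686 files indexed.  2,965,919,331 total bytes.  121,765,998 total words.",
    ]
    run2 = [
        "617,584 unique words indexed.",
        "157,686 files indexed.  2,965,919,331 total bytes.  121,765,813 total words.",
    ]
    print(summary_delta(summarize(run1), summarize(run2)))
```

With the figures quoted here, the second run has 80 more unique words but
185 fewer total words, while the file count and byte count do not change.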

After the indexing process finishes, I extract the keywords with the
command:
swish-e -k* > swish_keyword.out
and I noticed a pattern in the sizes of the keyword files. For example:
macr@linux:~/SearchEngine/Golden> ls -l
-rw-r--r--  1 macr users 5172418 2006-04-10 08:51 swish_keyword.out1
-rw-r--r--  1 macr users 5173104 2006-04-10 09:06 swish_keyword.out2
-rw-r--r--  1 macr users 5172418 2006-04-10 09:17 swish_keyword.out3
-rw-r--r--  1 macr users 5173104 2006-04-10 10:08 swish_keyword.out4

Notice that output file 1 matches output file 3 in size (5172418 bytes)
and output file 2 matches output file 4 (5173104 bytes). This pattern
holds if I keep indexing and extracting the keywords.
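File sizes only show that the dumps differ. Listing the words that appear in
one dump but not the other may hint at which documents are parsed differently.
Below is a minimal Python sketch with hypothetical function names. It assumes
the keyword dump can be treated as whitespace-separated tokens and that lines
starting with '#' are headers; that is an assumption about the -k output
format, not something verified against 2.4.3.

```python
def load_keywords(lines):
    """Return the set of tokens in a keyword dump.

    Assumption: lines starting with '#' are header lines, and everything
    else is a whitespace-separated list of keywords.
    """
    words = set()
    for line in lines:
        if line.startswith("#"):
            continue
        words.update(line.split())
    return words


def keyword_diff(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as two sorted lists of keywords."""
    words_a = load_keywords(lines_a)
    words_b = load_keywords(lines_b)
    return sorted(words_a - words_b), sorted(words_b - words_a)


if __name__ == "__main__":
    # Tiny made-up sample; real input would be the lines of two dump files.
    dump_a = ["# header line", "alpha beta"]
    dump_b = ["# header line", "alpha gamma"]
    only_a, only_b = keyword_diff(dump_a, dump_b)
    print("only in first dump:", only_a)
    print("only in second dump:", only_b)
```

To run this against real dumps, the lines of swish_keyword.out1 and
swish_keyword.out2 could be read with open(...) and passed in place of the
sample lists.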

I have only seen this behavior when indexing all of the data; if I
index just a few directories, I always get the same number of indexed
words.
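Because the discrepancy only shows up on the full data set, one way to narrow
it down is to compare the per-file word counts from two full runs. The sketch
below is a hypothetical Python helper, not part of SWISH-E. It assumes both
runs were captured to text and that the directory and file lines look like the
ones quoted in the outputs earlier in this message.

```python
import re

# Patterns modelled on the per-directory and per-file lines quoted above
# (assumption: this is the layout produced with IndexReport 3).
DIR_RE = re.compile(r'^In dir "(?P<dir>[^"]*)":')
FILE_RE = re.compile(
    r"^\s*(?P<name>.+?) - Using (?P<parser>\S+) parser -\s+\((?P<count>\d+) words\)"
)


def per_file_counts(lines):
    """Map 'directory/filename' to the word count reported for that file."""
    counts = {}
    current_dir = ""
    for line in lines:
        match = DIR_RE.match(line)
        if match:
            current_dir = match.group("dir")
            continue
        match = FILE_RE.match(line)
        if match:
            path = current_dir + "/" + match.group("name")
            counts[path] = int(match.group("count"))
    return counts


def differing_files(counts_a, counts_b):
    """Return {path: (count_a, count_b)} for files whose counts differ."""
    shared = counts_a.keys() & counts_b.keys()
    return {
        path: (counts_a[path], counts_b[path])
        for path in sorted(shared)
        if counts_a[path] != counts_b[path]
    }


if __name__ == "__main__":
    run1 = [
        'In dir "../disk2/Info/ebsp/apac/cn":',
        "  benefits.htm - Using HTML2 parser -  (43 words)",
    ]
    run2 = [
        'In dir "../disk2/Info/ebsp/apac/cn":',
        "  benefits.htm - Using HTML2 parser -  (40 words)",
    ]
    for path, pair in differing_files(
        per_file_counts(run1), per_file_counts(run2)
    ).items():
        print(path, pair)
```

A list of files whose counts change between runs would make it possible to
re-index only those files and check whether the difference can be reproduced
on a small set.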

Here is my system description:
OS: SuSE Linux Enterprise Server 9 Service Pack 3 (kernel-2.6.5-smp)
CPU: Intel(R) Pentium 4 (3.00 GHz with HT)
RAM: 1 GB
SWISH-E 2.4.3
libxml2-2.6.7

And my swish.config file is:
========= swish.config begin ==========
IndexReport   3
IndexDir   ../disk2/Info
IndexOnly   .htm  .html  .txt
IndexContents  TXT2  .txt
DefaultContents  HTML2
StoreDescription  HTML2 <body> 80
#Filesystem in ../disk2 is ext3
ReplaceRules replace "../disk2/Info"
========= swish.config end ==========

Any idea why this is happening?

Best Regards,

Rodolfo.


Received on Thu Apr 20 16:23:58 2006