Re: Indexing differs for 2 lines swapped in file

From: <moseley(at)>
Date: Sat Oct 25 2003 - 16:02:14 GMT
On Fri, Oct 24, 2003 at 11:23:03PM -0700, Dominique Phommahaxay wrote:

> . swapping 2 records in a file leads to different indexing output and search results (which is incorrect).

Yes, if that's true.

> . the search is returning proper results as long as the indexing is properly processed.

Sounds logical.

> 5. What now?
> ============
> How can I help solving this issue (providing assistance, uploading files for test -- they are huge files...)? Please advise.

moseley@bumby:~$ wc -l books.txt
15650 books.txt

moseley@bumby:~$ ls -l books.txt
-rw-r--r--    1 moseley  moseley   1815405 2003-10-25 07:50 books.txt

moseley@bumby:~$ fgrep -n J2Ee books.txt 
15650:0672317958|PAP|Building Java Enterprise System With J2Ee|BOOK & CD|COM|ENG|Perrone,Paul J./ Chaganti, Venkata S.R.R.|...

moseley@bumby:~$ swish-e -i books.txt
Indexing Data Source: "File-System"
Indexing "books.txt"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 19 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
19 unique words indexed.
4 properties sorted.                                              
1 file indexed.  1815405 total bytes.  297351 total words.
Elapsed time: 00:00:23 CPU time: 00:00:23
Indexing done!

moseley@bumby:~$ swish-e -w j2ee
# SWISH format: 2.4.0-pr4
# Search words: j2ee
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.042 seconds
1000 books.txt "books.txt" 1815405

moseley@bumby:~$ swish-e -w j2e\*
# SWISH format: 2.4.0-pr4
# Search words: j2e*
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.043 seconds
1000 books.txt "books.txt" 1815405

So that doesn't help you very much.

So you are saying that J2Ee is not being indexed (or not being found
while searching), right?

So first thing, see if it's being indexed:

moseley@bumby:~$ swish-e -i books.txt -T indexed_words | grep -i j2ee
    Adding:[1:swishdefault(1)]   'j2ee'   Pos:297340  Stuct:0x9 ( BODY FILE )

If that doesn't work, then look into the details of the -T indexed_words
output.  That will likely give you a clue.

If that does work, then paste the above output and your search string
into your next email.  After you have done the -T indexed_words check,
try to find the *smallest* possible file that still shows the problem,
and make that available.
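
One rough way to get to a small test case is sketched below.  It builds a
three-record file and a copy with two records swapped, then runs the same
indexed_words check on both.  The swap_lines helper, the file names
small.txt and small_swapped.txt, and the sample records are made up for
illustration; only the -i and -T indexed_words options shown above are
used, and the swish-e step is skipped if swish-e is not installed.

```shell
#!/bin/sh
# Sketch: build a small test file and a copy with two records swapped.
# swap_lines and the sample records are illustrative, not from books.txt.

# Print FILE ($1) with lines $2 and $3 exchanged.
swap_lines() {
    awk -v a="$2" -v b="$3" '
        { line[NR] = $0 }
        END {
            tmp = line[a]; line[a] = line[b]; line[b] = tmp
            for (i = 1; i <= NR; i++) print line[i]
        }' "$1"
}

printf '%s\n' \
    '0000000001|PAP|First Sample Title|BOOK|COM|ENG|Author, A.' \
    '0000000002|PAP|Second Sample Title|BOOK|COM|ENG|Author, B.' \
    '0000000003|PAP|Sample System With J2Ee|BOOK & CD|COM|ENG|Author, C.' \
    > small.txt

swap_lines small.txt 1 3 > small_swapped.txt

# Run the same indexed_words check on both orderings, if swish-e is installed.
if command -v swish-e >/dev/null 2>&1; then
    for f in small.txt small_swapped.txt; do
        echo "== $f"
        swish-e -i "$f" -T indexed_words | grep -i j2ee || echo "j2ee not indexed"
    done
fi
```

If the two orderings print different results for the same word, those two
small files are the ones to send.  With real data you would replace the
sample records with the lines from your own file that trigger the problem.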

Bill Moseley
Received on Sat Oct 25 16:16:50 2003