Re: diff'ing indexes

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Oct 14 2004 - 16:19:32 GMT
Bill Moseley wrote on 10/13/2004 11:05 PM:

> That is, if the order of the metanames in each index is different and
> that causes problems then that's a bug.


I was able to duplicate the behaviour I experienced. I still don't know 
if it's a bug or I'm just missing something, but here it all is. I have 
seen this under both 2.4.1 and the most recent CVS build, so I know it's 
not just some recent change.

I haven't even looked at the source code yet. If I can get more time on 
this, I will...

In sum:

I made two HTML files, file1.html and file2.html. Indexing them together 
works fine. Indexing just one file with the metaname order reversed in 
the config also works fine. But when I merge the two indexes, odd things 
start to happen. If I instead index the one file with the same config 
file as the original two, the merge works as expected.

A -T index_all dump of each merged index (one built with identical 
configs, one with the reversed-order config) shows that in the 
reversed-order version the same metaname ends up with different meta 
numbers for file1.html and file2.html. See the dumps below.

Details:

This is long. Sorry.
The files have identical meta content, but slightly different words 
overall (just for comparison).

pubs@topaz08 170% cat file1.html
<html>
<head>
<title>file1</title>
<meta name='metaA' content='foo'>
<meta name='metaB' content='bar'>
</head>

<body>
some content
</body>
</html>

pubs@topaz08 171% cat file2.html
<html>
<head>
<title>file2</title>
<meta name='metaA' content='foo'>
<meta name='metaB' content='bar'>
</head>

<body>
some more content
</body>
</html>


I created a simple config:

MetaNames metaA metaB
PropertyNames metaA metaB
IgnoreTotalWordCountWhenRanking 0

Then I indexed the two files. A search works as expected:

pubs@topaz08 172% swish-e -w 'metaa=foo'
# SWISH format: 2.4.1
# Search words: metaa=foo
# Removed stopwords:
# Number of hits: 2
# Search time: 0.068 seconds
# Run time: 0.112 seconds
1000 file2.html "file2" 153
1000 file1.html "file1" 148
.

Then I created a new config, identical but with the order of the 
metanames and properties reversed:

pubs@topaz08 168% cat configafter
MetaNames metaB metaA
PropertyNames metaB metaA
IgnoreTotalWordCountWhenRanking 0

I indexed just file1.html and tested the search:

pubs@topaz08 173% swish-e -w 'metaa=foo' -f fileone.index
# SWISH format: 2.4.1
# Search words: metaa=foo
# Removed stopwords:
# Number of hits: 1
# Search time: 0.082 seconds
# Run time: 0.126 seconds
1000 file1.html "file1" 148
.

All good so far.

Then I merged fileone.index and the index.swish-e:

swish-e -M fileone.index index.swish-e newmerge

Now the search does not work as expected:

pubs@topaz08 174% swish-e -w 'metaa=foo' -f newmerge
# SWISH format: 2.4.1
# Search words: metaa=foo
# Removed stopwords:
# Number of hits: 1
# Search time: 0.092 seconds
# Run time: 0.135 seconds
1000 file1.html "file1" 148
.

file2.html is missing.

Here's the dump of the merged index with the config order reversed. NOTE 
that each file gets different meta numbers, even though the metanames 
(metaA and metaB) are the same:

pubs@topaz08 165% swish-e -T index_all -f newmerge
...
-----> METANAMES for newmerge <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
        swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
            swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
         swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
          swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
         swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*
           swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
         swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
    swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
                metab : id=10 type= 1  META_INDEX  Rank Bias=  0
                metaa : id=11 type= 1  META_INDEX  Rank Bias=  0
                metab : id=12 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
                metaa : id=13 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*


-----> WORD INFO in index newmerge <-----

bar
  Meta:10 file1.html Freq:1 Pos/Struct:8/85
  Meta:11 file2.html Freq:1 Pos/Struct:8/85

content
  Meta:1 file1.html Freq:1 Pos/Struct:12/9
  Meta:1 file2.html Freq:1 Pos/Struct:13/9

file1
  Meta:1 file1.html Freq:1 Pos/Struct:2/7

file2
  Meta:1 file2.html Freq:1 Pos/Struct:2/7

foo
  Meta:10 file2.html Freq:1 Pos/Struct:5/85
  Meta:11 file1.html Freq:1 Pos/Struct:5/85

more
  Meta:1 file2.html Freq:1 Pos/Struct:12/9

some
  Meta:1 file1.html Freq:1 Pos/Struct:11/9
  Meta:1 file2.html Freq:1 Pos/Struct:11/9


-----> FILES in index newmerge <-----
Dumping File Properties for File Number: 1
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"

Dumping File Properties for File Number: 2
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"


Number of File Entries: 2
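
To make the mismatch easier to see, here is a minimal Python sketch that reads the METANAMES and WORD INFO sections of a -T index_all dump and groups the postings by metaname instead of by meta number. It only assumes the line layout visible in the dump above. The function name postings_by_metaname and the regular expressions are mine and are not part of swish-e. SAMPLE is an excerpt of the dump above.

```python
import re
from collections import defaultdict

# Excerpt of the reversed-config newmerge dump shown above.
SAMPLE = """\
-----> METANAMES for newmerge <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
                metab : id=10 type= 1  META_INDEX  Rank Bias=  0
                metaa : id=11 type= 1  META_INDEX  Rank Bias=  0

-----> WORD INFO in index newmerge <-----

bar
  Meta:10 file1.html Freq:1 Pos/Struct:8/85
  Meta:11 file2.html Freq:1 Pos/Struct:8/85

foo
  Meta:10 file2.html Freq:1 Pos/Struct:5/85
  Meta:11 file1.html Freq:1 Pos/Struct:5/85
"""

# Only META_INDEX entries matter for searching; property entries are skipped.
META_RE = re.compile(r"^\s*(\S+) : id=\s*(\d+) type=\s*\d+\s+META_INDEX")
POSTING_RE = re.compile(r"^\s+Meta:(\d+) (\S+) Freq:")


def postings_by_metaname(dump):
    """Map (metaname, word) -> list of files, using the dump's own id table."""
    id_to_name = {}
    raw = []      # (word, meta id, file) in dump order
    word = None   # most recent unindented line, taken as the current word
    for line in dump.splitlines():
        meta = META_RE.match(line)
        posting = POSTING_RE.match(line)
        if meta:
            id_to_name[int(meta.group(2))] = meta.group(1)
        elif posting and word is not None:
            raw.append((word, int(posting.group(1)), posting.group(2)))
        elif line and not line[0].isspace():
            word = line.strip()
    resolved = defaultdict(list)
    for word, meta_id, path in raw:
        name = id_to_name.get(meta_id, "id %d" % meta_id)
        resolved[(name, word)].append(path)
    return dict(resolved)


resolved = postings_by_metaname(SAMPLE)
for (name, word), files in sorted(resolved.items()):
    print("%s=%s -> %s" % (name, word, ", ".join(files)))
```

Read this way, the merged index has 'foo' under metaa for file1.html only, and under metab for file2.html, which matches the single hit returned by the metaa=foo search.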

pubs@topaz08 176% swish-e -T index_all -f index.swish-e
..
-----> METANAMES for index.swish-e <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
        swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
            swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
         swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
          swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
         swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*
           swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
         swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
    swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
                metaa : id=10 type= 1  META_INDEX  Rank Bias=  0
                metab : id=11 type= 1  META_INDEX  Rank Bias=  0
                metaa : id=12 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
                metab : id=13 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*


-----> WORD INFO in index index.swish-e <-----

bar
  Meta:11 file1.html Freq:1 Pos/Struct:8/85
  Meta:11 file2.html Freq:1 Pos/Struct:8/85

content
  Meta:1 file1.html Freq:1 Pos/Struct:12/9
  Meta:1 file2.html Freq:1 Pos/Struct:13/9

file1
  Meta:1 file1.html Freq:1 Pos/Struct:2/7

file2
  Meta:1 file2.html Freq:1 Pos/Struct:2/7

foo
  Meta:10 file1.html Freq:1 Pos/Struct:5/85
  Meta:10 file2.html Freq:1 Pos/Struct:5/85

more
  Meta:1 file2.html Freq:1 Pos/Struct:12/9

some
  Meta:1 file1.html Freq:1 Pos/Struct:11/9
  Meta:1 file2.html Freq:1 Pos/Struct:11/9


-----> FILES in index index.swish-e <-----
Dumping File Properties for File Number: 1
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

Dumping File Properties for File Number: 2
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"


Number of File Entries: 2

pubs@topaz08 177% swish-e -T index_all -f fileone.index
..

-----> METANAMES for fileone.index <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
        swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
            swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
         swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
          swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
         swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*
           swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
         swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
    swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
                metab : id=10 type= 1  META_INDEX  Rank Bias=  0
                metaa : id=11 type= 1  META_INDEX  Rank Bias=  0
                metab : id=12 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
                metaa : id=13 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*


-----> WORD INFO in index fileone.index <-----

bar
  Meta:10 file1.html Freq:1 Pos/Struct:8/85

content
  Meta:1 file1.html Freq:1 Pos/Struct:12/9

file1
  Meta:1 file1.html Freq:1 Pos/Struct:2/7

foo
  Meta:11 file1.html Freq:1 Pos/Struct:5/85

some
  Meta:1 file1.html Freq:1 Pos/Struct:11/9


-----> FILES in index fileone.index <-----
Dumping File Properties for File Number: 1
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metab:12 (  3) S: "bar"
                  metaa:13 (  3) S: "foo"


Number of File Entries: 1




Here's the -T index_all for newmerge when the same config was used for 
both the index.swish-e and fileone.index indexes:

-----> METANAMES for newmerge <-----
         swishdefault : id= 1 type= 1  META_INDEX  Rank Bias=  0
        swishreccount : id= 2 type=42  META_INTERNAL META_PROP:NUMBER
            swishrank : id= 3 type=42  META_INTERNAL META_PROP:NUMBER
         swishfilenum : id= 4 type=42  META_INTERNAL META_PROP:NUMBER
          swishdbfile : id= 5 type=38  META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
         swishdocpath : id= 6 type= 6  META_PROP:STRING(case:compare) SortKeyLen: 100  *presorted*
           swishtitle : id= 7 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
         swishdocsize : id= 8 type=10  META_PROP:NUMBER *presorted*
    swishlastmodified : id= 9 type=18  META_PROP:DATE *presorted*
                metaa : id=10 type= 1  META_INDEX  Rank Bias=  0
                metab : id=11 type= 1  META_INDEX  Rank Bias=  0
                metaa : id=12 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*
                metab : id=13 type=70  META_PROP:STRING(case:ignore) SortKeyLen: 100  *presorted*


-----> WORD INFO in index newmerge <-----

bar
  Meta:11 file1.html Freq:1 Pos/Struct:8/85
  Meta:11 file2.html Freq:1 Pos/Struct:8/85

content
  Meta:1 file1.html Freq:1 Pos/Struct:12/9
  Meta:1 file2.html Freq:1 Pos/Struct:13/9

file1
  Meta:1 file1.html Freq:1 Pos/Struct:2/7

file2
  Meta:1 file2.html Freq:1 Pos/Struct:2/7

foo
  Meta:10 file1.html Freq:1 Pos/Struct:5/85
  Meta:10 file2.html Freq:1 Pos/Struct:5/85

more
  Meta:1 file2.html Freq:1 Pos/Struct:12/9

some
  Meta:1 file1.html Freq:1 Pos/Struct:11/9
  Meta:1 file2.html Freq:1 Pos/Struct:11/9


-----> FILES in index newmerge <-----
Dumping File Properties for File Number: 1
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file1.html"
             swishtitle: 7 (  5) S: "file1"
           swishdocsize: 8 (  4) N: "148"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:12"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

Dumping File Properties for File Number: 2
  (No Properties)

ReadAllDocProperties:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"

ReadSingleDocPropertiesFromDisk:
           swishdocpath: 6 ( 10) S: "file2.html"
             swishtitle: 7 (  5) S: "file2"
           swishdocsize: 8 (  4) N: "153"
      swishlastmodified: 9 (  4) D: "2004-10-14 10:42:34"
                  metaa:12 (  3) S: "foo"
                  metab:13 (  3) S: "bar"


Number of File Entries: 2
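
One way to picture the difference between the two merges is a toy model in Python. To be clear, I have not read the swish-e source, so this is only a guess at a mechanism that would produce the dumps above, not a description of what the merge code actually does. The dictionaries, the search and merge functions, and the rule of skipping files already present in the first index are all assumptions made for illustration.

```python
# Toy index: a metaname -> id table plus postings as (word, meta id, file).
# Hypothetical structures for illustration only, not swish-e internals.
fileone = {
    "metanames": {"swishdefault": 1, "metab": 10, "metaa": 11},
    "postings": [("bar", 10, "file1.html"), ("foo", 11, "file1.html")],
}
both = {
    "metanames": {"swishdefault": 1, "metaa": 10, "metab": 11},
    "postings": [
        ("bar", 11, "file1.html"), ("bar", 11, "file2.html"),
        ("foo", 10, "file1.html"), ("foo", 10, "file2.html"),
    ],
}


def search(index, metaname, word):
    """Files where `word` is stored under the id that `metaname` maps to."""
    meta_id = index["metanames"][metaname]
    return sorted({f for (w, m, f) in index["postings"]
                   if w == word and m == meta_id})


def merge(first, second, remap):
    """Merge two toy indexes, keeping the first index's metaname table.

    With remap=False the second index's meta ids are copied unchanged.
    With remap=True they are translated by metaname into the first table.
    """
    id_to_name = {i: n for n, i in second["metanames"].items()}
    seen = {f for (_, _, f) in first["postings"]}
    merged = list(first["postings"])
    for word, meta_id, path in second["postings"]:
        if path in seen:
            continue  # assumed: a file already in the first index is skipped
        if remap:
            meta_id = first["metanames"][id_to_name[meta_id]]
        merged.append((word, meta_id, path))
    return {"metanames": dict(first["metanames"]), "postings": merged}


print(search(merge(fileone, both, remap=False), "metaa", "foo"))
print(search(merge(fileone, both, remap=True), "metaa", "foo"))
```

Without remapping, the toy merge gives the same WORD INFO numbers as the reversed-config dump (foo is Meta:11 for file1.html and Meta:10 for file2.html), and the metaa=foo search finds only file1.html. With remapping by name, both files come back. When the two configs list the metanames in the same order, the id tables are identical and the two variants behave the same, which fits the second newmerge dump.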

-- 
Peter Karman . http://www.cray.com/craydoc/ . karman(at)not-real.cray.com
"I love deadlines. I love the whooshing sound they make as they go by."
         - Douglas Adams
Received on Thu Oct 14 09:19:51 2004