Odd behaviour indexing files...

From: Gordon Jessop <gjessop(at)not-real.advansis.com>
Date: Tue Jan 29 2002 - 03:30:44 GMT
This is rather odd:

I just downloaded and installed swish-e on Linux.  The install went without
a hitch.

I tested it by indexing one directory that contained nine small files.  No
words from any of them were indexed, although their content was distinct and
contained varied, unique words.  The output is as follows:

    $ swish-e -c search.conf
    Indexing Data Source: "File-System"
    Indexing /path/to/test-dir..

    Checking dir "/path/to/test-dir"...

    In dir "/path/to/test-dir":
      000002 (42 words)
      000003 (42 words)
      000004 (42 words)
      000005 (8 words)
      000006 (6 words)
      000007 (26 words)
      000008 (188 words)
      ecommerce.html (246 words)
      index.html (188 words)

    Removing very common words...
    343 words removed.
    7 words removed not in common words array:
    i, 1, 5, -1, e, 7, 14,
    Writing main index...
    Computing hash table ...
    Writing header ...
    Writing index entries ...
    Writing stopwords ...
    no unique words indexed.
    Writing file index...
    Writing file list ...
    Writing file offsets ...
    Writing MetaNames ...
    Writing offsets (2)...
    9 files indexed.


No words were indexed, as you can see, and any search returns the following:

    $ swish-e -w "famous" -f /path/to/search.index
    # Swish-e format 2.0
    #
    # Name: (no name)
    # Saved as: search.index
    # Counts: 20 words
    # Indexed on: 28/01/2002 20:03:02 EST
    # Description: (no description)
    # Pointer: (no pointer)
    # Maintained by: (no maintainer)
    # DocumentProperties: Enabled
    # Stemming Applied: 0
    # Soundex Applied: 0
    # WordCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # MinWordLimit: 3
    # MaxWordLimit: 15
    # BeginCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # EndCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # IgnoreFirstChar: '(
    # IgnoreLastChar: '),.;
    # SWISH format 2.0
    err: the index file(s) is empty


THE REALLY ODD PART:

After testing, testing, and testing (permissions, file extensions, the
MinWordLimit/MaxWordLimit parameters in config.h), I went back to the
original conf file that produced the above results.  This time I simply
added more files to the test directory.  Lo and behold, indexing occurs:


    $ swish-e -c search.conf
    Indexing Data Source: "File-System"
    Indexing /path/to/test-dir..

    Checking dir "/path/to/test-dir"...

    In dir "/path/to/test-dir":
      000002 (42 words)
      000003 (42 words)
      000004 (42 words)
      000005 (8 words)
      000006 (6 words)
      000007 (26 words)
      000008 (188 words)
      foo.html (188 words)
      bar.html (143 words)
      baz.html (151 words)
      bling.html (162 words)
      blang.html (143 words)
      blong.html (249 words)
      ting.html (245 words)
      tang.html (144 words)
      tong.html (246 words)
      all.html (246 words)
      your.html (281 words)
      base.html (188 words)
      are.html (216 words)
      belong.html (214 words)
      to.html (217 words)
      us.html (28 words)

    Removing very common words...
    351 words removed.
    15 words removed not in common words array:
    i, 1, 5, -1, e, 7, 14, ad, 4, &, 3, s, r, o, 76,
    Writing main index...
    Computing hash table ...
    Writing header ...
    Writing index entries ...
    Writing stopwords ...
    93 unique words indexed.
    Writing file index...
    Writing file list ...
    Writing file offsets ...
    Writing MetaNames ...
    Writing offsets (2)...
    23 files indexed.
    Running time: Less than a second.
    Indexing done!
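
To compare the two runs, the summary counts can be pulled out of the indexer
output.  The following is a minimal Python sketch, not part of swish-e: the
function name `summarize_index_run`, the dictionary keys, and the regular
expressions are assumptions based only on the output lines shown above, and
the sample text is an abbreviated excerpt of the first run.

```python
import re

# Abbreviated excerpt of the first run's output (only two of the nine
# per-file lines are kept, so "words_seen" covers just those two files).
FIRST_RUN_EXCERPT = """\
In dir "/path/to/test-dir":
  000005 (8 words)
  000006 (6 words)

Removing very common words...
343 words removed.
7 words removed not in common words array:
i, 1, 5, -1, e, 7, 14,
no unique words indexed.
9 files indexed.
"""


def summarize_index_run(output: str) -> dict:
    """Extract the summary counts from one run of indexer output."""
    summary = {}

    # "343 words removed." (the line ends right after the period, which
    # excludes "7 words removed not in common words array:")
    m = re.search(r"^\s*(\d+) words removed\.\s*$", output, re.MULTILINE)
    if m:
        summary["common_words_removed"] = int(m.group(1))

    # "93 unique words indexed." or "no unique words indexed."
    m = re.search(r"^\s*(\d+|no) unique words indexed\.", output, re.MULTILINE)
    if m:
        summary["unique_words_indexed"] = 0 if m.group(1) == "no" else int(m.group(1))

    # "9 files indexed."
    m = re.search(r"^\s*(\d+) files indexed\.", output, re.MULTILINE)
    if m:
        summary["files_indexed"] = int(m.group(1))

    # Sum of the per-file "(N words)" counts present in the text.
    summary["words_seen"] = sum(int(n) for n in re.findall(r"\((\d+) words\)", output))
    return summary


if __name__ == "__main__":
    print(summarize_index_run(FIRST_RUN_EXCERPT))
```

Running the same function over the full output of each run puts the "words
removed", "unique words indexed", and "files indexed" figures next to each
other, which makes the difference between the 9-file and 23-file runs easier
to see.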



The original files, whose words were not indexed the first time, now appear
to be indexed, as shown by a search for a word that occurs in only one of
them:



    $ swish-e -w "famous" -f /path/to/search.index
    # Swish-e format 2.0
    #
    # Name: (no name)
    # Saved as: search.index
    # Counts: 93 words, 23 files
    # Indexed on: 28/01/2002 20:25:29 EST
    # Description: (no description)
    # Pointer: (no pointer)
    # Maintained by: (no maintainer)
    # DocumentProperties: Enabled
    # Stemming Applied: 0
    # Soundex Applied: 0
    # WordCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # MinWordLimit: 3
    # MaxWordLimit: 15
    # BeginCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # EndCharacters: &'-0123456789@\_abcdefghijklmnopqrstuvwxyz
    # IgnoreFirstChar: '(
    # IgnoreLastChar: '),.;
    # SWISH format 2.0
    # Search words: famous
    # Number of hits: 1
    1000 /path/to/test_dir/000005 "000005" 228


Any ideas?  Is there some sort of "minimum number of files" setting?  Thanks.

--
Advansis: http://www.advansis.com/
Received on Tue Jan 29 03:31:13 2002