any chance you compiled swish-e *without* libxml2 support?
try reindexing with:
swish-e -v 9 -W 3 -c index.cfg
and see which parser is being used. I see HTML2 by default.
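
Side note on why flavor shows up regardless: as far as I can tell, the ExtractPath line in the config quoted below builds flavor from the file path alone, so it likely does not involve the HTML parser at all. Here is a minimal sketch of that substitution in Python. It assumes Python's re module behaves like swish-e's regex handling for this simple pattern, and the function name extract_flavor is made up for illustration.

```python
import re

# Python stand-in for the config line:
#   ExtractPath flavor regex !test/doc-(normal|strong|href)/.*$!$1!
# (assumption: for this pattern, re matches the same way swish-e does)
PATTERN = re.compile(r"test/doc-(normal|strong|href)/.*$")


def extract_flavor(path):
    """Replace the matched part of the path with capture group 1.

    Returns None when the pattern does not match the path.
    """
    result, count = PATTERN.subn(r"\1", path, count=1)
    return result if count else None


# The three paths from the quoted test directory.
for p in (
    "test/doc-href/docswith-ahref.html",
    "test/doc-normal/docsthatarenormal.html",
    "test/doc-strong/docswith-strong.html",
):
    print(p, "->", extract_flavor(p))
```

For the three test files this gives href, normal and strong, which matches the flavor values in the -T INDEX_ALL dump.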
On 10/25/2007 10:28 AM, josh@relativelysane.com wrote:
>> On 10/25/2007 10:07 AM, josh@relativelysane.com wrote:
>>
>>> The weird thing is that it's grabbing and populating flavor, and I know that's
>>> from the PropertyNames line because when I remove it from there, flavor isn't
>>> in the dump like the one above.
>> can you copy/paste what your config and example docs look like so we can try
>> and duplicate what you are seeing?
>>
>> --
>> Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
>>
>
> Sure, they are literally identical to what you used in your example (with the exception of the IndexDir field in my config). Full dumps of the files, the indexing status, the -T INDEX_ALL, and the search query are below.
>
> [josh@josh]# cat index.cfg
> IndexDir test
> ExtractPath flavor regex !test/doc-(normal|strong|href)/.*$!$1!
> PropertyNames flavor strong a
>
> [josh@josh test]# ls
> doc-href doc-normal doc-strong
>
> [josh@josh doc-href]# cat docswith-ahref.html
> <html>
> <head><title>real title</title></head>
> <body><a href="bar">title i want</a></body>
> </html>
>
> [josh@josh doc-normal]# cat docsthatarenormal.html
> <html>
> <head><title>real title is the title i want</title></head>
> <body><a href="bar">link text</a><strong>strong text</strong> blah </body>
> </html>
>
> [josh@josh doc-strong]# cat docswith-strong.html
> <html>
> <head><title>real title</title></head>
> <body><strong>title I want</strong></body>
> </html>
>
>
> [josh@josh]# swish-e -c index.cfg
> Indexing Data Source: "File-System"
> Indexing "test"
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 12 words alphabetically
> Writing header ...
> Writing index entries ...
> Writing word text: Complete
> Writing word hash: Complete
> Writing word data: Complete
> 12 unique words indexed.
> 7 properties sorted.
> 3 files indexed. 344 total bytes. 25 total words.
> Elapsed time: 00:00:00 CPU time: 00:00:00
> Indexing done!
>
>
> [josh@josh]# swish-e -T INDEX_ALL
> # Name:
> # Saved as: index.swish-e
> # Total Words: 12
> # Total Files: 3
> # Removed Files: 0
> # Total Word Pos: 25
> # Removed Word Pos: 0
> # Indexed on: 2007-10-25 11:24:19 EDT
> # Description:
> # Pointer:
> # Maintained by:
> # MinWordLimit: 1
> # MaxWordLimit: 40
> # WordCharacters: 0123456789abcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
> # BeginCharacters: 0123456789abcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
> # EndCharacters: 0123456789abcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
> # IgnoreFirstChar:
> # IgnoreLastChar:
> # StopWords:
> # BuzzWords:
> # Stemming Applied: 0
> # Soundex Applied: 0
> # Fuzzy Mode: None
> # IgnoreTotalWordCountWhenRanking: 1
>
>
> -----> METANAMES for index.swish-e <-----
> swishdefault : id= 1 type= 1 META_INDEX Rank Bias= 0
> swishreccount : id= 2 type=42 META_INTERNAL META_PROP:NUMBER
> swishrank : id= 3 type=42 META_INTERNAL META_PROP:NUMBER
> swishfilenum : id= 4 type=42 META_INTERNAL META_PROP:NUMBER
> swishdbfile : id= 5 type=38 META_INTERNAL META_PROP:STRING(case:compare) SortKeyLen: 100
> swishdocpath : id= 6 type= 6 META_PROP:STRING(case:compare) SortKeyLen: 100 *presorted*
> swishtitle : id= 7 type=70 META_PROP:STRING(case:ignore) SortKeyLen: 100 *presorted*
> swishdocsize : id= 8 type=10 META_PROP:NUMBER *presorted*
> swishlastmodified : id= 9 type=18 META_PROP:DATE *presorted*
> flavor : id=10 type= 1 META_INDEX Rank Bias= 0
> flavor : id=11 type=70 META_PROP:STRING(case:ignore) SortKeyLen: 100 *presorted*
> strong : id=12 type=70 META_PROP:STRING(case:ignore) SortKeyLen: 100 *presorted*
> a : id=13 type=70 META_PROP:STRING(case:ignore) SortKeyLen: 100 *presorted*
>
>
> -----> WORD INFO in index index.swish-e <-----
>
> blah
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:12/9
>
> href
> Meta:10 test/doc-href/docswith-ahref.html Freq:1 Pos/Struct:1/1
>
> i
> Meta:1 test/doc-href/docswith-ahref.html Freq:1 Pos/Struct:4/9
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:6/7
> Meta:1 test/doc-strong/docswith-strong.html Freq:1 Pos/Struct:4/49
>
> is
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:3/7
>
> link
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:8/9
>
> normal
> Meta:10 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:1/1
>
> real
> Meta:1 test/doc-href/docswith-ahref.html Freq:1 Pos/Struct:1/7
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:1/7
> Meta:1 test/doc-strong/docswith-strong.html Freq:1 Pos/Struct:1/7
>
> strong
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:10/49
> Meta:10 test/doc-strong/docswith-strong.html Freq:1 Pos/Struct:1/1
>
> text
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:2 Pos/Struct:9/9,11/49
>
> the
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:4/7
>
> title
> Meta:1 test/doc-href/docswith-ahref.html Freq:2 Pos/Struct:2/7,3/9
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:2 Pos/Struct:2/7,5/7
> Meta:1 test/doc-strong/docswith-strong.html Freq:2 Pos/Struct:2/7,3/49
>
> want
> Meta:1 test/doc-href/docswith-ahref.html Freq:1 Pos/Struct:5/9
> Meta:1 test/doc-normal/docsthatarenormal.html Freq:1 Pos/Struct:7/7
> Meta:1 test/doc-strong/docswith-strong.html Freq:1 Pos/Struct:5/49
>
>
> -----> FILES in index index.swish-e <-----
> Dumping File Properties for File Number: 1
> (No Properties)
>
> ReadAllDocProperties:
> swishdocpath: 6 ( 33) S: "test/doc-href/docswith-ahref.html"
> swishtitle: 7 ( 10) S: "real title"
> swishdocsize: 8 ( 4) N: "98"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:21:41 EDT"
> flavor:11 ( 4) S: "href"
>
> ReadSingleDocPropertiesFromDisk:
> swishdocpath: 6 ( 33) S: "test/doc-href/docswith-ahref.html"
> swishtitle: 7 ( 10) S: "real title"
> swishdocsize: 8 ( 4) N: "98"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:21:41 EDT"
> flavor:11 ( 4) S: "href"
>
> Dumping File Properties for File Number: 2
> (No Properties)
>
> ReadAllDocProperties:
> swishdocpath: 6 ( 38) S: "test/doc-normal/docsthatarenormal.html"
> swishtitle: 7 ( 30) S: "real title is the title i want"
> swishdocsize: 8 ( 4) N: "149"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:22:43 EDT"
> flavor:11 ( 6) S: "normal"
>
> ReadSingleDocPropertiesFromDisk:
> swishdocpath: 6 ( 38) S: "test/doc-normal/docsthatarenormal.html"
> swishtitle: 7 ( 30) S: "real title is the title i want"
> swishdocsize: 8 ( 4) N: "149"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:22:43 EDT"
> flavor:11 ( 6) S: "normal"
>
> Dumping File Properties for File Number: 3
> (No Properties)
>
> ReadAllDocProperties:
> swishdocpath: 6 ( 36) S: "test/doc-strong/docswith-strong.html"
> swishtitle: 7 ( 10) S: "real title"
> swishdocsize: 8 ( 4) N: "97"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:23:35 EDT"
> flavor:11 ( 6) S: "strong"
>
> ReadSingleDocPropertiesFromDisk:
> swishdocpath: 6 ( 36) S: "test/doc-strong/docswith-strong.html"
> swishtitle: 7 ( 10) S: "real title"
> swishdocsize: 8 ( 4) N: "97"
> swishlastmodified: 9 ( 4) D: "2007-10-25 11:23:35 EDT"
> flavor:11 ( 6) S: "strong"
>
>
> [josh@josh]# swish-e -w title AND flavor=strong -x '"<strong>" "<swishtitle>" "<flavor>"\n'
> # SWISH format: 2.4.5
> # Search words: title AND flavor=strong
> # Removed stopwords:
> # Number of hits: 1
> # Search time: 0.000 seconds
> # Run time: 0.009 seconds
> "" "real title" "strong"
>
>
>
>
> josh
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
--
Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
Received on Thu Oct 25 11:38:03 2007