Hi,
I'm having a strange problem with particular words not being indexed
correctly in an HTML document. Our documents are retrieved from a
database using a Perl script and fed to "swish-e -S prog -i stdin" (in a
single stream, with documents separated by Path-Name lines, etc.). In
this example, the offending words are contained in a <table> written out
in one very long line (blame our CMS for that). It seems that swish-e,
in stripping the HTML tags, ends up mashing together words that appear
on opposite sides of the string "</td><td>". For example, in a line
containing this snippet:
...loading dock)<br/></td></tr><tr><td> H5</td><td>DCL Hallway...
neither "h5" nor "dcl" shows up as an indexed word; "h5dcl" does
instead. Strangely, if I save the document source to a file and index
it with "swish-e -i file.html", "h5" and "dcl" are correctly indexed as
separate words. I've made sure that our Perl script isn't doing
anything funny to the HTML. I've also tried increasing MaxWordLimit (in
case those terribly long lines were the culprit).
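For anyone unfamiliar with the -S prog input format, here is a rough
sketch of what each record in the stream looks like. It is written in
Python for illustration and is not our actual Perl script; prog_record
and the example path are made-up names. The Document-Type header is
optional, and as far as I understand the docs it is the way to tell
swish-e which parser to use for a given record.

```python
def prog_record(path, html, doc_type=None):
    """Format one document as a record for `swish-e -S prog -i stdin`.

    Hypothetical helper: headers first, then a blank line, then the body.
    """
    body = html.encode("utf-8")
    headers = [
        f"Path-Name: {path}",
        # Content-Length must be the byte count of the body, not the
        # character count, or swish-e will mis-read the stream.
        f"Content-Length: {len(body)}",
    ]
    if doc_type is not None:
        # Optional; e.g. "HTML*" requests the libxml2-based HTML parser.
        headers.append(f"Document-Type: {doc_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


if __name__ == "__main__":
    import sys

    # Example path and content are placeholders, not real data.
    record = prog_record("docs/example.html", "<td>H5</td><td>DCL Hallway</td>")
    sys.stdout.buffer.write(record)
```

Piping the output of something like this into "swish-e -S prog -i stdin"
should exercise the same code path as our script does.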
Here's my swish.cfg, with some MetaName* and PropertyName* directives
stripped out for brevity:
HTMLLinksMetaName links
ImageLinksMetaName images
IndexAltTagMetaName as-text
FuzzyIndexingMode Stemming_en2
IgnoreTotalWordCountWhenRanking yes
TranslateCharacters :ascii7:
MaxWordLimit 50
Here are snippets from "-T PARSED_WORDS INDEXED_WORDS" when indexing
using "-S prog -i stdin" and "-i file.html", respectively:
---
./spew_documents.pl | swish-e -f index.file -S prog -i stdin \
    -c ~/etc/swish.cfg -T PARSED_WORDS INDEXED_WORDS | less
White-space found word 'dock)H5DCL'
Adding:[120:swishdefault(1)] 'dock' Pos:289 Stuct:0x1 ( FILE )
Adding:[120:details(13)] 'dock' Pos:289 Stuct:0x1 ( FILE )
Adding:[120:swishdefault(1)] 'h5dcl' Pos:290 Stuct:0x1 ( FILE )
Adding:[120:details(13)] 'h5dcl' Pos:290 Stuct:0x1 ( FILE )
---
---
swish-e -i file.html -c ~/etc/swish.cfg -T PARSED_WORDS INDEXED_WORDS | less
White-space found word 'dock)'
Adding:[1:swishdefault(1)] 'dock' Pos:508 Stuct:0x89 ( META BODY FILE )
Adding:[1:details(13)] 'dock' Pos:508 Stuct:0x89 ( META BODY FILE )
White-space found word '<A0>H5'
Adding:[1:swishdefault(1)] 'h5' Pos:513 Stuct:0x89 ( META BODY FILE )
Adding:[1:details(13)] 'h5' Pos:513 Stuct:0x89 ( META BODY FILE )
White-space found word 'DCL'
Adding:[1:swishdefault(1)] 'dcl' Pos:516 Stuct:0x89 ( META BODY FILE )
Adding:[1:details(13)] 'dcl' Pos:516 Stuct:0x89 ( META BODY FILE )
---
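To make the symptom concrete, here is a toy illustration of the
difference between the two traces. This is NOT swish-e's parser, just a
regex sketch in Python on a simplified version of the snippet; the
strip_tags helper is a made-up name. Dropping tags outright glues the
cell contents together, while replacing each tag with a space keeps
them apart.

```python
import re

# Simplified stand-in for the table row in question.
snippet = "<tr><td>H5</td><td>DCL Hallway</td></tr>"

TAG = re.compile(r"<[^>]+>")


def strip_tags(text, replacement):
    """Replace every tag with `replacement` (naive; illustration only)."""
    return TAG.sub(replacement, text)


# Tags removed with nothing in their place: adjacent cells run together.
glued = strip_tags(snippet, "").split()

# Tags replaced by a space: each cell stays a separate word.
spaced = strip_tags(snippet, " ").split()

print(glued)   # ['H5DCL', 'Hallway']
print(spaced)  # ['H5', 'DCL', 'Hallway']
```

The "-S prog" trace looks like the first case and the "-i file.html"
trace looks like the second.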
Any ideas?
Thanks,
Matt Stanislawski
Received on Tue Mar 20 17:15:40 2007