When I index the HTML document listed at the bottom of this message by running:
/opt/swish-e/bin/swish-e -T PARSED_WORDS -v 3 -i blah.html -f blah.idx
Swish-e's output is:
=== START OUTPUT ===
Indexing Data Source: "File-System"
Indexing "blah.html"
Checking file "blah.html"...
blah.html - Using DEFAULT (HTML2) parser - White-space found word 'December'
White-space found word '2003'
White-space found word 'PSG'
White-space found word 'Playbook'
White-space found word 'product_families=Monitors,Desktop'
White-space found word 'PCs,Desktop'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Handheld'
White-space found word 'PCs,Handheld'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Mobile'
White-space found word 'PCs,Notebook'
White-space found word 'PCs'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Tablet'
White-space found word 'PCs,Thin'
White-space found word 'Clients,Thin'
White-space found word 'Clients'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories,Windows'
White-space found word 'NT'
White-space found word 'Workstations,Windows'
White-space found word 'Workstations,Workstations'
White-space found word 'Options'
White-space found word 'and'
White-space found word 'Accessories'
White-space found word 'product_lines='
White-space found word 'marketing_programs='
(50 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 24 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
24 unique words indexed.
4 properties sorted.
1 file indexed. 457 total bytes. 50 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
=== END OUTPUT ===
Why is Swish-e finding the "words" listed above, for example,
'product_families=Monitors,Desktop'? Neither '_' nor '=' is in WORDCHARS,
so those strings should be getting broken into component words, shouldn't they?
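
As a rough illustration of the splitting I would expect, here is a minimal Python sketch. It is not Swish-e's code: the function names are hypothetical, and the alphanumeric-only WORDCHARS set is an assumption standing in for the real setting. It splits on white space first, then breaks each token wherever a non-word character appears:

```python
import string

# Assumption: an alphanumeric-only word-character set, used purely for
# illustration. It is NOT taken from Swish-e's defaults or any real config.
WORDCHARS = set(string.ascii_letters + string.digits)


def split_on_wordchars(token, wordchars=WORDCHARS):
    """Break one white-space-delimited token into runs of word characters."""
    words, current = [], []
    for ch in token:
        if ch in wordchars:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def tokenize(text):
    """Stage 1: split on white space. Stage 2: split each token on non-word characters."""
    return [w for token in text.split() for w in split_on_wordchars(token)]


print(split_on_wordchars("product_families=Monitors,Desktop"))
```

Under that assumed character set, the first token from the log above would come out as product, families, Monitors and Desktop rather than as a single "word".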
Swish-e version is: SWISH-E 2.4.0, on HP-UX 11.0.
Thanks for any insight...
Cheers,
David
=== START DOC CONTENT ===
<html>
<head>
<title>December 2003 PSG Playbook</title>
</head>
<body>
<pre>
product_families=Monitors,Desktop PCs,Desktop PCs Options and
Accessories,Handheld PCs,Handheld PCs Options and Accessories,Mobile
PCs,Notebook PCs Options and Accessories,Tablet PCs,Thin Clients,Thin
Clients Options and Accessories,Windows NT Workstations,Windows
Workstations,Workstations Options and Accessories
product_lines=
marketing_programs=
</pre>
</body>
</html>
=== END DOC CONTENT ===
Received on Thu Dec 18 05:52:15 2003