Conversion of HTML entities and quotes in words

From: Bill Meier <bill(at)>
Date: Sat Apr 28 2001 - 15:20:43 GMT
I have two unrelated questions; both apply to SWISH-E 2.1-dev-20.

First, I noticed that numeric html entities, like &#147; and &#148; are not 
getting converted to the corresponding characters. I thought V2 was 
supposed to handle that. However, after digging in the sources, I notice 
there is an explicit table of numeric entities to convert, and of course 
147 and 148 are not in that table.

I would think that *any* numeric entity should always be converted to its 
character value (if ConvertHTMLEntities is on, of course). Why are just "some" 
&#nnn; values listed in the entities table? I would think that the table 
should not have any of those values, and all numeric values converted. 
Seems like a bug/oversight to me...
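
To illustrate what I mean by converting all numeric values, here is a minimal 
sketch in C. It is not the SWISH-E code: the function name 
decode_numeric_entities is made up for this example, and it assumes a 
single-byte character set, so only values 1 through 255 are converted and 
anything else is left as is.

```c
#include <ctype.h>
#include <stdlib.h>

/* Decode every decimal numeric entity (&#nnn;) in the string, in place.
 * Hypothetical helper for illustration, not the SWISH-E routine.
 * Assumes a single-byte character set: values 1..255 become one byte,
 * anything else (out of range, or missing the ';') is copied unchanged. */
static void decode_numeric_entities(char *s)
{
    char *out = s;   /* write position; never runs ahead of the read position */

    while (*s) {
        if (s[0] == '&' && s[1] == '#' && isdigit((unsigned char)s[2])) {
            char *end;
            long value = strtol(s + 2, &end, 10);

            if (*end == ';' && value > 0 && value <= 255) {
                *out++ = (char)value;
                s = end + 1;     /* skip past the ';' */
                continue;
            }
        }
        *out++ = *s++;           /* ordinary byte: copy as is */
    }
    *out = '\0';
}
```

With something like this, &#147; and &#148; would come out as bytes 147 and 
148 with no table lookup, and the entities table would only need the named 
entities.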

Second, I noticed that "printer's" quotes (smart quotes, whatever...) 0147 “ 
and 0148 ” are included in the various "word" entities character lists, 
while the plain quote " is not...

This means items enclosed in plain quotes are indexed as the word in the 
quotes (without the quotes), but items enclosed in the printer's quotes 
(0147 and 0148) are indexed *including* the quotes! (The same is true for the 
printer's versions of the single quotes.)

Seems to me that all quoted strings, regardless of which style of quote is 
used, should be handled the same way? Why are 0147 and 0148 included in the 
word list of characters?
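
Here is a small sketch in C of the effect I am describing. It is only a toy 
tokenizer, not how SWISH-E actually splits words: first_word and its 
parameters are invented for this example, and a "word" is simply taken to be 
a run of bytes that all appear in the given character list.

```c
#include <stddef.h>
#include <string.h>

/* Copy the first "word" found in text into word (at most size-1 bytes,
 * size must be at least 1) and return its length. A word is a maximal
 * run of bytes that all appear in wordchars.
 * Hypothetical helper for illustration, not the SWISH-E tokenizer. */
static size_t first_word(const char *text, const char *wordchars,
                         char *word, size_t size)
{
    size_t n = 0;

    /* Skip leading bytes that are not word characters. */
    while (*text && !strchr(wordchars, *text))
        text++;

    /* Collect bytes for as long as they are word characters. */
    while (*text && strchr(wordchars, *text) && n + 1 < size)
        word[n++] = *text++;

    word[n] = '\0';
    return n;
}
```

With only a-z in the character list, a plain-quoted "index" yields the word 
index. Add bytes 147 and 148 to the list and the same word in printer's 
quotes comes back with both quote bytes attached, so a search for the bare 
word will not match it.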

I'd be glad to fix and test solutions to either/both of these problems. The 
latter of course can probably be solved by defining your own word entity 
values in the .conf file, but that is ugly when (I believe) those 
characters shouldn't be part of the word set in the first place!


Bill Meier
Received on Sat Apr 28 15:21:15 2001