I have two unrelated questions; both apply to SWISH-E 2.1-dev-20.
First, I noticed that numeric HTML entities, like &#147; and &#148;, are not
getting converted to the corresponding characters. I thought V2 was
supposed to handle that. However, after digging in the sources, I notice
there is an explicit table of numeric entities to convert, and of course
147 and 148 are not in that table.
I would think that *any* numeric entity should always be converted to its
character value (if ConvertHTMLEntities is enabled, of course). Why are just "some"
&#nnn; values listed in the entities table? I would think that the table
should not have any of those values, and all numeric values converted.
Seems like a bug/oversight to me...
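To make the idea concrete, here is a minimal sketch of table-free conversion. It is written in Python for brevity and is not SWISH-E's actual C code; the function name convert_numeric_entities is hypothetical. It also assumes that values 0-255 should be read as Windows-1252 bytes, which is the code page where 147 and 148 are the printer's quotes. Anything it cannot map is left untouched.

```python
import re

# Matches decimal numeric character references such as &#147;
_NUMERIC_ENTITY = re.compile(r"&#([0-9]+);")


def convert_numeric_entities(text, encoding="cp1252"):
    """Replace every decimal &#nnn; reference with its character.

    Assumption (not taken from SWISH-E): values 0-255 are interpreted
    as single bytes in `encoding`. Values outside that range, or bytes
    the code page leaves undefined, are returned unchanged.
    """
    def _replace(match):
        code = int(match.group(1))
        if code > 255:
            return match.group(0)  # out of single-byte range: leave as-is
        try:
            return bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            return match.group(0)  # undefined in this code page: leave as-is

    return _NUMERIC_ENTITY.sub(_replace, text)
```

With this approach &#147; and &#148; are converted the same way as any other numeric value, so no per-entity table entry is needed.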
Second, I noticed that "printer's" quotes (smart quotes, whatever...) 0147 “
and 0148 ” are included in the various "word" entities character lists,
while the plain quote " is not...
This means items enclosed in plain quotes are indexed as the word in the
quotes (without the quotes), but items enclosed in the printer's quotes
(0147 and 0148) are indexed *including* the quotes! (The same is true for the
printer's versions of the single quotes.)
Seems to me that all quoted strings, regardless of which style of quote is
used, should be handled the same way? Why are 0147 and 0148 included in the
word list of characters?
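The effect can be shown with a deliberately simplified model of word splitting. This is a Python sketch, not SWISH-E's tokenizer; index_words and the two character sets are hypothetical names, and the real word lists contain many more characters than the letters and digits assumed here.

```python
import string


def index_words(text, word_chars):
    """Split text into maximal runs of characters found in word_chars."""
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Assumed word set: ASCII letters and digits only (illustrative).
PLAIN_WORD_CHARS = set(string.ascii_letters + string.digits)
# Same set plus the printer's double quotes (0147 and 0148 in Windows-1252).
WITH_SMART_QUOTES = PLAIN_WORD_CHARS | {"\u201c", "\u201d"}
```

Under WITH_SMART_QUOTES, a word in plain quotes is indexed without them, while the same word in printer's quotes is indexed with the quotes attached. Under PLAIN_WORD_CHARS both forms are indexed as the bare word.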
I'd be glad to fix and test solutions to either/both of these problems. The
latter of course can probably be solved by defining your own word entity
values in the .conf file, but that is ugly when (I believe) those
characters shouldn't be part of the word set in the first place!
Thanks,
Bill Meier
Received on Sat Apr 28 15:21:15 2001