On Tue, Aug 12, 2003 at 07:02:51AM -0700, Greg Ford wrote:
> I've looked at the FAQ and done some tests - it seems that
> if those files are in xml or html, libxml2 will convert them to 8859-1.
> But in my tests, Latin characters with accents, e.g. AMACRON (ā),
> are not indexed. I was hoping they would be converted
> to the plain letter (a) - stripping the accents off would make my
> data conveniently searchable.
Sorry, no good solution here.
With ParserWarnLevel set, indexing a test file gives:

    1.txt:1: warning: Failed to convert internal UTF-8 to Latin-1.
    Replacing non ISO-8859-1 char with char ' '
    andes foo ā
              ^

That message is from libxml2.
One option would be to examine the text before converting and do some
text replacement before calling UTF8Toisolat1() (which is a libxml2
function) on the UTF-8 source string.
Or perhaps, when the conversion fails, try to figure out which UTF-8
character caused it and substitute a plain equivalent instead of the
space (ENCODE_ERROR_CHAR) character.
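
A rough sketch of the text-replacement idea in C, not tested against swish-e. It assumes a small hand-built table of code points that have no Latin-1 equivalent; fold_table, fold_lookup() and fold_utf8_accents() are made-up names for illustration, and only a handful of macron vowels are listed. It rewrites a UTF-8 buffer in place, so that whatever is left can be handed to the existing conversion unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: code points outside Latin-1 mapped to a plain letter.
 * Only the macron vowels are listed; a real table would be much larger. */
struct fold_entry { unsigned int cp; char ascii; };

static const struct fold_entry fold_table[] = {
    { 0x0100, 'A' }, { 0x0101, 'a' },   /* A/a with macron */
    { 0x0112, 'E' }, { 0x0113, 'e' },   /* E/e with macron */
    { 0x012A, 'I' }, { 0x012B, 'i' },   /* I/i with macron */
    { 0x014C, 'O' }, { 0x014D, 'o' },   /* O/o with macron */
    { 0x016A, 'U' }, { 0x016B, 'u' },   /* U/u with macron */
};

/* Return the plain replacement for cp, or 0 if the table has none. */
static char fold_lookup(unsigned int cp)
{
    size_t i;

    for (i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].cp == cp)
            return fold_table[i].ascii;
    return 0;
}

/* Rewrite the NUL-terminated UTF-8 string in buf in place, replacing every
 * two-byte sequence found in fold_table with its plain letter. Everything
 * else (ASCII, Latin-1 range characters, longer sequences) is copied as is.
 * Returns the new length in bytes. */
size_t fold_utf8_accents(char *buf)
{
    unsigned char *in, *out;

    assert(buf != NULL);
    in = out = (unsigned char *)buf;

    while (*in) {
        /* Two-byte sequence: 110xxxxx 10xxxxxx */
        if ((in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
            unsigned int cp = ((in[0] & 0x1Fu) << 6) | (in[1] & 0x3Fu);
            char plain = fold_lookup(cp);

            if (plain) {
                *out++ = (unsigned char)plain;  /* output shrinks by one byte */
                in += 2;
                continue;
            }
        }
        *out++ = *in++;
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)buf);
}
```

The output is never longer than the input, which is why the in-place rewrite is safe. Characters that Latin-1 can already represent (e.g. é) are left alone for the normal conversion to handle.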
The next step is to edit parser.c and replace the code that converts
from UTF-8 to latin-1 with a call to iconv, and allow setting character
sets in the swish-e config file.
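
Something along these lines is what the iconv replacement might look like. This is only a sketch using the standard iconv_open()/iconv()/iconv_close() calls; convert_from_utf8() is a made-up name, it is not what parser.c does today, and the target charset string is assumed to come from whatever config setting ends up being added.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated UTF-8 string src into to_charset, writing a
 * NUL-terminated result into out (capacity out_size bytes).
 * Returns 0 on success, -1 if the charset is unknown, a character cannot
 * be converted, or the output buffer is too small. */
int convert_from_utf8(const char *to_charset, const char *src,
                      char *out, size_t out_size)
{
    iconv_t cd;
    char *in_ptr = (char *)src;   /* iconv() wants char **, input is not modified */
    char *out_ptr = out;
    size_t in_left, out_left;
    int ok;

    assert(to_charset != NULL && src != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                /* charset name not supported here */

    in_left = strlen(src);
    out_left = out_size - 1;      /* keep one byte for the terminator */

    ok = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) != (size_t)-1;
    if (ok)                       /* flush any pending shift state */
        ok = iconv(cd, NULL, NULL, &out_ptr, &out_left) != (size_t)-1;

    iconv_close(cd);
    *out_ptr = '\0';              /* terminate whatever was converted */
    return ok ? 0 : -1;
}
```

With GNU iconv a target such as "ISO-8859-1//TRANSLIT" asks for unconvertible characters to be approximated rather than rejected, but that suffix is an extension and what it produces for a given character depends on the implementation and locale, so it would need testing before relying on it to strip accents.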
--
Bill Moseley
moseley@hank.org
Received on Tue Aug 12 16:36:09 2003