ambiguous FAQ answer [Swish-E + 8-bit data]

From: ____Textpert Alert____ <ianf(at)not-real.random.se>
Date: Sun Mar 01 1998 - 21:56:43 GMT
Hello,
  according to the somewhat ambiguous explanation in
  http://sunsite.berkeley.edu/SWISH-E/manual.html

# Can I index 8-bit text? 

#     Yes, if the text uses the HTML equivalents for the ISO-Latin-1
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#     (ISO8859-1) character set.  Upon indexing swish-e will convert
#     all numbered entities it finds (such as &#169;) to named
#     entities (such as &copy;).  To search for words including
#     these codes, type the named entity (if it exists) in place of
#     the 8-bit character.  Swish will also convert entities to
#     ASCII equivalents, so words that might look like this in HTML:
#     resum&eacute; can be searched as this: resume.


  Swish-e can _solely_ be used for indexing files
  containing HTML entities, i.e. the 7-bit equivalents of
  8-bit Latin-1 text; ergo it can be used FOR 7-BIT ONLY,
  *NOT* FOR 8-BIT TEXT FILES.  I'm not quite sure why there
  should be such a restriction, or perhaps my reading of the
  above is all wrong.
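
  To make the FAQ's two conversions concrete, here is a rough
  Python sketch of what the quoted text seems to describe:
  numbered entities rewritten as named ones, and named entities
  folded to plain ASCII.  It is not Swish-e's code; the function
  names are made up for illustration, and the accent-stripping
  via Unicode decomposition is an assumed stand-in for whatever
  table Swish-e really uses.

```python
import re
import unicodedata
from html.entities import codepoint2name, name2codepoint


def numbered_to_named(text):
    """Rewrite &#NNN; as &name; when a named entity exists (hypothetical helper)."""
    def repl(match):
        name = codepoint2name.get(int(match.group(1)))
        # Leave the numbered form alone if no named entity is known.
        return "&%s;" % name if name else match.group(0)
    return re.sub(r"&#(\d+);", repl, text)


def named_to_ascii(text):
    """Replace &name; with an ASCII equivalent where one can be derived."""
    def repl(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            return match.group(0)  # unknown entity: keep as written
        # Assumption: decompose the character and keep only its ASCII part.
        decomposed = unicodedata.normalize("NFKD", chr(codepoint))
        ascii_part = "".join(c for c in decomposed if ord(c) < 128)
        # Entities with no ASCII part (e.g. &copy;) are left untouched.
        return ascii_part or match.group(0)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


if __name__ == "__main__":
    print(numbered_to_named("&#169; 1998"))
    print(named_to_ascii("resum&eacute;"))
```

  Note that both steps operate purely on 7-bit entity text,
  which is what prompts the question about raw 8-bit input.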

  But what about other 8-bit character sets for which there
  are no standardized HTML equivalents, like the ISO-8859-2
  (Latin-2) alphabet?  Would Swish-e index these correctly
  -- also possibly with a custom 'IgnoreWords' directive or
  stopwords.conf file?  If Swish-e cannot be used, does anyone
  know of a suitable equivalent freeware solution for FreeBSD?
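
  One possible workaround, if the indexer really is 7-bit only,
  would be to fold Latin-2 files to ASCII before indexing.  The
  Python sketch below is only an assumption about how such a
  preprocessing step might look; it is not a Swish-e feature, the
  function name and the small substitution table are hypothetical,
  and characters with no obvious ASCII form are simply dropped.

```python
import unicodedata

# Hypothetical table for Latin-2 letters that do not decompose
# into a base letter plus an accent.
SPECIAL = str.maketrans({
    "\u0141": "L",   # L with stroke
    "\u0142": "l",   # l with stroke
    "\u0110": "D",   # D with stroke
    "\u0111": "d",   # d with stroke
    "\u00df": "ss",  # sharp s
})


def fold_latin2_to_ascii(raw):
    """Decode ISO-8859-2 bytes and return a 7-bit ASCII approximation."""
    text = raw.decode("iso-8859-2").translate(SPECIAL)
    # Split accented letters into base letter + combining mark,
    # then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII (e.g. the currency sign) is dropped.
    return stripped.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    sample = "\u0141\u00f3d\u017a".encode("iso-8859-2")
    print(fold_latin2_to_ascii(sample))
```

  Searches would then have to be typed in the same unaccented
  form, much as the FAQ describes for "resume".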


  Please R)eply with Cc: ianf@random.se

  Thanks much in advance.


__Ian
Received on Sun Mar 1 14:04:02 1998