Re: Another HTML entities query

From: Peter Karman <peter(at)>
Date: Fri Jan 05 2007 - 04:36:48 GMT
max thom stahl scribbled on 1/4/07 5:06 PM:
> Ok . . . last month I asked about HTML entities and didn't really have a 
> good chance to tweak about with things. What's going on is that the 
> spider is definitely pulling down metadata from my site with entities 
> like &mdash; and &rsquo; and whatnot unencoded, which means it's UTF-8?

Those entities resolve to code points that can be represented in UTF-8, yes.

> In, I should be able to find a spot to make a call to 
> HTML::Entities::encode_entities to make it so that what gets output to 
> Swish-e  has those entities encoded, right? What I'm getting now is em 
> dashes are, instead of &mdash;, some bizarre-looking character that 
> looks like an `A' with a box around it. Same story with right single 
> quotes, too. . . .
> Is there some way I can do this?

You could pipe the output of through another filter before passing to 

  % | yourfilter | swish-e -S prog -i stdin

I suggest using something like HTML::Entities or Search::Tools::XML to write 

If you use Search::Tools, you can also use Search::Tools::Transliterate to then 
convert your UTF-8 multi-byte characters to their single-byte equivalents, which 
swish-e can deal with.

Something like:


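#!/usr/bin/perl
use strict;
use warnings;
use Search::Tools::XML;
use Search::Tools::Transliterate;

my $xml   = Search::Tools::XML->new;
my $trans = Search::Tools::Transliterate->new;

# read the input line by line, turn entities such as &mdash; into
# characters, then transliterate those to single-byte equivalents
while (<>) {
    print $trans->convert( $xml->unescape( $_ ) );
}

# end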
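
If you would rather go the other way and have the filter put entities back in, as the question about HTML::Entities::encode_entities suggests, something along these lines might work. This is only a sketch: it assumes the input really is UTF-8, that HTML::Entities is installed from CPAN, and the helper name entity_encode_line is made up for illustration. The second argument to encode_entities limits encoding to characters outside printable ASCII, so markup such as < > & passes through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not part of any library): takes one line of raw
# UTF-8 bytes and returns it with every character outside printable ASCII
# (tab, CR and LF excepted) replaced by an entity. Characters that have a
# named entity get the name; the rest become numeric references.
sub entity_encode_line {
    my ($bytes) = @_;
    my $text    = decode( 'UTF-8', $bytes );    # bytes -> Perl characters
    my $encoded = encode_entities( $text, '^\n\r\t\x20-\x7e' );
    return $encoded;
}

# Act as a stdin-to-stdout filter only when run directly as a script.
unless (caller) {
    while ( my $line = <STDIN> ) {
        print entity_encode_line($line);
    }
}
```

It would sit in the same "yourfilter" position in the pipeline above. Like any line-based filter it changes the byte length of the content, so if the upstream program emits a Content-Length header for each document, that value would need to be recomputed as well.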

Peter Karman  .  .  peter(at)
Received on Thu Jan 4 20:36:53 2007