max thom stahl scribbled on 1/4/07 5:06 PM:
> Ok . . . last month I asked about HTML entities and didn't really have a
> good chance to tweak about with things. What's going on is that the
> spider is definitely pulling down metadata from my site with entities
> like — and ’ and whatnot unencoded, which means it's UTF-8?
>
Those entities resolve to code points that can be represented in UTF-8, yes.
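To make that concrete, here is a small sketch using only the core Encode module. It prints the code point and the UTF-8 byte sequence for the two entities mentioned. The %entity table and the utf8_hex helper are names made up for this illustration, not part of spider.pl or swish-e.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Code points that the &mdash; and &rsquo; entities stand for.
my %entity = (
    mdash => "\x{2014}",    # EM DASH
    rsquo => "\x{2019}",    # RIGHT SINGLE QUOTATION MARK
);

# Return the UTF-8 bytes of a character string as space-separated hex.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

for my $name (sort keys %entity) {
    my $char = $entity{$name};
    printf "&%s; U+%04X -> %s\n", $name, ord($char), utf8_hex($char);
}
# prints:
# &mdash; U+2014 -> E2 80 94
# &rsquo; U+2019 -> E2 80 99
```

Each of these characters takes three bytes in UTF-8, so a tool that treats every byte as a separate character will likely show a few odd glyphs where one dash or quote was expected.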
> In spider.pl, I should be able to find a spot to make a call to
> HTML::Entities::encode_entities to make it so that what gets output to
> Swish-e has those entities encoded, right? What I'm getting now is em
> dashes are, instead of —, some bizarre-looking character that
> looks like an `A' with a box around it. Same story with right single
> quotes, too. . . .
>
> Is there some way I can do this?
>
You could pipe the output of spider.pl through another filter before passing it
to swish-e:

    % spider.pl | yourfilter | swish-e -S prog -i stdin

I suggest using something like HTML::Entities or Search::Tools::XML to write
yourfilter.
If you use Search::Tools, you can also use Search::Tools::Transliterate to then
convert your UTF-8 multi-byte characters to their single-byte equivalents, which
swish-e can deal with.
Something like:
#!/usr/bin/perl
use strict;
use warnings;
use Search::Tools::XML;
use Search::Tools::Transliterate;

my $xml   = Search::Tools::XML->new;
my $trans = Search::Tools::Transliterate->new;

# decode the entities on each input line, then transliterate the result
while (<>) {
    print $trans->convert( $xml->unescape($_) );
}
# end
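
One thing to watch with a line-by-line filter: as far as I understand the -S prog input format, each document is preceded by headers that include Content-Length, the size of the body in bytes. A filter that changes the body (or touches the header lines) can leave that number stale. The following is only a sketch of one way to handle it, using core Perl alone. filter_stream and demo_transform are hypothetical names, demo_transform is a two-entity stand-in for the real conversion, and the sample Path-Name URL is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the real conversion; it only knows two entities.
# It must return a byte string so that length() below counts bytes.
sub demo_transform {
    my ($body) = @_;
    $body =~ s/&mdash;/--/g;
    $body =~ s/&rsquo;/'/g;
    return $body;
}

# Copy a -S prog style stream from $in to $out, passing each document body
# through $transform and rewriting Content-Length to the new byte count.
sub filter_stream {
    my ($in, $out, $transform) = @_;
    while (1) {
        my (@headers, $length);
        while (defined(my $line = <$in>)) {
            last if $line =~ /^\r?\n$/;    # blank line ends the headers
            $length = $1 if $line =~ /^Content-Length:\s*(\d+)/i;
            push @headers, $line;
        }
        last unless @headers;              # no more documents
        die "document without Content-Length\n" unless defined $length;

        # Read exactly the number of bytes the header announced.
        my $body = '';
        my $got = read($in, $body, $length);
        die "short read: wanted $length bytes\n"
            unless defined $got && $got == $length;

        $body = $transform->($body);
        for my $header (@headers) {
            $header = 'Content-Length: ' . length($body) . "\n"
                if $header =~ /^Content-Length:/i;
        }
        print {$out} @headers, "\n", $body;
    }
}

# Demonstration on an in-memory sample instead of real spider output.
my $doc    = "<html><title>It&rsquo;s here &mdash; now</title></html>\n";
my $sample = "Path-Name: http://localhost/a.html\n"
           . "Content-Length: " . length($doc) . "\n\n" . $doc;

open my $in, '<', \$sample or die $!;
my $result = '';
open my $out, '>', \$result or die $!;
filter_stream($in, $out, \&demo_transform);
close $out;
print $result;
```

In a real pipeline you would presumably call binmode on STDIN and STDOUT, pass \*STDIN and \*STDOUT instead of the in-memory handles, and replace demo_transform with the unescape and convert calls shown above.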
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Thu Jan 4 20:36:53 2007