Re: Strange behaviour indexing remote site

From: Bill Moseley <moseley(at)>
Date: Mon Jun 06 2005 - 19:00:17 GMT
On Mon, Jun 06, 2005 at 11:37:02AM -0700, Thomas Nyman wrote:
> Hi
> I'm still struggling a bit with my remote indexing. I can index the
> remote machine directory called arkiv, but when I do a search using
> that index I receive hits on the relevant documents and also on
> something called "index of arkiv". What that is I don't know.

Did you look at this?


Those are all links on the /arkiv/ page.

> Parsing of undecoded UTF-8 will give garbage when decoding entities  

That's from HTML::Parser.  I'm not really clear on what it means, or
how to fix it.  The spider, IIRC, uses LWP, which uses HTML::Parser to
extract metadata from the <head> of the document.  That can be
disabled, I believe.

Here's that warning:

Parsing of undecoded UTF-8 will give garbage when decoding entities

    (W) The first chunk parsed appears to contain undecoded UTF-8 and one
    or more argspecs that decode entities are used for the callback
    handlers.

    The result of decoding will be a mix of encoded and decoded characters
    for any entities that expand to characters with code above 127.  This
    is not a good thing.

    The solution is to use Encode::decode_utf8() on the data before
    feeding it to $p->parse().  For $p->parse_file() pass a file that
    has been opened in ":utf8" mode.

    The parser can process raw undecoded UTF-8 sanely if utf8_mode
    is enabled or if the "attr", "@attr" or "dtext" argspecs are avoided.
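For what it's worth, here is a minimal sketch of what decoding before
parsing looks like in a standalone script.  This is not the spider's
code; the decoded_text() helper and the sample string are made up for
illustration.  It assumes the input really is UTF-8 bytes, turns them
into Perl characters with Encode::decode_utf8(), and only then hands
them to a parser that uses the "dtext" argspec.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser;

# Hypothetical helper: returns the entity-decoded text of an HTML
# fragment that arrives as raw UTF-8 bytes.
sub decoded_text {
    my ($raw_bytes) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" is one of the argspecs that decodes entities, so the
        # parser should be fed characters rather than raw bytes.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Decode the bytes to characters first, then parse.
    $p->parse( decode_utf8($raw_bytes) );
    $p->eof;

    return $text;
}

# "\xC3\xA9" is the UTF-8 byte sequence for e-acute.
my $text = decoded_text("<p>caf\xC3\xA9 &eacute;</p>");
```

The other route the warning mentions is to leave the bytes alone and
call $p->utf8_mode(1) on the parser before parsing, in which case the
text comes back as UTF-8 bytes instead of characters.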

The important thing is to check whether you are really indexing what
you need to index.  Index a single file that triggers that warning using
the -T indexed_words feature and make sure everything is indexed.

Bill Moseley

Received on Mon Jun 6 12:00:18 2005