On Mon, Jun 06, 2005 at 11:37:02AM -0700, Thomas Nyman wrote:
> Hi
>
> I'm still struggling a bit with my remote indexing. I can index the
> remote machine directory called arkiv but when I do a search using
> that index I receive hits on the relevant documents but also on
> something called "index of arkiv". What that is I don't know.
Did you look at this?
http://192.168.1.2/arkiv/
> http://192.168.1.2/arkiv/?D=A
> http://192.168.1.2/arkiv/?S=A
> http://192.168.1.2/arkiv/?M=A
> http://192.168.1.2/arkiv/?N=D
> http://192.168.1.2/arkiv/
> http://192.168.1.2/arkiv/?D=A
Those are all links on the /arkiv/ page.
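The ?D=A, ?S=A, ?M=A and ?N=D query strings look like the column-sort
links that Apache puts on an auto-generated directory listing, which
would also explain a page titled "Index of /arkiv". Here is a minimal
sketch of how such URLs could be recognized so they can be skipped. It
is written in Python purely for illustration (the spider itself is
Perl), and the name is_sort_link is hypothetical, not part of swish-e.

```python
from urllib.parse import urlsplit

# Query strings seen in the listing above: a one-letter column
# (N, M, S or D) and a one-letter order (A or D).
SORT_COLUMNS = {"N", "M", "S", "D"}
SORT_ORDERS = {"A", "D"}


def is_sort_link(url: str) -> bool:
    """Return True for a directory URL whose query looks like ?D=A."""
    parts = urlsplit(url)
    if not parts.path.endswith("/"):
        return False  # only directory listings carry these links
    column, sep, order = parts.query.partition("=")
    return sep == "=" and column in SORT_COLUMNS and order in SORT_ORDERS


urls = [
    "http://192.168.1.2/arkiv/",
    "http://192.168.1.2/arkiv/?D=A",
    "http://192.168.1.2/arkiv/?N=D",
]
to_index = [u for u in urls if not is_sort_link(u)]
print(to_index)
```

The bare /arkiv/ URL is left in the list on purpose: the spider still
has to fetch the listing to find the documents, even if you decide not
to index the listing page itself.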
> Parsing of undecoded UTF-8 will give garbage when decoding entities
That's from HTML::Parser. I'm not really clear on what it means, or
how to fix it. The spider, IIRC, uses LWP, which uses HTML::Parser to
extract metadata from the <head> of the document. That can be
disabled, I believe.
Here's that warning:
    Parsing of undecoded UTF-8 will give garbage when decoding entities

    (W) The first chunk parsed appears to contain undecoded UTF-8 and one
    or more argspecs that decode entities are used for the callback
    handlers.

    The result of decoding will be a mix of encoded and decoded characters
    for any entities that expand to characters with code above 127. This
    is not a good thing.

    The solution is to use Encode::decode_utf8() on the data before
    feeding it to $p->parse(). For $p->parse_file() pass a file that
    has been opened in ":utf8" mode.

    The parser can process raw undecoded UTF-8 sanely if utf8_mode
    is enabled or if the "attr", "@attr" or "dtext" argspecs are avoided.
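
To make the "mix of encoded and decoded characters" concrete, here is a
small sketch of the same effect. It is Python, not HTML::Parser:
html.unescape stands in for an entity-decoding argspec, and the sample
string is made up for the illustration.

```python
import html

# Bytes as they might arrive from the server: UTF-8 text plus an entity.
raw = "café &eacute;".encode("utf-8")

# Wrong order: treat every byte as one character, then expand entities.
# The literal e-acute stays as two junk characters, while the entity
# becomes a real e-acute, so the result is a mix of both forms.
undecoded = html.unescape(raw.decode("latin-1"))

# Right order: decode the UTF-8 bytes to characters first, then expand
# entities. Both e-acutes end up as the same single character.
decoded = html.unescape(raw.decode("utf-8"))

print(repr(undecoded))
print(repr(decoded))
```

Whether this matters for your index depends on whether your documents
contain non-ASCII text alongside entities.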
The important thing is to check whether you are really indexing what
you need to index. Index a single file that triggers that warning with
the -T indexed_words option and make sure every word you expect is
indexed.
--
Bill Moseley
moseley@hank.org
Received on Mon Jun 6 12:00:18 2005