On Mon, 31 Mar 2003, Nuno Ferreira wrote:
> I am not the sysadmin of the remote sites. I'll try to speak to them.
> I can test any patch that you want me to try.
You can just make a local copy of spider.pl, so you shouldn't need the
help of the sysadmin.
Then, as long as you are running something like Perl 5.6.1 or newer, look
in spider.pl for:
my $headers = join "\n",
'Path-Name: ' . $uri,
'Content-Length: ' . length $$content,
'';
and replace it with something like:
my $doc_length = do { use bytes; length $$content };
my $headers = join "\n",
'Path-Name: ' . $uri,
'Content-Length: ' . $doc_length,
'';
I suppose you might even be able to just place:
use bytes;
toward the top of spider.pl and it would work, too. But there might be
some other side-effects so the above might be a safer fix for now.
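If you want to see the mismatch for yourself, here is a rough sketch. The sample string is made up for illustration (it is not taken from spider.pl), and it assumes Perl 5.6 or newer so that the bytes pragma is available:

```perl
use strict;
use warnings;

# Hypothetical sample document: chr(400) is a wide character, so Perl
# stores the whole string internally as UTF-8 (two bytes for that char).
my $content = "abc" . chr(400) . "\n";

# Plain length() counts characters.
my $char_length = length $content;

# Inside a "use bytes" scope, length() counts the underlying bytes,
# which is what Content-Length needs to report.
my $byte_length = do { use bytes; length $content };

print "chars=$char_length bytes=$byte_length\n";   # prints "chars=5 bytes=6"
```

A Content-Length header built from the character count would be one byte short for this string, so the trailing \n would be left over and read as the start of the next document's headers.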
>
> Regards,
> Nuno
>
> > -----Original Message-----
> > From: Bill Moseley [mailto:moseley@hank.org]
> > Sent: segunda-feira, 31 de Março de 2003 15:35
> > To: Nuno Ferreira
> > Cc: Multiple recipients of list
> > Subject: Re: [SWISH-E] External program failed to return
> > required headers Path-Name: & Content-Length:
> >
> >
> > On Mon, 31 Mar 2003, Nuno Ferreira wrote:
> >
> > > It starts and it looks like it is doing everything I want, then it
> > > suddenly crashes with:
> > > <SNIP>
> > > Looking at extracted tag '<td background="/images/verao_foo_d.jpg">'
> > > ! Found 0 links in
> > >
> > http://www.somesite.com/catalog/formas.php?PHPSESSID=85c724f87
> > fc7f0e6842
> > > 5e6454bb4e11d
> > >
> > http://www.somesite.com/catalog/detras_loja.php?PHPSESSID=85c7
> > 24f87fc7f0
> > > e68425e6454bb4e11d - Using DEFAULT (HTML2) parser - (565 words)
> > > err: External program failed to return required headers Path-Name: &
> > > Content-Length:
> > > .
> > > </SNIP>
> > >
> > > It always crashes in the same place. If I spider a
> > different site, it
> > > crashes also and always in the same place.
> > > I've found this thread
> > <http://swish-e.org/archive/3817.html> that is
> > > related to my problem but after reading it, I became even
> > more confused
> > > because now I know that I may be looking at the wrong debug
> > line because
> > > of the buffering issues.
> >
> > First, see if this is a possible fix:
> >
> http://swish-e.org/archive/4870.html
>
>
> If you set debug => DEBUG_URL then it will display the URLs as they are
> fetched and before swish gets the document. That should help find the
> exact document where the problem is happening.
>
> But that error "failed to return required headers" is likely due to the
> *previous* document returning the wrong content length. The way extprog
> works is it reads line-by-line to read the headers. Then when it sees a
> blank line (that marks the end of the headers) it reads content-length
> bytes in from the external program and starts over.
>
> If that content length is one byte short, and the last byte of the doc is
> a \n, then when it starts to read the next doc it will see just \n and
> assume that's the end of the headers. But at that point no Content-Length
> or Path-Name header is set, so the program aborts with that error.
>
> I suspect what is happening is that the previous document has a wide char
> that is forcing perl into UTF-8 encoding. spider.pl is using "length" to
> determine the length of the string, but that's the character length, not
> the byte length:
>
> $ perl -MDevel::Peek -e '$x=chr(400);Dump($x);print "len=", length$x, "\n"'
> SV = PV(0x80f6344) at 0x80fd2a4
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x80f9e58 "\306\220"\0
> CUR = 2
> LEN = 3
> len=1
>
> So the length of the string is two bytes, but "length" is returning one.
> That would result in your problem.
>
> I need to find a portable way for use with all versions of Perl to read
> the correct byte length.
>
>
>
--
Bill Moseley moseley@hank.org
Received on Mon Mar 31 21:25:46 2003