Re: [swish-e] multiple Warnings: 'could not be encoded to charset 'ISO-8859-1'

From: Dr Michael Daly
Date: Sat, 17 Mar 2012 01:03:19 +1100 (EST)
I am invoking indexing via
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
********************************************************************************************************************************************************
web_2.conf contents:
 IndexDir spider.pl
 SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config

 IndexOnly .htm .html .txt .doc .pdf .xls

 IndexContents TXT* .txt .xls
  # Otherwise, use the HTML parser
  DefaultContents HTML*
# I added the FileFilter options to web_2.conf only today (Friday)
      	FileFilter .pdf pdftotext   "'%p' -"
	FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
	FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"

  ReplaceRules remove /share/MD0_DATA/server_dir/
  Metanames swishtitle swishdocpath
  StoreDescription TXT 200
  StoreDescription HTML <body> 200
  IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index

****************************************************************************
****************************************************************************
spider.config contents:
@servers = (
    {
	base_url    => 'http://localhost:104/_docs/test3/',
	#base_url    => 'http://localhost:104/_docs/test3/Reception-duties.doc',
	email               => 'swish(at)not-real.user.failed.to.set.email.invalid',
        link_tags           => [qw/ a frame /],
        keep_alive          => 1,
        test_url            => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
        test_response       => $response_sub,
        use_head_requests   => 1,  # Due to the response sub
        filter_content      => $filter_sub,
	debug	=> 'errors, failed, headers, info, links, redirect, skipped, url',

    } );


****************************************************************************
****************************************************************************
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
(All seems well when indexing *only* one file, i.e. with this setting in
spider.config:
base_url    => 'http://localhost:104/_docs/test3/Reception-duties.doc')

Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from
'/share/MD0_DATA/swish-e-files/swish-e-conf/spider.config'

 -- Starting to spider:
http://localhost:104/_docs/test3/Reception-duties.doc --
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties.doc'
+Passed all 1 tests for 'test_url' user supplied function

vvvvvvvvvvvvvvvv HEADERS for
http://localhost:104/_docs/test3/Reception-duties.doc
vvvvvvvvvvvvvvvvvvvvv

---- Request ------
HEAD http://localhost:104/_docs/test3/Reception-duties.doc
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/


---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 12:18:49 GMT
Accept-Ranges: bytes
ETag: "3a2030d-9a00-4bb3e0f6cd4aa"
Server: Apache
Content-Length: 39424
Content-Type: application/msword
Last-Modified: Thu, 15 Mar 2012 01:32:07 GMT
Client-Date: Fri, 16 Mar 2012 12:18:49 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 2

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^


vvvvvvvvvvvvvvvv HEADERS for
http://localhost:104/_docs/test3/Reception-duties.doc
vvvvvvvvvvvvvvvvvvvvv

---- Request ------
GET http://localhost:104/_docs/test3/Reception-duties.doc
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/


---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 12:18:49 GMT
Accept-Ranges: bytes
ETag: "3a2030d-9a00-4bb3e0f6cd4aa"
Server: Apache
Content-Length: 39424
Content-Type: application/msword
Last-Modified: Thu, 15 Mar 2012 01:32:07 GMT
Client-Date: Fri, 16 Mar 2012 12:18:49 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 3

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^

>> +Fetched 0 Cnt: 1 GET 
http://localhost:104/_docs/test3/Reception-duties.doc  200 OK
application/msword 39424 parent: depth:0

Summary for: http://localhost:104/_docs/test3/Reception-duties.doc
Connection: Close:      1  (1.0/sec)
      Total Bytes: 39,424  (39424.0/sec)
       Total Docs:      1  (1.0/sec)
      Unique URLs:      1  (1.0/sec)
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Thursday & Friday AM
          ^
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Organising files for Womens & MensHealth Physio for existing clients and
mak
                               ^
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Thursday & Friday PM
          ^
http://localhost:104/_docs/test3/Reception-duties.doc:43: error:
htmlParseEntityRef: no name

^
http://localhost:104/_docs/test3/Reception-duties.doc:43: error:
htmlParseStartTag: invalid element name

^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name

^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseStartTag: invalid element name

^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name

 ^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name

^
http://localhost:104/_docs/test3/Reception-duties.doc:49: error:
htmlParseEntityRef: no name

^
http://localhost:104/_docs/test3/Reception-duties.doc:49: error:
htmlParseStartTag: invalid element name

^
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 309 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
309 unique words indexed.
5 properties sorted.
1 file indexed.  39,424 total bytes.  589 total words.
Elapsed time: 00:00:02 CPU time: 00:00:01
Indexing done!
[/root] #


****************************************************************************
****************************************************************************
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
(All is NOT well when indexing a directory, i.e. with this setting in
spider.config instead:
base_url    => 'http://localhost:104/_docs/test3/'
Here 'links' are found, even though the MS Word documents contain no
links. There is also trouble when swish-e encounters an encrypted .zip
file: I can see it spending a lot of time there. See below.)

It starts off normally:
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from
'/share/MD0_DATA/swish-e-files/swish-e-conf/spider.config'

 -- Starting to spider: http://localhost:104/_docs/test3/ --
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/'
+Passed all 1 tests for 'test_url' user supplied function

vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/test3/
vvvvvvvvvvvvvvvvvvvvv

---- Request ------
HEAD http://localhost:104/_docs/test3/
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/


---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 13:53:03 GMT
Server: Apache
Content-Type: text/html;charset=ISO-8859-1
Client-Date: Fri, 16 Mar 2012 13:53:03 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 2

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^


vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/test3/
vvvvvvvvvvvvvvvvvvvvv

---- Request ------
GET http://localhost:104/_docs/test3/
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/


---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 13:53:03 GMT
Server: Apache
Content-Length: 423
Content-Type: text/html;charset=ISO-8859-1
Client-Date: Fri, 16 Mar 2012 13:53:03 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 3

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^

>> +Fetched 0 Cnt: 1 GET  http://localhost:104/_docs/test3/  200 OK
text/html 423 parent: depth:0

Extracting links from http://localhost:104/_docs/test3/:

Looking at extracted tag '<a href="/_docs/">'
?Testing 'test_url' user supplied function #1 'http://localhost:104/_docs/'
+Passed all 1 tests for 'test_url' user supplied function
   href="http://localhost:104/_docs/" Added to list of links to follow

Looking at extracted tag '<a href="Reception-duties-2.doc">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties-2.doc'
+Passed all 1 tests for 'test_url' user supplied function
   href="http://localhost:104/_docs/test3/Reception-duties-2.doc" Added to
list of links to follow

Looking at extracted tag '<a href="Reception-duties.doc">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties.doc'
+Passed all 1 tests for 'test_url' user supplied function
   href="http://localhost:104/_docs/test3/Reception-duties.doc" Added to
list of links to follow
! Found 3 links in http://localhost:104/_docs/test3/

Warning: document 'http://localhost:104/_docs/test3/' could not be encoded
to charset 'ISO-8859-1'

vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/
vvvvvvvvvvvvvvvvvvvvv


***************************************************************************
****************************************************************************
There are many messages like this:
Looking at extracted tag '<a
href="Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf'
+Passed all 1 tests for 'test_url' user supplied function
   href="http://localhost:104/_docs/Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf"
Added to list of links to follow
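
A possible explanation for the extra links, judging only from the log in
this message: the directory listing at /_docs/test3/ contains a link to its
parent (/_docs/), and the current test_url rejects only gif, jpeg and png,
so the spider climbs to the parent directory and fetches everything it finds
there, including non-document files such as the .zip archive. Below is a
minimal sketch of a stricter test_url for spider.config. The names
$base_path and $test_url_sub are hypothetical, the extension list is copied
from IndexOnly in web_2.conf, and the remaining keys are assumed to stay as
in the original file.

```perl
# Hypothetical replacement for the test_url entry in spider.config.
# $base_path and the extension list are assumptions taken from this thread.
my $base_path = '/_docs/test3/';

my $test_url_sub = sub {
    my ($uri) = @_;            # the first argument is a URI object
    my $path = $uri->path;

    # Stay inside the starting directory; the listing links to its parent.
    return 0 unless index( $path, $base_path ) == 0;

    # Follow directory listings below the starting directory.
    return 1 if $path =~ m{/$};

    # Otherwise accept only the file types named in IndexOnly.
    return $path =~ /\.(?:html?|txt|doc|pdf|xls)$/i ? 1 : 0;
};

@servers = (
    {
        base_url => 'http://localhost:104/_docs/test3/',
        test_url => $test_url_sub,
        # ... remaining keys as in the original spider.config ...
    },
);
```

With this in place the link to /_docs/ and any .zip URL should show up as
skipped in the debug output instead of being fetched.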

************************************************************************************
***********************************************************************************

When the encrypted zip file is encountered:
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:49: error:
htmlParseStartTag: invalid element name
[binary zip data]
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:50: error:
htmlParseStartTag: invalid element name
[binary zip data]
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:51: error:
htmlParseEntityRef: no name
[binary zip data]
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:51: error:
htmlParseStartTag: invalid element name
[binary zip data]



Why is your .xls being indexed as .pdf?

What are the contents of
/share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf?

Again, break this down to a single URL to isolate your problem. Try turning on
the spider debug options too:

http://swish-e.org/docs/spider.html#debug
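
One way to follow this advice is a cut-down spider.config like the sketch
below. It assumes the max_depth option described in the spider.pl
documentation (0 meaning only the base_url page is fetched) and uses a
placeholder email address. The debug keywords are a subset of those already
used earlier in this thread.

```perl
# Sketch of a single-URL spider.config for isolating the problem.
@servers = (
    {
        # Start with one document that is known to index cleanly.
        base_url  => 'http://localhost:104/_docs/test3/Reception-duties.doc',
        email     => 'swish@example.invalid',    # placeholder address
        max_depth => 0,    # assumed: do not follow links beyond base_url
        debug     => 'errors, failed, skipped, url',
    },
);
```

Once a single URL indexes cleanly, base_url can be changed to the next
problem file (for example one .xls or .pdf) and the run repeated.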



--
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users


Thanks!


Received on Fri Mar 16 2012 - 14:13:29 GMT