I am invoking indexing via
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
********************************************************************************************************************************************************
web_2.conf contents:
IndexDir spider.pl
SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config
IndexOnly .htm .html .txt .doc .pdf .xls
IndexContents TXT* .txt .xls
# Otherwise, use the HTML parser
DefaultContents HTML*
# I only added the FileFilter options below to web_2.conf today (Friday)
FileFilter .pdf pdftotext "'%p' -"
FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"
ReplaceRules remove /share/MD0_DATA/server_dir/
Metanames swishtitle swishdocpath
StoreDescription TXT 200
StoreDescription HTML <body> 200
IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index
****************************************************************************
****************************************************************************
spider.config contents:
@servers = (
{
base_url => 'http://localhost:104/_docs/test3/',
#base_url => 'http://localhost:104/_docs/test3/Reception-duties.doc',
email => 'swish(at)not-real.user.failed.to.set.email.invalid',
link_tags => [qw/ a frame /],
keep_alive => 1,
test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
test_response => $response_sub,
use_head_requests => 1, # Due to the response sub
filter_content => $filter_sub,
debug => 'errors, failed, headers, info, links, redirect, skipped, url',
} );
****************************************************************************
****************************************************************************
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
(All seems to be well when indexing *only* one file, i.e. when I have this
setting in spider.config:
base_url => 'http://localhost:104/_docs/test3/Reception-duties.doc')
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from
'/share/MD0_DATA/swish-e-files/swish-e-conf/spider.config'
-- Starting to spider:
http://localhost:104/_docs/test3/Reception-duties.doc --
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties.doc'
+Passed all 1 tests for 'test_url' user supplied function
vvvvvvvvvvvvvvvv HEADERS for
http://localhost:104/_docs/test3/Reception-duties.doc
vvvvvvvvvvvvvvvvvvvvv
---- Request ------
HEAD http://localhost:104/_docs/test3/Reception-duties.doc
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/
---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 12:18:49 GMT
Accept-Ranges: bytes
ETag: "3a2030d-9a00-4bb3e0f6cd4aa"
Server: Apache
Content-Length: 39424
Content-Type: application/msword
Last-Modified: Thu, 15 Mar 2012 01:32:07 GMT
Client-Date: Fri, 16 Mar 2012 12:18:49 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 2
^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvv HEADERS for
http://localhost:104/_docs/test3/Reception-duties.doc
vvvvvvvvvvvvvvvvvvvvv
---- Request ------
GET http://localhost:104/_docs/test3/Reception-duties.doc
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/
---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 12:18:49 GMT
Accept-Ranges: bytes
ETag: "3a2030d-9a00-4bb3e0f6cd4aa"
Server: Apache
Content-Length: 39424
Content-Type: application/msword
Last-Modified: Thu, 15 Mar 2012 01:32:07 GMT
Client-Date: Fri, 16 Mar 2012 12:18:49 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 3
^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +Fetched 0 Cnt: 1 GET
http://localhost:104/_docs/test3/Reception-duties.doc 200 OK
application/msword 39424 parent: depth:0
Summary for: http://localhost:104/_docs/test3/Reception-duties.doc
Connection: Close: 1 (1.0/sec)
Total Bytes: 39,424 (39424.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Thursday & Friday AM
^
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Organising files for Women's & Men'sHealth Physio for existing clients and
mak
^
http://localhost:104/_docs/test3/Reception-duties.doc:22: error:
htmlParseEntityRef: no name
Thursday & Friday PM
^
http://localhost:104/_docs/test3/Reception-duties.doc:43: error:
htmlParseEntityRef: no name
^
http://localhost:104/_docs/test3/Reception-duties.doc:43: error:
htmlParseStartTag: invalid element name
^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name
^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseStartTag: invalid element name
^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name
^
http://localhost:104/_docs/test3/Reception-duties.doc:44: error:
htmlParseEntityRef: no name
^
http://localhost:104/_docs/test3/Reception-duties.doc:49: error:
htmlParseEntityRef: no name
^
http://localhost:104/_docs/test3/Reception-duties.doc:49: error:
htmlParseStartTag: invalid element name
^
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 309 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
309 unique words indexed.
5 properties sorted.
1 file indexed. 39,424 total bytes. 589 total words.
Elapsed time: 00:00:02 CPU time: 00:00:01
Indexing done!
[/root] #
****************************************************************************
****************************************************************************
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
(All is NOT well when indexing a directory, i.e. when I instead have this
setting in spider.config:
base_url => 'http://localhost:104/_docs/test3/'
What happens here is that 'links' are found, even though MS Word documents
contain no links. There is also trouble when swish-e encounters an encrypted
.zip file: I can see it spends a lot of time there; see below.)
It starts off normally:
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from
'/share/MD0_DATA/swish-e-files/swish-e-conf/spider.config'
-- Starting to spider: http://localhost:104/_docs/test3/ --
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/'
+Passed all 1 tests for 'test_url' user supplied function
vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/test3/
vvvvvvvvvvvvvvvvvvvvv
---- Request ------
HEAD http://localhost:104/_docs/test3/
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/
---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 13:53:03 GMT
Server: Apache
Content-Type: text/html;charset=ISO-8859-1
Client-Date: Fri, 16 Mar 2012 13:53:03 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 2
^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/test3/
vvvvvvvvvvvvvvvvvvvvv
---- Request ------
GET http://localhost:104/_docs/test3/
Accept-Encoding: gzip, x-gzip, deflate
From: swish(at)not-real.user.failed.to.set.email.invalid
User-Agent: swish-e http://swish-e.org/
---- Response ---
Status: 200 OK
Date: Fri, 16 Mar 2012 13:53:03 GMT
Server: Apache
Content-Length: 423
Content-Type: text/html;charset=ISO-8859-1
Client-Date: Fri, 16 Mar 2012 13:53:03 GMT
Client-Peer: 127.0.0.1:104
Client-Response-Num: 3
^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +Fetched 0 Cnt: 1 GET http://localhost:104/_docs/test3/ 200 OK
text/html 423 parent: depth:0
Extracting links from http://localhost:104/_docs/test3/:
Looking at extracted tag '<a href="/_docs/">'
?Testing 'test_url' user supplied function #1 'http://localhost:104/_docs/'
+Passed all 1 tests for 'test_url' user supplied function
href="http://localhost:104/_docs/" Added to list of links to follow
Looking at extracted tag '<a href="Reception-duties-2.doc">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties-2.doc'
+Passed all 1 tests for 'test_url' user supplied function
href="http://localhost:104/_docs/test3/Reception-duties-2.doc" Added to
list of links to follow
Looking at extracted tag '<a href="Reception-duties.doc">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/test3/Reception-duties.doc'
+Passed all 1 tests for 'test_url' user supplied function
href="http://localhost:104/_docs/test3/Reception-duties.doc" Added to
list of links to follow
! Found 3 links in http://localhost:104/_docs/test3/
Warning: document 'http://localhost:104/_docs/test3/' could not be encoded
to charset 'ISO-8859-1'
vvvvvvvvvvvvvvvv HEADERS for http://localhost:104/_docs/
vvvvvvvvvvvvvvvvvvvvv
***************************************************************************
****************************************************************************
There are lots of messages of this kind:
Looking at extracted tag '<a
href="Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf">'
?Testing 'test_url' user supplied function #1
'http://localhost:104/_docs/Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf'
+Passed all 1 tests for 'test_url' user supplied function
href="http://localhost:104/_docs/Annual%20General%20Meeting%2031stAug_2009%20Draft%20Minutes.pdf"
Added to list of links to follow
************************************************************************************
***********************************************************************************
When the encrypted zip file is encountered:
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:49: error:
htmlParseStartTag: invalid element name
ì!ðÂÂR<»
^
httpÃcèÃü½±éÂeÂe(Âz-7HÂëþÃÂÃ}\¹Âcp¢ÃõI<2.zip:50: error:
htmlParseStartTag: invalid element name
^
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:51: error:
htmlParseEntityRef: no name
´n\ISÃÂÃ"dAäà &¨=«KþdKe(ÂÃÂÃ|ZÃÂñº=Ã!O^Ãý
^
http://localhost:104/_docs/2008%20Log%20of%20hours2.zip:51: error:
htmlParseStartTag: invalid element name
rZÃúlãB§O¿9¨T¨dÃ+¯:
5êêÃ
§>GmQ ÂÃ)Ã+éÃ
Why is your .xls being indexed as .pdf?
What are the contents of
/share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf?
Again, break this down to a single URL to isolate your problem. Try turning
on the spider debug options too:
http://swish-e.org/docs/spider.html#debug
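
One way to narrow the crawl, going by the log earlier in this message: the
directory listing for test3/ contains a link back to '/_docs/', and the spider
follows it, which appears to be how it reaches files outside test3/ such as
the .zip. The fragment below is only a sketch of a stricter test_url for
spider.config. The names $base_path and $test_url_sub are made up for
illustration, and the assumption is that nothing outside /_docs/test3/ should
be fetched and that .zip files should be skipped outright.

```perl
# Hypothetical fragment for spider.config; not taken from the original config.
# Assumption: the crawl should stay under the directory named in base_url.
my $base_path = '/_docs/test3/';

my $test_url_sub = sub {
    my ( $uri, $server ) = @_;    # spider.pl passes a URI object first
    my $path = $uri->path;

    # Reject anything outside the base directory, e.g. the parent link '/_docs/'.
    return 0 unless index( $path, $base_path ) == 0;

    # Skip images (as in the existing test) and, additionally, .zip archives.
    return 0 if $path =~ /\.(?:gif|jpeg|png|zip)$/i;

    return 1;                     # a true value tells the spider to fetch the URL
};

# Then, inside the server entry, replace the existing test_url line with:
#     test_url => $test_url_sub,
```

With this in place, the 'test_url' debug lines should report '/_docs/' and the
.zip URL as rejected rather than "Added to list of links to follow". Whether
that is the desired behaviour depends on which directories are meant to be
indexed.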
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Thanks!
Received on Fri Mar 16 2012 - 14:13:29 GMT