Hi experts,
I am a swish-e newbie and am trying to index hypermail archives on a remote web server.
swish-e version: 2.4.5
OS: GNU/Linux
The configuration files:
$cat swish.conf
IndexDir spider.pl
SwishProgParameters spider.conf
IndexOnly .htm .html .txt .pdf .doc .ppt .xml
IndexContents TXT* .txt
DefaultContents HTML*
ParserWarnLevel 9
$cat spider.conf
my %theta13_general = (
email => 'tianxc@ihep.ac.cn',
base_url => 'https://www.lbl.gov/lists.archives/theta13-general.archive/',
delay_sec => '0',
max_depth => '1',
credentials => 'username:password'
);
@servers = ( \%theta13_general );
1;
When I run
$swish-e -c swish.conf -S prog
the error messages are:
===========================================================================
Indexing Data Source: "External-Program"
Indexing "spider.pl"
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
https://www.lbl.gov/lists.archives/theta13-general.archive/:1: error: htmlParseStartTag: invalid element name
<?xml version="1.0" encoding="ISO-8859-1"?>
^
https://www.lbl.gov/lists.archives/theta13-general.archive/:2: error: Misplaced DOCTYPE declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
^
External Program found: /usr/local/lib/swish-e/spider.pl
Warning: Unknown header line: 'g="ISO-8859-1"?>' from program spider.pl
Warning: Unknown header line: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"' from program spider.pl
Warning: Unknown header line: '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' from program spider.pl
Warning: Unknown header line: '<html xmlns="http://www.w3.org/1999/xhtml" lang="en">' from program spider.pl
Warning: Unknown header line: '<head>' from program spider.pl
Warning: Unknown header line: '<meta name="generator" content="hypermail 2.2.0, see http://www.hypermail-project.org/" />' from program spider.pl
Warning: Unknown header line: '<title>theta13-general list: by author</title>' from program spider.pl
Warning: Unknown header line: '<meta name="Subject" content="by author" />' from program spider.pl
Warning: Unknown header line: '<style type="text/css">' from program spider.pl
Warning: Unknown header line: '/*<![CDATA[*/' from program spider.pl
...................
Warning: Unknown header line: '</map>' from program spider.pl
Warning: Unknown header line: '<!-- trailer="footer" -->' from program spider.pl
Warning: Unknown header line: '<p><small><em>' from program spider.pl
Warning: Unknown header line: 'This archive was generated by <a href="http://www.hypermail-project.org/">hypermail 2.2.0</a>' from program spider.pl
Warning: Unknown header line: ': Sat Mar 01 2008 - 22:40:07 PST' from program spider.pl
Warning: Unknown header line: '</em></small></p>' from program spider.pl
Warning: Unknown header line: '</div>' from program spider.pl
Warning: Unknown header line: '</body>' from program spider.pl
Warning: Unknown header line: '</html>' from program spider.pl
Warning: Unknown header line: '://www.lbl.gov/lists.archives/theta13-general.archive/1237.html' from program spider.pl
err: External program failed to return required headers Path-Name:
.
===============================================================================
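For context, my understanding from the swish-e documentation is that with -S prog the external program must write each document as a block of headers (at least Path-Name: and Content-Length:), then a blank line, then exactly Content-Length bytes of content. The warnings above look as if swish-e lost track of where one document ends and the next begins, but that is only my guess. A minimal sketch of the record format as I understand it (the function name and the sample URL are made up for illustration and are not part of spider.pl):

```python
def format_prog_record(path_name: str, content: bytes) -> bytes:
    """Build one document record: headers, a blank line, then the raw content."""
    headers = (
        f"Path-Name: {path_name}\n"
        # Assumption: the length is counted in bytes, not characters.
        f"Content-Length: {len(content)}\n"
        "\n"
    )
    return headers.encode("utf-8") + content


# Hypothetical document, encoded the same way the archive pages declare.
sample_body = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<html><body>caf\u00e9</body></html>\n"
).encode("iso-8859-1")

sample_record = format_prog_record("https://example.invalid/1237.html", sample_body)
```

If the Content-Length value does not match the number of bytes actually sent, I would expect swish-e to read part of the page as header lines, which resembles what I see.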
Any help? Thanks in advance.
Best Regards,
Xinchun Tian
2008-03-02
Received on Sun Mar 2 01:59:04 2008