Hi,
I'm using swish-e 2.4.5 to index a web site with spider.pl, but I run
into parse errors when indexing.
Here is my setup:
swish.config:
SwishProgParameters /var/www/critik/swish/spider.config
IndexDir spider.pl
IndexFile /var/www/critik/swish/index.swish-e
spider.config:
@servers = ({
    skip       => 0,   # skip spidering this server
    base_url   => 'http://trajan.apartia.fr/index.md',
    agent      => 'swish-e spider http://swish-e.org/',
    email      => 'swish@domain.invalid',
    test_url   => sub { $_[0]->path =~ /\.md?$/ },
    delay_sec  => 1,   # delay in seconds between requests
    keep_alive => 1,   # enable keep-alive requests
} );
1;
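A side note on the test_url filter above: the `?` makes the `d` optional, so the pattern accepts paths ending in `.m` as well as `.md`. The sketch below transcribes the same pattern into Python's `re` module purely for illustration. The helper name `accepts` is hypothetical and not part of spider.pl.

```python
import re

# Same pattern as in test_url: a literal dot, "m", an optional "d", end of string.
TEST_URL_PATTERN = re.compile(r"\.md?$")

def accepts(path: str) -> bool:
    """Return True if the URL path would pass the test_url filter (hypothetical helper)."""
    return TEST_URL_PATTERN.search(path) is not None

if __name__ == "__main__":
    for path in ("/index.md", "/notes.m", "/index.html", "/index.mdx"):
        print(path, accepts(path))
```

If only `.md` pages are wanted, dropping the `?` would tighten the filter. Whether that matters here depends on the URLs the site actually serves.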
Spidering itself runs fine:
% /usr/lib/swish-e/spider.pl /var/www/critik/swish/spider.config
[LOTS OF HTML OUTPUT]
Summary for: http://trajan.apartia.fr/index.md
Connection: Close: 1 (0.2/sec)
Connection: Keep-Alive: 32 (5.3/sec)
Duplicates: 233 (38.8/sec)
Off-site links: 38 (6.3/sec)
Skipped: 1,098 (183.0/sec)
Total Bytes: 711,676 (118612.7/sec)
Total Docs: 33 (5.5/sec)
Unique URLs: 33 (5.5/sec)
But when I run swish-e itself, indexing fails (see the output below). Do
you see anything obvious that I'm missing?
Thanks,
% swish-e -c swish.config -S prog
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/lib/swish-e/spider.pl
/usr/lib/swish-e/spider.pl: Reading parameters from '/var/www/critik/swish/spider.config'
http://trajan.apartia.fr/index.md:158: error: htmlParseEntityRef: expecting ';'
r, qui sont tenues d'arborer une <a href="http://www.dessy.com/?go=dresses&style
^
Warning: Unknown header line: 'CTYPE html' from program spider.pl
Warning: Unknown header line: 'PUBLIC "-//W3C//DTD HTML 4.01//EN"' from program spider.pl
Warning: Unknown header line: '"http://www.w3.org/TR/html4/strict.dtd">' from program spider.pl
Warning: Unknown header line: '<html lang="fr-FR"><head><title>LesCulturelles.Net - théâtre: critiques et actualité</title>' from program spider.pl
Warning: Unknown header line: '<link rev="made" href="mailto:ldm%40apartia.fr">' from program spider.pl
Warning: Unknown header line: '<base href="http://trajan.apartia.fr/theatre.md">' from program spider.pl
Warning: Unknown header line: '<meta name="keywords" content="critique spectacle actualité culturelle paris théâtre cinéma' from program spider.pl
Warning: Unknown header line: 'littérature danse arts plastiques">' from program spider.pl
Warning: Unknown header line: '<meta name="copyright" content="Copyright 2007 Apartia">' from program spider.pl
Warning: Unknown header line: '<meta name="description" content="Toute l'actualité culturelle: critiques de théâtre, cinéma,' from program spider.pl
Warning: Unknown header line: 'littérature, danse">' from program spider.pl
Warning: Unknown header line: '<link title="Nos 10 dernières critiques" type="application/rss+xml" rel="alternate" href="http://www.lesculturelles.net/util/latest_reviews.mx">' from program spider.pl
Warning: Unknown header line: '<link title="L'actualité théâtrale" type="application/rss+xml" rel="alternate" href="http://www.lesculturelles.net/util/section.mx?display=Actualit%E9;ah_color=red;title=actualite;show_type=th%E9%E2tre;view=actualite">' from program spider.pl
Warning: Unknown header line: '<link rel="stylesheet" type="text/css" href="/css/style.css">' from program spider.pl
Warning: Unknown header line: '<script src="/js/common.js" type="text/javascript"></script>' from program spider.pl
Warning: Unknown header line: '</head>' from program spider.pl
Warning: Unknown header line: '<body>' from program spider.pl
err: External program failed to return required headers Path-Name:
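For context on the final error: as I understand the -S prog interface, swish-e expects each document on the program's standard output to start with a small header block (`Path-Name:` and `Content-Length:` are required, `Last-Mtime:` is optional), followed by a blank line and then exactly `Content-Length` bytes of content. The warnings above show HTML being read where headers were expected. The Python sketch below only illustrates that record layout. The function name `format_prog_record` is hypothetical, and this is not how spider.pl itself is implemented.

```python
from typing import Optional

def format_prog_record(path_name: str, body: bytes,
                       last_mtime: Optional[int] = None) -> bytes:
    """Build one document record in the layout described above (hypothetical helper)."""
    headers = [
        f"Path-Name: {path_name}",
        # Length is counted in bytes of the body, not characters.
        f"Content-Length: {len(body)}",
    ]
    if last_mtime is not None:
        headers.append(f"Last-Mtime: {last_mtime}")
    # A blank line separates the headers from the document content.
    head = "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body

if __name__ == "__main__":
    import sys
    record = format_prog_record("http://trajan.apartia.fr/index.md",
                                b"<html></html>")
    sys.stdout.buffer.write(record)
```

If the byte count in `Content-Length` does not match what is actually written, the reader can lose its place in the stream and treat document content as header lines. That would be consistent with the "Unknown header line" warnings, though it is only one possible cause.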
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Dec 6 13:46:44 2007