For more information, I found that my spider.pl is in the apache/cgi-bin folder. So
when I run my scheduled task, it runs just fine, but when I run it from the command
prompt and call swishe.config, it throws the error I sent you previously: no
such directory for spider.pl.
Please give me a pointer on how to debug which file the system is using. Thanks.
On Fri, Sep 18, 2009 at 8:07 PM, Peter Karman <email@example.com> wrote:
> Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> > Hi Peter,
> > Please ignore my question no. 1. I was able to figure out which spider.pl
> > is being called. However, could you please let me know how I can check
> > whether my spider.pl is using spiderconfig.pl? I found spiderconfig.pl
> > in the same folder as swish.config, but I don't see any reference to it in
> > spider.pl.
> try putting a:
> die "yes, you are using me!";
> statement at the top of spiderconfig.pl and then run the spider.pl.
> However, this line in the config you posted here:
> SwishProgParameters default http://www.domainname.com/index.html
> suggests that you are using the default config, not your spiderconfig.pl file.
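> If it helps, a config along these lines should hand your own file to
> spider.pl instead of the built-in defaults. This is only a sketch: the
> location of spiderconfig.pl (assumed here to be the directory you run
> swish-e from) and the way spider.pl is located are assumptions, so adjust
> the paths to match your setup. The start URL would then come from the
> spider config file rather than from this line.
>
> ```
> # swish.config (sketch; paths are assumptions)
> IndexDir spider.pl
> # The first parameter names the spider config file instead of the keyword "default"
> SwishProgParameters spiderconfig.pl
> ```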
> > And secondly, how can I exclude "a href=#tab" links in spider.pl?
> I think spider.pl will ignore a link like '#tab' since that's just a
> self-referential link. Example:
> [karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default
> /Users/karpet/bin/spider.pl: Reading parameters from 'default'
> -- Starting to spider: http://localhost/~karpet/tab.html --
> >> +Fetched 0 Cnt: 1 GET http://localhost/~karpet/tab.html 200 OK
> 141 parent: depth:0
> Extracting links from http://localhost/~karpet/tab.html:
> Looking at extracted tag '<a href="#tab">'
> tag did not include any links to follow or is a duplicate
> Path-Name: http://localhost/~karpet/tab.html
> Content-Length: 141
> Last-Mtime: 1253329219
> Document-Type: html*
> <title>test doc</title>
> foo bar <a href="#tab">nothing to see here</a> and more here
> Summary for: http://localhost/~karpet/tab.html
> Connection: Close: 1 (1.0/sec)
> Duplicates: 1 (1.0/sec)
> Total Bytes: 141 (141.0/sec)
> Total Docs: 1 (1.0/sec)
> Unique URLs: 1 (1.0/sec)
> text/html: 1 (1.0/sec)
> So I think you need to run spider.pl with your config against a test page
> and see what kind of output you get. Turn on the debugging options like I
> suggested. Ultimately, you're the only one who is going to discover the
> solution to your problem. I'm just suggesting approaches to try.
> Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
> Users mailing list
Received on Thu Oct 22 18:12:12 2009