Re: [swish-e] Using ExtractPath to Exclude Some Subdirectory from Search Result

From: Ronny Rahardjo <rrahardjo(at)not-real.gmail.com>
Date: Thu Oct 22 2009 - 21:40:26 GMT
Hi Peter,

Sorry to come back to this, but I am still unable to solve my issue.
First of all, there are a lot of duplicate files floating around in
different folders, and I would like to understand which file is being called.
Going back to your suggestion of adding a die statement ('this file I use')
and then running spider.pl: could you please let me know how to run spider.pl?
All I know is how to run RunIndex.bat. When I try to run swish-e from the
command prompt, C:\SWISH-E\bin>swish-e.exe -c swishe.config, I get the
following message:

Indexing Data Source: "File-System"
Indexing "spider.pl"

Warning: Invalid path 'spider.pl': No such file or directory

Any ideas?

On Fri, Sep 18, 2009 at 8:07 PM, Peter Karman <peter@peknet.com> wrote:

>
>
> Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> > Hi Peter,
> >
> > Please ignore my question no. 1. I was able to figure out which spider.pl
> > is being called. However, could you please let me know how I can check
> > whether my spider.pl is using spiderconfig.pl? I found spiderconfig.pl
> > in the same folder as swish.config, but I don't see any reference to it
> > in spider.pl.
>
> try putting a:
>
>  die "yes, you are using me!";
>
> statement at the top of spiderconfig.pl and then run the spider.pl.
>
> However, this line in the config you posted here:
>
> SwishProgParameters default http://www.domainname.com/index.html
>
> suggests that you are using the default config, not your spiderconfig.pl file.
>
> >
> > And secondly, how can I exclude an "a href=#tab" link in spider.pl?
>
> I think spider.pl will ignore a link like '#tab' since that's just a
> self-referential link. Example:
>
> [karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default http://localhost/~karpet/tab.html
> /Users/karpet/bin/spider.pl: Reading parameters from 'default'
>
>  -- Starting to spider: http://localhost/~karpet/tab.html --
> >> +Fetched 0 Cnt: 1 GET  http://localhost/~karpet/tab.html  200 OK text/html 141 parent: depth:0
>
> Extracting links from http://localhost/~karpet/tab.html:
>
> Looking at extracted tag '<a href="#tab">'
>  tag did not include any links to follow or is a duplicate
> Path-Name: http://localhost/~karpet/tab.html
> Content-Length: 141
> Last-Mtime: 1253329219
> Document-Type: html*
>
> <html>
>  <head>
>  <title>test doc</title>
>  </head>
>  <body>
>
>  foo bar <a href="#tab">nothing to see here</a> and more here
>
>  </body>
> </html>
>
>
> Summary for: http://localhost/~karpet/tab.html
> Connection: Close:   1  (1.0/sec)
>       Duplicates:   1  (1.0/sec)
>      Total Bytes: 141  (141.0/sec)
>       Total Docs:   1  (1.0/sec)
>      Unique URLs:   1  (1.0/sec)
>        text/html:   1  (1.0/sec)
>
>
>
>
> So I think you need to run spider.pl with your config against a test
> document and see what kind of output you get. Turn on the debugging options
> like I suggested. Ultimately, you're the only one who is going to discover
> the answer to your problem. I'm just suggesting approaches to try.
>
> --
>  Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>
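
To illustrate why the debug output in the quoted reply reports the '#tab'
link as "did not include any links to follow or is a duplicate", here is a
minimal sketch in Python. It is not spider.pl's actual Perl code. It only
assumes that a crawler resolves each href against the page URL, drops the
fragment, and skips URLs it has already seen. The function name new_links
and the link other.html#top are hypothetical, made up for this example.

```python
from urllib.parse import urljoin, urldefrag


def new_links(page_url, hrefs, seen):
    """Return the links in hrefs that have not been seen yet.

    Illustrative only: assumes fragments are ignored when comparing URLs.
    """
    to_fetch = []
    for href in hrefs:
        absolute = urljoin(page_url, href)        # resolve against the page URL
        target, _fragment = urldefrag(absolute)   # drop the '#...' part
        if target in seen:
            continue                              # same document: a duplicate
        seen.add(target)
        to_fetch.append(target)
    return to_fetch


page = "http://localhost/~karpet/tab.html"
seen = {page}  # the page itself has already been fetched

# '#tab' resolves back to the page itself; 'other.html#top' (hypothetical)
# resolves to a different document, so only that one is returned.
print(new_links(page, ["#tab", "other.html#top"], seen))
```

With only '#tab' on the page, the returned list is empty, which is
consistent with the single fetch shown in the debug output.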


Received on Thu Oct 22 17:40:30 2009