Re: [swish-e] Using ExtractPath to Exclude Some Subdirectory from Search Result

From: Ronny Rahardjo <rrahardjo(at)not-real.gmail.com>
Date: Fri Oct 23 2009 - 00:19:09 GMT

Hi Peter,

I made some progress. I was able to reduce the duplication. It seems the
root of the issue was in swish.config, where I specify
SwishProgParameters.
I had specified both
SwishProgParameters default http://www.domainname.com/index.html
and also spiderconfig.pl.
I decided to take out the default one and use only spiderconfig.pl, and
everything seems to work fine.
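
For anyone following along, here is a minimal sketch of that setup. It
assumes spider.pl is on the path and spiderconfig.pl sits next to
swish.config; the file names come from this thread, and the rest is an
assumption rather than a copy of the actual config:

    # swish.config (sketch only)
    # Run spider.pl as the external program that feeds documents to swish-e
    IndexDir spider.pl
    # Pass only the spider config file; "default <url>" would make
    # spider.pl use its built-in default settings instead
    SwishProgParameters spiderconfig.pl

With a config like this, indexing would typically be started with
something like "swish-e -S prog -c swish.config".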

The only problem now is that the rank does not seem correct for some
results. The rank can be 477 for 5 rows and then 366 for another 7 rows.
Also, sometimes we cannot find the word we searched for, yet it appears on
the result page.

Thanks much.

On Fri, Sep 18, 2009 at 8:07 PM, Peter Karman <peter@peknet.com> wrote:

>
>
> Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> > Hi Peter,
> >
> > Please ignore my question no. 1. I was able to figure out which spider.pl
> > is being called. However, could you please let me know how I can check
> > whether my spider.pl is using spiderconfig.pl? I found spiderconfig.pl
> > in the same folder as swish.config, but I don't see any reference to it
> > in spider.pl.
>
> try putting a:
>
>  die "yes, you are using me!";
>
> statement at the top of spiderconfig.pl and then run the spider.pl.
>
> However, this line in the config you posted here:
>
> SwishProgParameters default http://www.domainname.com/index.html
>
> suggests that you are using the default config, not your spiderconfig.pl file.
>
> >
> > And secondly, how can I exclude "a href=#tab" links in spider.pl?
>
> I think spider.pl will ignore a link like '#tab' since that's just a
> self-referential link. Example:
>
> [karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default http://localhost/~karpet/tab.html
> /Users/karpet/bin/spider.pl: Reading parameters from 'default'
>
>  -- Starting to spider: http://localhost/~karpet/tab.html --
> >> +Fetched 0 Cnt: 1 GET  http://localhost/~karpet/tab.html  200 OK text/html 141 parent: depth:0
>
> Extracting links from http://localhost/~karpet/tab.html:
>
> Looking at extracted tag '<a href="#tab">'
>  tag did not include any links to follow or is a duplicate
> Path-Name: http://localhost/~karpet/tab.html
> Content-Length: 141
> Last-Mtime: 1253329219
> Document-Type: html*
>
> <html>
>  <head>
>  <title>test doc</title>
>  </head>
>  <body>
>
>  foo bar <a href="#tab">nothing to see here</a> and more here
>
>  </body>
> </html>
>
>
> Summary for: http://localhost/~karpet/tab.html
> Connection: Close:   1  (1.0/sec)
>       Duplicates:   1  (1.0/sec)
>      Total Bytes: 141  (141.0/sec)
>       Total Docs:   1  (1.0/sec)
>      Unique URLs:   1  (1.0/sec)
>        text/html:   1  (1.0/sec)
>
>
>
>
> So I think you need to run spider.pl with your config against a test
> document and see what kind of output you get. Turn on the debugging
> options like I suggested. Ultimately, you're the only one who is going to
> discover the answer to your problem. I'm just suggesting approaches to try.
>
> --
>  Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>


Received on Thu Oct 22 20:19:13 2009