Re: [swish-e] Using ExtractPath to Exclude Some Subdirectory from Search Result

From: Ronny Rahardjo <rrahardjo(at)not-real.gmail.com>
Date: Thu Oct 22 2009 - 22:51:16 GMT
Hi Peter,

Sorry for the repeated email. I have been trying to figure out this tool
for a couple of days and am hoping that you can help me.
Briefly, about my issue: the tool worked fine in the past, before we
rebuilt our site with new content and a new look and feel. When I replace
my index file with the old one, it works fine. However, when I run swish-e
and reindex, it breaks the results.
The ranking isn't right, results are duplicated, etc. So I concluded that
the issue is in the indexing, not in the searching.

What I thought might be the problem is that, in the new site, we
implemented multiple tab views within a page.

And now, slowly, I am starting to trace the path here:
1. We have swishe.config.
2. In swishe.config we specify IndexDir spider.pl and IndexFile
swish.idx.
3. However, in spider.pl I don't see any customization; everything is
original code.
So I guess I can just focus on my spider.pl and swishe.config. Can I
send you these two files for you to take a look?
If so, do you have a private email address I should send them to, or is
this one fine? I can also give you the URL so you can look at the issue,
because the site is up and running publicly.

Looking forward to hearing from you.

Thanks.
On Fri, Sep 18, 2009 at 8:07 PM, Peter Karman <peter@peknet.com> wrote:

>
>
> Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> > Hi Peter,
> >
> > Please ignore my question no.1. I was able to figure out which spider.pl
> > it is called. However, could you please let me know how can I check
> > whether my spider.pl is using spiderconfig.pl. I found spiderconfig.pl
> > in the same folder as swish.config, but I don't see any reference in the
> > spider.pl.
>
> try putting a:
>
>  die "yes, you are using me!";
>
> statement at the top of spiderconfig.pl and then run the spider.pl.
>
> However, this line in the config you posted here:
>
> SwishProgParameters default http://www.domainname.com/index.html
>
> suggests that you are using the default config, not your spiderconfig.pl
> file.
>
> >
> > And secondly, how can I exclude "a href=#tab" link in spider.pl
>
> I think spider.pl will ignore a link like '#tab' since that's just a
> self-referential link. Example:
>
> [karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default http://localhost/~karpet/tab.html
> /Users/karpet/bin/spider.pl: Reading parameters from 'default'
>
>  -- Starting to spider: http://localhost/~karpet/tab.html --
> >> +Fetched 0 Cnt: 1 GET  http://localhost/~karpet/tab.html  200 OK text/html 141 parent: depth:0
>
> Extracting links from http://localhost/~karpet/tab.html:
>
> Looking at extracted tag '<a href="#tab">'
>  tag did not include any links to follow or is a duplicate
> Path-Name: http://localhost/~karpet/tab.html
> Content-Length: 141
> Last-Mtime: 1253329219
> Document-Type: html*
>
> <html>
>  <head>
>  <title>test doc</title>
>  </head>
>  <body>
>
>  foo bar <a href="#tab">nothing to see here</a> and more here
>
>  </body>
> </html>
>
>
> Summary for: http://localhost/~karpet/tab.html
> Connection: Close:   1  (1.0/sec)
>       Duplicates:   1  (1.0/sec)
>      Total Bytes: 141  (141.0/sec)
>       Total Docs:   1  (1.0/sec)
>      Unique URLs:   1  (1.0/sec)
>        text/html:   1  (1.0/sec)
>
>
>
>
> So I think you need to run spider.pl with your config against a test
> document and see what kind of output you get. Turn on the debugging
> options like I suggested. Ultimately, you're the only one who is going to
> discover the answer to your problem. I'm just suggesting approaches to try.
>
> --
>  Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>
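
As a rough illustration of why a link like '#tab' gives a crawler nothing
new to fetch: once the href is resolved against the page URL and the
fragment is dropped, it points back at the page itself. The sketch below is
Python, not spider.pl's actual code; the function name normalize_link and
the "seen" set are hypothetical, and it only assumes that a crawler compares
URLs after stripping the fragment.

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(page_url, href):
    """Resolve href against page_url and drop any #fragment part."""
    absolute = urljoin(page_url, href)
    url, _fragment = urldefrag(absolute)
    return url


# The page URL is taken from the debug session quoted above; the extra
# hrefs are made up for illustration.
page = "http://localhost/~karpet/tab.html"
seen = {page}  # URLs already fetched or queued

for href in ["#tab", "other.html", "other.html#section"]:
    target = normalize_link(page, href)
    if target in seen:
        print(f"skip   {href!r} -> {target} (already seen)")
    else:
        seen.add(target)
        print(f"follow {href!r} -> {target}")
```

If the tab links on the real site behave the same way, they should show up
in the SPIDER_DEBUG=links output as duplicates rather than as new documents,
which would point the search for the duplicated results elsewhere.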


Received on Thu Oct 22 18:51:19 2009