Re: ReplaceRules not working as advertised

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 22 2002 - 19:01:55 GMT
At 11:42 AM 04/22/02 -0700, Colin Kuskie wrote:
>I found that I was getting "duplicate" results when indexing:
>
>1000 http://www.sunsetpres.org/Men/ "Sunset Presbyterian Men's Ministry Page" 29670
>1000 http://www.sunsetpres.org/Men/index.html "Sunset Presbyterian Men's Ministry Page" 29670

Two different URLs.


>Based on reading the docs, I expected it to merge the results for
>the two URLs, since they say:
>
>           ReplaceRules allows you to make changes to file pathnames
>           before they're indexed.  These changed file
>           names or URLs will be returned in search results.

Yes, perhaps not the best wording.

You can change the name of the path stored in the index with
ReplaceRules, but it doesn't affect what is sent to swish for indexing.
The rules are applied before indexing, not before spidering a URL.

In other words, think of it as a pipe:

   spider | swish

spider is just passing files to swish, and swish can't tell spider anything.

If you are using -S http you might be able to edit the swishspider perl
program and add "index.html" to any links that end in a slash.  But that
won't fix links that forget the trailing slash (and generate a redirect).
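
To illustrate the idea only (this is not swishspider code, and it is in Python rather than Perl), here is a minimal sketch of the link rewrite. The function name add_index and the assumption that "index.html" is the server's directory index are mine, not something swish provides:

```python
from urllib.parse import urlsplit, urlunsplit


def add_index(url, index_name="index.html"):
    """Append index_name to URLs whose path ends in a slash.

    Assumes the server really serves index_name for directory URLs;
    that depends on the server configuration.
    """
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        parts = parts._replace(path=parts.path + index_name)
    return urlunsplit(parts)


# Both forms from the report collapse to the same URL.
print(add_index("http://www.sunsetpres.org/Men/"))
print(add_index("http://www.sunsetpres.org/Men/index.html"))
# A link without the trailing slash is left alone, which is the
# redirect case this trick does not cover.
print(add_index("http://www.sunsetpres.org/Men"))
```

With both links rewritten to the same URL, the spider's own "already seen" check has a chance to skip the second one.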

You could even use MD5 in swishspider to catch documents with identical
content, but you would need to store the keys on disk since swishspider
is run for every URL.
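
A rough sketch of that MD5 approach, again in Python and only as an illustration. The function name is_duplicate and the one-digest-per-line text file are assumptions of mine; a real version in swishspider would more likely use a DBM file:

```python
import hashlib
import os


def is_duplicate(content, store_path):
    """Return True if this content's MD5 digest was already recorded.

    Digests are kept in a plain text file (one hex digest per line)
    so they survive between separate runs of the program.
    """
    digest = hashlib.md5(content).hexdigest()
    if os.path.exists(store_path):
        with open(store_path, "r", encoding="ascii") as fh:
            if any(line.strip() == digest for line in fh):
                return True
    # First time this content is seen: remember it for later runs.
    with open(store_path, "a", encoding="ascii") as fh:
        fh.write(digest + "\n")
    return False
```

Each run would call is_duplicate() on the fetched document and skip passing it to swish when it returns True. Scanning the file on every call is slow for a large site, which is another reason to prefer a keyed on-disk store.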

-S prog with spider.pl is a lot more flexible.  And probably faster, too,
since it avoids compiling a perl program for every URL.




-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Apr 22 19:02:01 2002