
Re: SWISH-E index limits

From: Linda DeBoer <lindad(at)not-real.geac.com>
Date: Mon Apr 22 2002 - 18:11:24 GMT
G'day
	Thanks much. I am currently trying a few things here again. In a
nutshell, the problem is that I told it to go to "www.leftmind.net/~adb/treb";
it spidered the URLs fine, then found the "Back to Anthony's Homepage" link and
then appears to have gone back to "www.leftmind.net/~adb/treb". The trace is
long. I am working on setting up a shorter version for debugging. I also
found the section that says "When posting, please include" (newbie) and am
trying to line up my dragons....;-)

Just for the record:
-  I am using SWISH-E 2.0. 
-  Earlier today I used wget to make sure it could grab it. 
-  I've just finished spidering the local version I grabbed. It's ok.

I'll do some more reading, playing and likely load 2.1 to see how it's
different.

 
>-----Original Message-----
>From: Gerald Klaas [mailto:gklaas@arb.ca.gov]
>Sent: April 22, 2002 10:51
>To: Multiple recipients of list
>Subject: [SWISH-E] Re: SWISH-E index limits
>
>
>
>
>Bill Moseley wrote:
>> 
>> At 10:03 AM 04/22/02 -0700, Linda DeBoer wrote:
>> >       Whenever I run swish-e against a site which has a url pointing back
>> >to the home page, it loops.
>> 
>> You don't mean "loop" in that it indexes the same URL more than once, right?
>> 
>
>It might if there is an equivalent URL not configured with the
>EquivalentServer directive.  I.e.  http://www.sacto.com/ and
>http://sacto.com/ are two URLs for the same page. So wouldn't you need
>this in your config file?
>EquivalentServer http://sacto.com http://www.sacto.com
>
>Or if the links back to the homepage are not consistent, you might
>also wind up with things like these being indexed separately:
>http://sacto/
>http://sacto.com/index.htm
>http://www.sacto.com/index.htm
>And then there is the possibility of case-insensitive paths if the host is MS-based:
>http://www.sacto.com/Index.htm
>http://www.sacto.com/INDEX.htm
>http://www.sacto.com/INDEX.HTM
>
>
>> But, if you are using 2.1-dev, and the -S prog method with spider.pl then
>> it's rather easy to do this.
>> 
>> In the config you can say:
>> 
>>   test_url => sub {
>>       my $uri = shift;
>>       return $uri->path =~ m!^/some/path!;
>>   }
>> 
>
>I do this. Just like Bill says, it works like a charm.   :-)
>If you want to see how I use this, you can check the 
>"spider configuration template" link from
>http://www.arb.ca.gov/db/search/swishe/swishe.htm
>
>> Another option, which would be fast, would be to run another web
>> server/virtual host on a different port, and change the document root.
>> 
>Interesting.  Then you'd use the ReplaceRules directive to
>rewrite the URL as it goes into the index? 
>
>Gerald
>
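
To make the duplicate-URL point above concrete, here is a minimal Python
sketch of how the variants Gerald lists could be collapsed to one key before
a crawler decides whether a page has already been seen. It is only an
illustration, not how swish-e or spider.pl work internally. The names
EQUIVALENT_HOSTS, INDEX_NAMES, canonical_url and unique_pages are
hypothetical, and treating index.htm as the default document is an
assumption about the server.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: each key is treated as the same server as its value.
EQUIVALENT_HOSTS = {
    "sacto.com": "www.sacto.com",
    "sacto": "www.sacto.com",
}

# Assumed default document names; a real server may use others.
INDEX_NAMES = {"index.htm", "index.html"}


def canonical_url(url, case_insensitive_paths=False):
    """Return one canonical form for URLs that name the same page."""
    parts = urlsplit(url)
    # Host names are case-insensitive; map known aliases to one name.
    host = (parts.hostname or "").lower()
    host = EQUIVALENT_HOSTS.get(host, host)
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Only fold case when the server is known to ignore it.
    if case_insensitive_paths:
        path = path.lower()
    # Treat ".../index.htm" as the directory URL ".../".
    head, _, tail = path.rpartition("/")
    if tail in INDEX_NAMES:
        path = head + "/"
    # Drop the fragment; it never changes which document is fetched.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def unique_pages(urls, case_insensitive_paths=False):
    """Keep the first URL seen for each canonical form, in input order."""
    seen = set()
    kept = []
    for url in urls:
        key = canonical_url(url, case_insensitive_paths)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

With case_insensitive_paths=True, every URL in Gerald's two lists reduces to
http://www.sacto.com/, so only the first one would be kept. With the default
of False, the mixed-case Index.htm variants stay distinct, which is the safe
choice when the host's file system is case-sensitive.
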
Received on Mon Apr 22 18:12:50 2002