Re: Logging the indexing in a file

From: <moseley(at)not-real.hank.org>
Date: Tue Aug 19 2003 - 12:39:25 GMT
On Tue, Aug 19, 2003 at 01:18:38AM -0700, Bucharow Leonard wrote:
> 
> Hi Bill and Co.,
> 
> first I may not understand, what you mean with:
> > How do humans without javascript follow those links?
> anyway I unfortunately can't influence the humans to create links with
> HTML/XML or th. else then java-plug-in. 

I see people use javascript for links where normal html links work fine.  
I use one online (bill paying) service where they use javascript links 
for a lot of their navigation.  Half the time they don't work right and 
my forward and back buttons don't work as expected.  And I do turn off 
javascript at times, and it always takes me a few minutes to figure out 
why things are not working.

> Second I have two another questions:
> 
> Can SWISH-E write IndexReport in a file (f.e. during executing a cron job)?
> If yes, how?

My opinion is that cron jobs are better if they only report errors.  
Otherwise you start ignoring the logs.

So I use

    swish-e -c config -v0

Otherwise, pipe swish-e's output to grep or awk or perl and extract 
the data you want logged.  Swish writes \r to overwrite the percentage 
complete, so writing the raw output straight to a file might not look 
too good -- which is why I suggest piping it through some program that 
filters out the data you want to keep.
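
A minimal sketch of such a filter, assuming the progress updates show up 
as \r-separated segments within a line and that only the last segment is 
worth keeping.  The function name and the log path are made up for the 
example; adjust the filter to whatever the output really looks like on 
your system.

```shell
# Keep only the final state of each line.  Anything before the last
# carriage return was overwritten on the terminal, so it is dropped.
strip_progress() {
    awk '{
        n = split($0, parts, "\r")
        # Step back over empty trailing segments (a line ending in \r).
        while (n > 1 && parts[n] == "") n--
        # Blank lines are dropped as well.
        if (parts[n] != "") print parts[n]
    }'
}

# Hypothetical cron usage; the config name and log path are placeholders:
# swish-e -c config | strip_progress >> /var/log/swish-index.log
```

From there you can add a grep or a second awk pattern to keep only the 
lines you care about.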


> 
> I'm trying to spider not the entire web server but only a web folder (f.e. I
> may not to spider the apache manual).
> In the SwishSpiderConfig.pl I've set the option:
> base_url => http://host/intranet/
> But spider.pl indexes the entire web server! Do I something wrong?
> I've excluded the folder with robots.txt, but I don't understand, why can't
> I set up the folder to index? 

The only limitation is that it indexes just one server (host name) at a
time (per section of the spider config file).  base_url is only the 
starting point, so if you set 

  base_url => 'http://host/foo_directory',

there's nothing to keep it from following links into any other directory 
on "host".  But you can use robots.txt to limit what is indexed.  You can 
also set up a "test_url()" callback function to limit it to, say, just 
the "foo_directory" directory.  See:

 http://swish-e.org/dev/docs/spider.html#CALLBACK_FUNCTIONS
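
Something along these lines in the server section of the spider config 
should do it.  This is an untested sketch: the host and path are 
placeholders taken from your example, the other settings your config 
needs are left out, and the docs above are the authority on the details.  
The callback is passed a URI object and the server hash, and a true 
return value means "go ahead and fetch this URL".

```perl
{
    base_url => 'http://host/intranet/',

    # ... your other settings for this server go here ...

    # Only follow URLs whose path is under /intranet/ (placeholder path).
    test_url => sub {
        my ( $uri, $server ) = @_;
        return $uri->path =~ m{^/intranet/};
    },
},
```

Anything the callback rejects is skipped before it is requested, so 
the apache manual would never be fetched.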




-- 
Bill Moseley
moseley@hank.org
Received on Tue Aug 19 12:39:46 2003