
Re: Fwd: swish fails to close file handles /pipes with prog

From: Oscar Marín <oscarmmiro(at)not-real.terra.es>
Date: Tue Jul 23 2002 - 11:36:55 GMT
Khalid Shukri wrote:

> I have a rather weird problem with swish-e:
> I'm trying to index a lot of sites (about 45000) using the prog method with
> the spider.pl included in the Windows binary distribution (on a DSL line),
> but I want at most 3 pages from each site. I tried to do this on an old PII
> with 64MB RAM running Windows 2000. It started well (although slowly), but it
> became slower and slower while progressing through the 45000 URLs, and what's
> more, after some time it started reporting "Skipped" for about every URL. I
> thought this might be a problem with insufficient memory, swapping, etc. I then
> divided the whole amount into chunks of 1000, which I indexed separately. This
> worked reasonably well, although it was still slow. Then I got my brand new P4
> with 2 GB of RAM and a 1 GHz CPU :-) on which I installed Debian. I then tried
> to search my old indexes from the Windows machine, but swish-e always crashed
> on certain search words. (This is the second problem: either the index files of
> the Windows version are different from those of the Linux version, or there's a
> bug in the Linux version.) I then indexed again, and on my new supercomputer
> the same thing happened as on the old Windows machine. I put an "open (LOG,file);
> print LOG something; close LOG;" in the test_url callback routine of the spider
> to find out what's happening, but at a certain point the program stopped writing
> anything to the file, saying "Can't write to closed file handle". I then tried
> again to do the indexing in chunks of 1000, but this time started all 45
> processes in parallel. After some time, I tried to open one of the log files to
> see what's happening, but got the error "Too many open files".
> So, my idea is the following: swish-e seems to open a file handle (or a pipe to
> the spider?) each time it moves to the next URL, but fails to close it
> properly afterwards.
> Any help/suggestions available?
> Thanks in advance,
> Khalida

Apparently, you have too many open files on the system. This is a very common
problem when opening lots of sockets (spidering): you start opening sockets,
writing downloaded contents to files, and suddenly you've run out of file
descriptors. The default value (in Red Hat Linux) is 4096, which is certainly a
low value for spidering and indexing at the same time.
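
If you want to watch it happen, you can count how many descriptors one of the
running processes is holding; the PID below is just a placeholder for one of
your spider/swish-e processes:

ls /proc/12345/fd | wc -l    # number of files/sockets held open by process 12345

If that number keeps climbing while the spider runs, the handles really are
piling up.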
You can change the number of file descriptors system-wide. The exact method
depends on which distribution you are using. If you want to know what your
limit is, just type:

cat /proc/sys/fs/file-max

and to see how many file descriptors are actually in use:

watch cat /proc/sys/fs/file-nr
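
Note that the per-process limit is separate from that system-wide figure, and
the exact behaviour depends on your shell and distribution; in bash you can
check it with:

ulimit -n    # per-process descriptor limit

(on many systems it defaults to 1024).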

(Be aware that, in order for file limit changes to take effect, you must also
raise the inode limit, typically to 3-4 times the file descriptor limit.)

To change your limits, please visit:

http://www.volano.com/linux.html

(at the end of the page there's a "file descriptors" section)

and

http://linuxperf.nl.linux.org/general/kerneltuning.html

(search for the string "Increasing the Maximum number of file handles and the
inode cache").

It seems like your problem is very common and easy to solve.

Hope that helps, and if so, please let me know; I'm about to write a more
down-to-earth Linux tuning guide.

bye,

Oscar Marin Miro

(By the way... I like your name!!)
Received on Tue Jul 23 11:40:26 2002