
Re: Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <Anthony(at)>
Date: Sat Sep 25 2004 - 00:32:35 GMT
Bill Moseley wrote:

> Some of those PDFs are killer, though.  pdftotext sucks up my CPU.
> That check for "too big" is a bit dumb -- it doesn't use the
> Content-Length header but instead downloads the file.  A new
> [version] will be created soon...

I saw your output from the test page. Looks like this is just a Windows 
issue.

> Yuck.  I wonder what resource they are talking about.
> Well, maybe you can help.
> Under non-Windows to run an external program I use fork/exec.  Can't
> do that on Windows.
> So the issue is how to run an external program *safely* under Windows.
> What I'm currently using is IPC::Open3, which under Windows is supposed
> to avoid the shell (although it seems like the data is still messed
> with by Windows).
> I'm sure there's a better way to run a program from Perl under
> Windows, but I've never found anyone that could help.

I'll take a look. I have a Perl/Win32 book at the office that I'll check 
on Monday - I'll let you know what I find there.

I did find this:

[16] Cross-platform IPC:

Usual idioms for opening pipe and socket handles are portable. However, 
inheriting or passing handles across processes is not supported under 
Windows. For this reason, use of fork and exec to share handles between 
related processes is considered unportable. Use IPC::Open2 and 
IPC::Open3 instead. For more details, see:
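For what it's worth, here is a minimal sketch of the IPC::Open3 idiom the 
docs recommend - running an external program without going through the 
shell. I'm using perl itself ($^X) as the child so the snippet runs the 
same on Unix and Win32; in the spider the command line would of course be 
pdftotext (or whatever converter) instead:

```perl
use strict;
use warnings;
use IPC::Open3;
use Symbol qw(gensym);

# Run a child without the shell.  open3() autovivifies the stdin/stdout
# handles, but the stderr handle must be pre-created, hence gensym().
my $err = gensym;
my $pid = open3(my $in, my $out, $err,
                $^X, '-e', 'print "converted text\n"');
close $in;                           # nothing to send to the child
my $text = do { local $/; <$out> };  # slurp everything the child prints
close $out;
waitpid($pid, 0);                    # reap the dead child
print $text;
```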

I did find your discussion on the list back in January regarding 
"waitpid". I put the waitpid call into the code where suggested, and the 
spider froze up while indexing the HTML page.

I also found this:

"The idea is as follows :
- each time you start a child process, it occupies a new position
   in some internal process table, which can contain maximum 64 processes
   under NT
- even if the child process terminates, that position in the internal
   process table is still occupied, and the only way to clear it and
   make it available again, is to do a waitpid(), with the process-id
   of the (now dead) child.

   That process-id is what is (originally, right after the call to
   open2()) in the $ExtPid variable.

   So you must find out, in your program, when the child process has
   finished its work, and then do a waitpid() on its process-id to
   clean the table.

   If you don't do that, then slowly the table gets full, and when
   you have 64 "dead" processes in it, you will get this error when
   you try to start another one."
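The pattern that explanation describes would look roughly like this - a 
hedged sketch, again with perl itself as the child so it's portable; in 
the spider $ExtPid would come from the open2()/open3() call that launched 
the converter:

```perl
use strict;
use warnings;
use IPC::Open2;

# Talk to a child over a pipe pair, then reap it with waitpid() so its
# entry in the (NT, max-64-slot) internal process table is released.
my $ExtPid = open2(my $rdr, my $wtr, $^X, '-pe', '$_ = uc');
print $wtr "some text\n";
close $wtr;                 # send EOF so the child can finish its work
my $result = <$rdr>;
close $rdr;
waitpid($ExtPid, 0);        # clears the dead child's table entry
```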

This makes a ton of sense.

I played around with different locations for waitpid $self->{pid}, 0; 
and could not find one that didn't cause the spider to hang or lock up.

Looks like this is an issue with not cleaning up dead child processes. 
I'm just not sure where the best place is to do the cleanup - my 
"advanced Perl" is very poor.
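One standard way to avoid the hang - just a suggestion, not something 
I've tried in the spider itself - is to reap with WNOHANG so the parent 
polls instead of blocking. waitpid(..., WNOHANG) returns the pid once 
the child has exited, 0 while it's still running, and -1 if there's no 
such child. (This sketch uses fork(), which is only emulated on Win32, 
purely to have a child to reap; the pid would really come from open3().)

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) { exit 0 }           # child: exit immediately

my $reaped = 0;
until ($reaped) {
    my $got = waitpid($pid, WNOHANG);    # non-blocking check
    if ($got == $pid) {
        $reaped = 1;                     # table slot cleared
    } else {
        select(undef, undef, undef, 0.05);  # brief pause, then re-poll
    }
}
```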

> Are you asking it to capture enough bytes?
>    StoreDescription HTML* <body> 1000000

Maybe not. I was defaulting the captured text to 320 bytes because I 
wanted it to work similarly to the Windows Index Server (which we are 
attempting to replace). I'll keep working with it.
Received on Fri Sep 24 17:32:50 2004