Filter.pm & Windows Thread safety

From: Job, James <JJob(at)not-real.ESD.WA.GOV>
Date: Sun May 23 2004 - 03:41:24 GMT
However, about 25-50 documents into my crawl, I'd start seeing "Skipped
whatever.doc due to filter 'filter_content' user supplied function #1".

Looking at Task Manager, I would see a running "catdoc" or "pdftotext"
process.  After tearing my hair out for a while, I suspected there might be a
threading issue (since I'm running an SMP system), and made some changes to
the windows_fork subroutine in Filter.pm.  I eventually had success with the
following:
#====================================================
sub windows_fork {
    my ( $self, @args ) = @_;

    require IPC::Open2;
    my ( $rdrfh, $wtrfh );

    my @command = map { s/"/\\"/g; qq["$_"] }  @args;

     
    my $pid = IPC::Open2::open2($rdrfh, $wtrfh, @command );

    # --- BEGIN WIN32 SMP MODS
    # Wait for Process to complete before we continue (max 10 sec),
    # else kill it!
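    use POSIX ":sys_wait_h";
    my ( $stiff, $tcks ) = ( 0, 0 );
    # waitpid with WNOHANG returns 0 while the child is still running
    while ( ( $stiff = waitpid( $pid, WNOHANG ) ) == 0 && $tcks < 10 ) {
        sleep 1;
        $tcks++;
    }
    if ( $stiff == 0 ) {
        # $pid is a plain process ID from open2, not an object
        kill 9, $pid;
    }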
    # --- END WIN32 SMP MODS
    # IPC::Open3 uses binmode for some reason (5.6.1)
    # Assume that the output from the program will be in text
    # Maybe an invalid assumption if running through a binary filter

    binmode $rdrfh, ':crlf';  # perhaps: unless delete $self->{binary_output};

    $self->{pid} = $pid;

    return $rdrfh;
}
#====================================================

My original approaches:

1.  Just use kill:  Sometimes this killed the process before it completed.
2.  Just use waitpid:  Sometimes the process would not end (which hangs the spider).
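
For anyone who wants to try the combined approach outside of Filter.pm, here
is a minimal standalone sketch of "wait up to N seconds, then kill".  The
helper name wait_or_kill and its $timeout argument are made up for
illustration; they are not part of Filter.pm or SWISH-E.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Hypothetical helper (not part of Filter.pm): poll a child process once
# per second until it exits or $timeout seconds have passed, then kill it
# if it is still running.
# Returns 1 if the child ended on its own, 0 if it had to be killed.
sub wait_or_kill {
    my ( $pid, $timeout ) = @_;

    my $waited = 0;
    while (1) {
        # With WNOHANG, waitpid returns 0 while the child is still running,
        # the pid once it has exited, and -1 if there is no such child.
        my $reaped = waitpid( $pid, WNOHANG );
        return 1 if $reaped != 0;

        last if $waited >= $timeout;
        sleep 1;
        $waited++;
    }

    kill 'KILL', $pid;     # timed out: terminate the child
    waitpid( $pid, 0 );    # reap it so it does not linger
    return 0;
}
```

Something like wait_or_kill( $pid, 10 ) would match the 10 second limit used
above.  One caveat: if the child writes more output than the pipe can buffer
before anything reads it, it will sit blocked until the timeout, so reading
the output first and waiting afterwards may be the safer order.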

Hope this helps anyone who may have experienced a similar problem...

James Job, MCSE, MCP+I
Washington State Employment Security Department
Webmaster



Received on Sat May 22 20:41:24 2004