However, about 25-50 documents into my crawl, I'd start seeing "Skipped
whatever.doc due to filter 'filter_content' user supplied function #1".
Looking at Task Manager, I would see a running "catdoc" or "pdftotext"
process. After tearing my hair out for a while, I suspected a
threading issue (since I'm running an SMP system) and made some changes to
the windows_fork subroutine in Filter.pm. I eventually had success with the
following:
#====================================================
sub windows_fork {
my ( $self, @args ) = @_;
require IPC::Open2;
my ( $rdrfh, $wtrfh );
my @command = map { s/"/\\"/g; qq["$_"] } @args;
my $pid = IPC::Open2::open2($rdrfh, $wtrfh, @command );
# --- BEGIN WIN32 SMP MODS
# Wait for Process to complete before we continue (max 10 sec), else
# kill it!
use POSIX ":sys_wait_h";
my ($stiff, $tcks);
$tcks = 0;
while (($stiff=waitpid(-1,&WNOHANG))>0 && $tcks<9) {
sleep 1;
$tcks++;
}
if ($tcks>8) {
kill 9, $pid; # $pid from open2 is a plain process ID, not an object
}
# --- END WIN32 SMP MODS
# IPC::Open3 uses binmode for some reason (5.6.1)
# Assume that the output from the program will be in text
# Maybe an invalid assumption if running through a binary filter
binmode $rdrfh, ':crlf'; # perhaps: unless delete $self->{binary_output};
$self->{pid} = $pid;
return $rdrfh;
}
#====================================================
My original approaches:
1. Just use Kill: Sometimes killed the process before completion.
2. Just use Waitpid: Sometimes the process would not end (hangs the spider).
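
For anyone who wants to combine the two approaches, here is a minimal
sketch of "poll with waitpid, then kill after a timeout" as a standalone
helper. The name wait_or_kill and its arguments are hypothetical (they are
not part of Filter.pm), and I have not verified this on every Win32 Perl
build. It relies on waitpid with WNOHANG returning 0 while the child is
still running.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";    # provides WNOHANG

# Hypothetical helper, not part of Filter.pm.
# Polls the child once per second. If it is still running after
# $max_secs seconds, kills it and reaps it.
# Returns 1 if the child was no longer running when polled,
# 0 if it had to be killed.
sub wait_or_kill {
    my ( $pid, $max_secs ) = @_;
    my $waited = 0;

    # waitpid(..., WNOHANG) returns 0 only while the child is still alive.
    while ( waitpid( $pid, WNOHANG ) == 0 ) {
        if ( $waited >= $max_secs ) {
            kill 9, $pid;          # timed out: terminate the child
            waitpid( $pid, 0 );    # reap it so it does not linger
            return 0;
        }
        sleep 1;
        $waited++;
    }
    return 1;
}
```

Something like wait_or_kill( $self->{pid}, 10 ) would be the call. One
caveat: a filter that writes more than the pipe can buffer will block until
its output is read, so calling this before reading from the filter's output
handle could kill a process that is merely waiting on the reader.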
Hope this helps anyone who may have experienced a similar problem...
James Job, MCSE, MCP+I
Washington State Employment Security Department
Webmaster
Received on Sat May 22 20:41:24 2004