Re: Indexing PDFs on Windows - Revisited....

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 24 2004 - 22:57:34 GMT
On Fri, Sep 24, 2004 at 03:30:53PM -0700, Anthony Baratta wrote:
> 	http://test.portofoakland.com/PDF_TestPage.html

Seems to work OK for me.  So far I've grabbed 28:

    $ fgrep Path-Name spider.out | wc -l
    28


Some of those PDFs are killer, though.  pdftotext sucks up my CPU.


The too-big check is a bit dumb -- it doesn't use the
Content-Length header but instead downloads the whole document first.
A new spider.pl will be created soon...
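
To illustrate the idea of checking the size before downloading, here is a minimal sketch. It is written in Python rather than Perl and is not the spider.pl code; the function names and the size limit are made up for the example. It only shows the decision: if the server declares a Content-Length larger than the limit, skip the body; if the header is missing or malformed, the size is unknown and the caller still has to cap the download itself.

```python
from typing import Mapping, Optional


def declared_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared Content-Length, or None if absent or malformed.

    Header names are matched case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "content-length":
            try:
                size = int(value.strip())
            except ValueError:
                return None
            return size if size >= 0 else None
    return None


def should_skip(headers: Mapping[str, str], max_bytes: int) -> bool:
    """True only when the server declares a size above max_bytes.

    A missing header returns False: the size is unknown, so the
    caller must still enforce the limit while reading the body.
    """
    size = declared_size(headers)
    return size is not None and size > max_bytes


if __name__ == "__main__":
    # Hypothetical response headers, e.g. from a HEAD request.
    print(should_skip({"Content-Length": "5000000"}, 1_000_000))
    print(should_skip({"Content-Type": "application/pdf"}, 1_000_000))
```

The headers would typically come from a HEAD request or from the response headers before the body is read; servers are not required to send Content-Length, so this check complements a read limit rather than replacing it.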


> 
>   ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for application/pdf
>   (202 words)
> Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'.  Filter disabled:
>   -> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily unavailable at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/Filter.pm line 1158

Yuck.  I wonder what resource they are talking about.

Well, maybe you can help.

On non-Windows systems I use fork/exec to run an external program.
Can't do that on Windows.

So the issue is how to run an external program *safely* under Windows.
What I'm currently using is IPC::Open3, which under Windows is supposed
to avoid the shell (although it seems like the data is still messed
with by Windows).

I'm sure there's a better way to run a program from Perl under
Windows, but I've never found anyone who could help.
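
For comparison only, here is what "run an external program without the shell" looks like in another language. This is a Python sketch, not a Perl answer and not what Filter.pm does; run_filter is a made-up name. Passing the command as a list of arguments means no shell parses it, so spaces and characters like & in a file name reach the child unchanged. The child here is just the Python interpreter standing in for a real converter such as pdftotext.

```python
import subprocess
import sys
from typing import Sequence


def run_filter(argv: Sequence[str], data: bytes, timeout: float = 60.0) -> bytes:
    """Run argv without a shell, send data to its stdin, return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the child runs longer than timeout.
    """
    result = subprocess.run(
        list(argv),              # argument list, so no shell is involved
        input=data,              # written to the child's stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process: upper-cases whatever arrives on stdin.
    child = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    print(run_filter(child, b"hello"))
```

The timeout is there because a slow converter can otherwise stall the whole indexing run.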

> The Local Time is embedded at the first real text in the page via a time 
> function. But strangely none of the other text on the page shows up.

Are you asking it to capture enough bytes?

   StoreDescription HTML* <body> 1000000


-- 
Bill Moseley
moseley@hank.org

Received on Fri Sep 24 15:57:49 2004