On Tue, Dec 16, 2003 at 02:20:11PM -0800, Brad_Horstkotte@capgroup.com wrote:
> I've been poking around trying to figure out how to get PDF indexing to
> work, and haven't had any luck - I'm running into the same problem which
> was discussed on this thread (null characters in the PDF files being
> replaced with line feed characters, and later on the PDF is seen as
> invalid):
>
> http://swish-e.org/archive/4511.html
>
> Has this problem been fixed?
I think so. But that's not to say it isn't happening somewhere else in
the chain. When using a filter with -S prog, swish-e doesn't replace
nulls with \n. But swish-e does read the entire file into memory and
then write it out to a temp file before calling the filter, so
something could be happening there -- or maybe it's how the spider is
fetching it.
> The PDFs convert fine when running _pdf2html.pl from the command line on
> the file, but fail when converted via the spider.
Well, what I'd do is edit _pdf2html.pl and do something like:
system("copy $file c:\\test.pdf");
assuming that works on Windows (the backslash has to be doubled, since
"\t" inside a double-quoted Perl string is a tab). That will let you see
whether the copy is the same as the original (you can check by file size
-- I'm not sure what Windows provides for comparing files).
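File size alone won't catch a null byte swapped for a line feed, since the length stays the same. On Windows, `fc /b` does a binary comparison; as an alternative, here is a minimal Python sketch that reports the first offset where two files differ. The function name and the example paths are made up for illustration and are not part of swish-e.

```python
from pathlib import Path


def compare_files(original, copy):
    """Compare two files byte for byte.

    Returns (True, None) if identical, otherwise (False, offset) where
    offset is the first position at which the files differ.
    """
    a = Path(original).read_bytes()  # read_bytes() never translates newlines
    b = Path(copy).read_bytes()
    if a == b:
        return True, None
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return False, i
    # One file is a prefix of the other: they diverge where the shorter ends.
    return False, limit


# Hypothetical usage:
#   same, offset = compare_files(r"c:\docs\test.pdf", r"c:\test.pdf")
#   print("identical" if same else f"first difference at byte {offset}")
```

If the first difference lands on a byte that is 0x00 in the original and 0x0A in the copy, that points at the null-to-newline replacement described in the archived thread.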
The bit of debugging I'd do is run the spider to just fetch the pdf file
and save its output to a file. Look at the first few lines of the file
and see if the content-length is what you expect. No, that might not
work. Windows uses \r\n on disk but inside perl and C the data only
contains \n so the content-length might be different.
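To make the \r\n point concrete: when data is written through a Windows text-mode file handle, each \n goes to disk as \r\n, so the file grows by one byte per line feed. The sketch below only does that arithmetic; the function name is hypothetical, and it assumes plain C-runtime-style text mode with no other translation.

```python
def text_mode_written_size(data: bytes) -> int:
    """Size on disk if `data` were written in Windows text mode.

    Assumes every LF byte is expanded to CR LF on output, which is what
    text-mode writes do; binary-mode writes leave the length unchanged.
    """
    return len(data) + data.count(b"\n")


# A 4-byte buffer with two line feeds would occupy 6 bytes on disk.
sample = b"a\nb\n"
print(len(sample), text_mode_written_size(sample))
```

So a Content-Length that is off by exactly the number of line feeds in the data would suggest a text-mode handle somewhere in the chain rather than real corruption.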
I have also written the output from the spider to a file, used an editor
to remove the header lines, and then compared the files. It's a pain.
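The header stripping could be scripted instead of done in an editor. This is a rough Python sketch, assuming the saved file holds a single document in the -S prog format (header lines, a blank line, then the content) and that the spider was run without a filter so the body is the raw PDF. The function name is made up for the example.

```python
def split_prog_record(raw: bytes):
    """Split one saved spider record into (headers, body).

    Headers end at the first blank line; both LF and CRLF line endings
    are accepted, since the file may have been written on Windows.
    """
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos != -1:
            candidates.append((pos, len(sep)))
    if not candidates:
        raise ValueError("no blank line found after the headers")
    pos, sep_len = min(candidates)  # earliest blank line wins
    header_block, body = raw[:pos], raw[pos + sep_len:]

    headers = {}
    for line in header_block.splitlines():
        name, _, value = line.partition(b":")
        headers[name.strip().decode("latin-1")] = value.strip().decode("latin-1")
    return headers, body


# Hypothetical usage (file names are placeholders):
#   raw = open("spider_output.bin", "rb").read()
#   headers, body = split_prog_record(raw)
#   print(headers["Content-Length"], len(body))
#   print(body == open("test.pdf", "rb").read())
```

Comparing `len(body)` against the Content-Length header, and `body` against the original file, shows whether the spider's copy already differs before swish-e or the filter ever sees it.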
> I saw SWISH::Filter mentioned as an alternative, but so far have avoided it
> since I'm a perl dolt, and it looked like less of a turnkey alternative.
No, it's more turnkey. If you use the "default" mode, the spider should
know how to decode the PDF:
$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/test.pdf | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Path-Name: http://localhost/apache/test.pdf
Content-Length: 12593
Last-Mtime: 1064946675
Document-Type: HTML*
<html>
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
Here it is on Windows:
E:\SWISH-E>perl lib/swish-e/spider.pl default "http://bumby/apache/test.pdf" | head
lib/swish-e/spider.pl: Reading parameters from 'default'
Can't use keep-alive: conn_cache method not available
Summary for: http://bumby/apache/test.pdf
Total Bytes: 12,579 (12579.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
Path-Name: http://bumby/apache/test.pdf
Content-Length: 12579
Last-Mtime: 1064946675
Document-Type: HTML*
<html>
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="03/21/03 21:42:23">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
And even piping to swish:
E:\SWISH-E>perl lib/swish-e/spider.pl default "http://bumby/apache/test.pdf" | swish-e -S prog -i stdin
lib/swish-e/spider.pl: Reading parameters from 'default'
Can't use keep-alive: conn_cache method not available
Summary for: http://bumby/apache/test.pdf
Total Bytes: 12,579 (12579.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 813 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
813 unique words indexed.
4 properties sorted.
1 file indexed. 12579 total bytes. 2299 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!
--
Bill Moseley
moseley@hank.org
Received on Wed Dec 17 01:16:31 2003