
Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <anthony(at)not-real.2plus2partners.com>
Date: Wed Sep 22 2004 - 22:22:25 GMT
I finally found time to return to using swish-e in a Windows 
environment. Hopefully the information I can provide here will trigger an 
idea and/or suggest a path I can pursue to complete this project. (If you need 
any more information about the setup and usage, please let me know.)

OS: Windows 2000, fully patched.
Swish-e: v 2.4.2
pdftotext: v 3.00
pdfinfo: v 3.00
perl: v5.6.1 ActiveState build 638

When I run "swish-filter-test", pdftotext is found and loaded successfully.
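
As a further sanity check (the PDF path here is hypothetical), pdftotext can 
be run by hand to confirm it writes the extracted text to stdout, which is 
what the filter layer expects:

```shell
# The trailing "-" asks pdftotext to write the extracted text to stdout.
pdftotext "C:\temp\sample.pdf" -

# pdfinfo should print the document metadata (Title, Author, etc.).
pdfinfo "C:\temp\sample.pdf"
```

If either command hangs or writes to a file instead of stdout, that would 
point at the pdftotext build rather than at swish-e.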

I've tried two different approaches to this issue:

OPTION ONE

index_port.bat
	"C:\Program Files\SWISH-E\swish-e.exe"
		-S prog -v 3 -c
		"C:\Program Files\SWISH-E\indexes\Port\port.config"
		-f "C:\Program Files\SWISH-E\indexes\Port\index.swish-e"

port.config

	DefaultContents HTML2
	IndexContents HTML* .asp .htm .html .shtml .pdf
	StoreDescription HTML* <body> 320

	IndexDir perl.exe
	TmpDir "C:\\Progra~1\\SWISH-E\\indexes\\Tmp\\"
	SwishProgParameters
		"C:\\Progra~1\\SWISH-E\\lib\\swish-e\\spider.pl"
		"C:\\Progra~1\\SWISH-E\\indexes\\Port\\port.spider"
	ReplaceRules remove http://local.dev.port.com

port.spider

     @servers = (
         {
             base_url    => 'http://local.dev.port.com',
             email       => 'tony@2plus2.com',
             delay_sec   => 1,
             debug       => DEBUG_URL | DEBUG_INFO | DEBUG_FAILED | 
DEBUG_SKIPPED,
             # other spider settings described below
         },
     );

The output for this option is a bit strange. While it attempts to index 
the site, it fails to record word counts for pages after the 16th link. That 
link is a PDF, and the spider appears to lock up while analyzing it; although 
it fetches all the other links it finds, it fails to index those pages.

 >> +Fetched 1 Cnt: 15 http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
	200 OK text/html 23871
	parent:http://local.dev.port.com
	! Found 0 links in http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
	(585 words)
	http://local.dev.port.com/newsroom/pressrel/pressrel_163.asp
		- Using HTML2 parser - sleeping 1 seconds
	
 >> +Fetched 1 Cnt: 16 http://local.dev.port.com/pdf/publ_notice2.pdf
	200 OK application/pdf 44239
	parent:http://local.dev.port.com
	(1357 words)
	http://local.dev.port.com/pdf/publ_notice2.pdf
		- Using HTML2 parser -  (59 words) sleeping 1 seconds
		
 >> +Fetched 1 Cnt: 17 http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
	200 OK text/html 20249
	parent:http://local.dev.port.com
	! Found 0 links in http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
		sleeping 1 seconds

 >> +Fetched 1 Cnt: 18 http://local.dev.port.com/portnyou/offi_seni.asp
	200 OK text/html 30543
	parent:http://local.dev.port.com
	! Found 1 links in http://local.dev.port.com/portnyou/offi_seni.asp
		sleeping 1 seconds

As you can see, from #17 on there are no word counts. At the end of #16 
there is this weirdness: "- Using HTML2 parser -  (59 words) sleeping 1 
seconds". It seems like commands are stepping on each other. Then, from #17 
on, it fails to index the fetched pages. The summary of work looks like this:

	Summary for: http://local.dev.port.com
	Connection: Close:         991  (0.9/sec)
	       Duplicates:      21,123  (19.5/sec)
	   Off-site links:       3,501  (3.2/sec)
	          Skipped:          21  (0.0/sec)
	      Total Bytes: 138,730,121  (128098.0/sec)
	       Total Docs:         968  (0.9/sec)
	      Unique URLs:         992  (0.9/sec)

	Removing very common words...
	no words removed.
	Writing main index...
	Sorting words ...
	Sorting 1,459 words alphabetically
	Writing header ...
	Writing index entries ...
	  Writing word text: Complete
	  Writing word hash: Complete
	  Writing word data: Complete
	1,459 unique words indexed.
	5 properties sorted.
	16 files indexed.  435,308 total bytes.  5,753 total words.
	Elapsed time: 00:18:06 CPU time: 00:18:06
	Indexing done!

Even though it found 992 pages, it only indexed 16. I can't get swish to 
throw any other errors that appear relevant to the issue. Any suggestions 
on where to start looking are appreciated.

OPTION TWO

index_port.bat

	"C:\Program Files\SWISH-E\swish-e.exe"
		-S http -v 3 -c
		"C:\Program Files\SWISH-E\indexes\Port\port.config"
		-f "C:\Program Files\SWISH-E\indexes\Port\index.swish-e"

port.config

	IndexDir http://local.dev.port.com
	TmpDir "C:\\Progra~1\\SWISH-E\\indexes\\Tmp\\"

	IndexOnly .asp .htm .html .shtml .pdf
	FileFilter .pdf pdftotext "'%p' -"
	IndexContents HTML* .asp .htm .html .shtml *.pdf
	StoreDescription HTML* <body> 320
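
For reference, my understanding of the FileFilter line is that %p expands to 
the path of the temporary file the spider saved the fetched document to, and 
swish-e indexes whatever the filter writes to stdout. So for each PDF it 
should effectively be running something like this (temp file name taken from 
the error output further down):

```shell
# "%p" expanded to the temp file path; "-" sends pdftotext's
# extracted text to stdout for swish-e to index.
pdftotext 'C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents' -
```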

This is where it gets stranger. Swish will index all the .asp/.htm files 
fine, but it fails to open the temp files for any PDFs it encounters.

sleeping 5 seconds before fetching http://local.dev.port.com/pdf/database_ucp.pdf
Now fetching [http://local.dev.port.com/pdf/database_ucp.pdf]...Status: 200. application/pdf
  - Using DEFAULT (HTML2) parser
  - Error: Couldn't open file
    ''C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents''
  (no words indexed)

retrieving http://local.dev.port.com/pdf/real_ccr.pdf (5)...
sleeping 5 seconds before fetching http://local.dev.port.com/pdf/real_ccr.pdf
Now fetching [http://local.dev.port.com/pdf/real_ccr.pdf]...Status: 200. application/pdf
  - Using DEFAULT (HTML2) parser
  - Error: Couldn't open file
    ''C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents''
  (no words indexed)

The file it is unable to open always has the same name. I've been able to 
pause the processing and confirm that swishspider@1108.contents does exist in 
the C:\Progra~1\SWISH-E\indexes\Tmp\ directory, and its permissions are 
not out of whack. My guess is that pdftotext is not feeding the processed 
PDF back to the spider correctly, and therefore it either sets up a 
zero-length file or is actually saving to "somewhere else".
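
One way to test that guess would be to pause the indexer and run the filter 
command by hand against the stalled temp file (this assumes the file really 
is a raw PDF at that point):

```shell
# If the temp file is a complete PDF, this should print its text;
# a zero-length or truncated file would explain the "Couldn't open file" error.
pdftotext 'C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents' -
```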

The summary of work looks like this:

	Removing very common words...
	no words removed.
	Writing main index...
	Sorting words ...
	Sorting 12,770 words alphabetically
	Writing header ...
	Writing index entries ...
	  Writing word text: Complete
	  Writing word hash: Complete
	  Writing word data: Complete
	12,770 unique words indexed.
	5 properties sorted.
	980 files indexed.  324,873,230 total bytes.  278,746 total words.
	Elapsed time: 00:18:32 CPU time: 00:18:32
	Indexing done!

This time it successfully indexes the "other" pages, but of course the PDFs 
are not indexed.

Common to both examples, the StoreDescription directive does not appear to be 
acted on. I get no descriptions via <swishdescription>; instead I get some 
date/time string (e.g. "Local Time : 1:12:01 PM PT"). Nor does swish 
appear to honour the IndexOnly / IndexContents directives: it attempts to 
index the PDFs anyway, grabbing the file and then erroring on "invalid 
mime type". Is this correct behaviour? I would think that swish would skip 
the file because the .pdf extension is not in the approved list.


If anyone wants to hit/index an example of the site in question:

	http://test.portofoakland.com

This URL is a replica of the live site and should respond exactly the same.

--
Anthony Baratta
Received on Wed Sep 22 15:22:46 2004