Indexing PDF files - reliable ?

From: David Larkin <david.larkin(at)>
Date: Thu Dec 08 2005 - 21:06:57 GMT
I've made great progress since my flurry of queries last week, and now have swish-e integrated within my application. Users can upload docs, which are indexed automatically, and others can immediately search the uploaded files. I'm really impressed with the swish-e software but, as Columbo might say, there's just one thing:

Searching PDF files appears to be less predictable than searching other file types.

I've prepared a very simple example:

54:{david}% more test.conf
IndexDir ./test
IndexFile ./test.index
55:{david}% swish-e -S fs -c test.conf
Indexing Data Source: "File-System"
Indexing "./test"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 63,777 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
63,777 unique words indexed.
4 properties sorted.
3 files indexed.  1,536,853 total bytes.  210,804 total words.
Elapsed time: 00:00:04 CPU time: 00:00:03
Indexing done!

The directory test holds 3 PDF files from random sources that I've copied into it.

When I search for the word 'the', it is found only in the first of these 3 files, even though the word appears in all 3.

59:{david}% file *
Samba-Developers-Guide.pdf: PDF document, version 1.4
isj2001-final.pdf:          PDF document, version 1.2
spm.pdf:                    PDF document, version 1.3

Is it due to the PDF version number?

Or is it down to how the PDF was generated? I know I converted isj2001-final.pdf from PostScript using the ps2pdf utility.

Any ideas?
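One way to narrow this down would be to check, outside swish-e, whether the word can be extracted as plain text from each file. The following is only a rough sketch: it assumes the pdftotext utility is installed (it is not part of the setup shown above), and count_word is a helper name made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count case-insensitive whole-word matches of $1
# in the text read from stdin (relies on grep -o, available in GNU and BSD grep).
count_word() {
    grep -o -i -w -- "$1" | wc -l | tr -d ' '
}

# Report the count for each PDF in ./test.
# Assumption: pdftotext is on the PATH; if it is not, or if no PDFs are
# found, the loop does nothing.
word=the
if command -v pdftotext >/dev/null 2>&1; then
    for f in ./test/*.pdf; do
        [ -f "$f" ] || continue
        # "pdftotext FILE -" writes the extracted text to stdout
        printf '%s: %s\n' "$f" "$(pdftotext "$f" - 2>/dev/null | count_word "$word")"
    done
fi
```

If a file reports 0 here, the text is probably not extractable from that PDF as it stands, which would point at how the file was generated rather than at the index itself.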

Received on Thu Dec 8 13:07:02 2005