Re: [swish-e] pdftotext

From: John Laurie <j.laurie(at)not-real.auckland.ac.nz>
Date: Sun Jun 28 2009 - 22:41:11 GMT

Hi all,

I'm having the same problem as Thomas Dowling, with pdftotext creating
unwanted spaces in the text it extracts from PDF documents. It's a
crippling problem for a database that's aiming for 100% accuracy.

The search in the PDF's native interface works fine, but the Swish-e
based search uses text that's full of words with gaps between the
letters, e.g. "k o o t i" instead of "kooti".
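
One way to check which documents are affected is to scan the text that
pdftotext produces for runs of single letters separated by single
spaces. The following is only a minimal sketch in Python, assuming the
extracted text is already available as a string. The function names and
the three-letter threshold are illustrative choices, not anything that
Swish-e or pdftotext provides. Collapsing the runs is a heuristic: if two
spaced-out words are themselves separated by only one space, they will
be merged into one word.

```python
import re


def spaced_run_pattern(min_letters=3):
    # Matches a run of at least `min_letters` single letters, each separated
    # by exactly one space, e.g. "k o o t i". [^\W\d_] means "any letter",
    # including accented ones; \b keeps ordinary words from matching.
    return re.compile(r"\b[^\W\d_](?: [^\W\d_]){%d,}\b" % (min_letters - 1))


def find_spaced_runs(text, min_letters=3):
    # Return every letter-spaced run found in the text, in order.
    return [m.group(0) for m in spaced_run_pattern(min_letters).finditer(text)]


def collapse_spaced_runs(text, min_letters=3):
    # Remove the spaces inside each run (heuristic, see the caveat above).
    return spaced_run_pattern(min_letters).sub(
        lambda m: m.group(0).replace(" ", ""), text
    )
```

Running find_spaced_runs over the extracted text of each PDF before
indexing would at least show which files have the problem.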

Our Swish-e is the latest version, with pdftotext 3.02.

I've only noticed it recently. It's a big problem only with some fonts,
or perhaps with newer versions of FineReader and Adobe.

There is an example on our Early New Zealand Books website at
http://www.enzb.auckland.ac.nz/

Click on Search >> Go to Advanced Search
http://www.enzb.auckland.ac.nz/advsearch.php?action=cs

Click on Limit by Title and click on the + sign beside 1887 - Gudgeon,
T. W. The Defenders of New Zealand

Tick the box beside [Pages 300-335]

In Search For, enter k o o t i and choose "This phrase" from the
dropdown menu, in Full Text.

Click on 1[pages 300-335] to view the PDF. You can copy and paste text
from the PDF with no gaps between the letters.

Try the same search for kooti, or for t h e or a n d.

N.B. Te Kooti is a famous Maori leader and prophet who led a bitter
struggle against the colonial government in New Zealand in the 1860s -
an antipodean Geronimo. The name is a transliteration into Maori of the
missionary name Coates.

John

******************************************** 
John Laurie 
Digital Initiatives Librarian 
Digital Services
Level 3, General Library 
University of Auckland
Phone (09)3737599 x 85773 
Email j.laurie@auckland.ac.nz 
*************************************************

Received on Sun Jun 28 18:41:13 2009