At 11:46 AM 09/27/02 -0700, Bill Moseley wrote:
>At 02:33 PM 09/27/02 -0400, Jeffrey.Grunstein@ny.frb.org wrote:
>>
>>We have lots of PDF files and some of them are very big. None have a
>>metadata description set (the people
>>who created them are lazy).
>>
>>Will any of the filters parse the document and take the first n characters,
>>like what StoreDescription does?
>
>Yes.
>
>In the simple case you can do something like:
>
> FileFilter .pdf pdftotext "'%p' -"
> IndexContents TXT .pdf
> StoreDescription TXT 1000

BTW -- if your PDF files are *very* large then you might try using
TruncateDocSize:

   http://swish-e.org/current/docs/SWISH-CONFIG.html#item_TruncateDocSize

I haven't used that directive in a long time, so let us know if anything
blows up...

You might also try using TXT2 instead of TXT. TXT reads the entire doc
into memory, whereas TXT2 reads in chunks. So using TruncateDocSize with
TXT2 might avoid reading more data than you need to read.
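
Putting those pieces together, an untested sketch of the config might look
like the lines below. It assumes TruncateDocSize takes a byte count, that it
applies to filtered output, and that StoreDescription accepts TXT2 the same
way it accepts TXT. The 100000 limit is an arbitrary number picked for
illustration, so check all of this against the docs linked above:

   # Run each PDF through pdftotext, sending the text to stdout
   FileFilter .pdf pdftotext "'%p' -"

   # Parse the filtered output as plain text, read in chunks (TXT2)
   IndexContents TXT2 .pdf

   # Keep the first 1000 characters as the description
   StoreDescription TXT2 1000

   # Stop reading each document after roughly 100000 bytes
   TruncateDocSize 100000

Keep in mind that anything past the truncation point won't be indexed, so
words that only appear late in a big PDF won't be searchable.
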
--
Bill Moseley
mailto:moseley@hank.org
Received on Fri Sep 27 19:02:57 2002