Swish-e PDF titles in search results

From: Luke Simmons <lukes(at)not-real.deeson.co.uk>
Date: Fri Jul 14 2006 - 15:36:11 GMT
Hi Bill,

I got the config to alias the meta name swishtitle to title:

[root@tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | grep title

<title>Jan Feb 06</title>
<meta name="title" content="Jan Feb 06">

But without a filter, the HTML output from the PDF does not appear to
be parsed into the index. After indexing, nothing from the PDF shows
up in the search (CGI), including the title.

Do I need to add Pdf2HTML as a file filter in the config, and also
make the changes that Peter Karman suggested? (Thanks, Peter.)

FileFilter .pdf /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm    # Does this or anything need to go here?
DefaultContents HTML*
StoreDescription HTML* <body> 200000
# MetaNames swishdefault

Am I right to believe that during indexing the PDF is pulled apart
and each part is tagged as HTML (i.e. the title goes into
<title></title> and the text into <body></body>)?

Is the process then failing to put that HTML into the index?
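
As a rough illustration of the conversion asked about here, the sketch
below wraps an extracted title and body text in a minimal HTML document.
It is not the actual SWISH::Filters::Pdf2HTML code; the function name
wrap_as_html and the sample body text are hypothetical, and only the
title and meta lines mirror the DirTree.pl output quoted earlier.

```python
from html import escape


def wrap_as_html(title, text):
    """Wrap an extracted PDF title and text in a minimal HTML document.

    Hypothetical helper for illustration only; the real filter may
    produce different markup.
    """
    safe_title = escape(title, quote=True)
    return (
        "<html><head>\n"
        f"<title>{safe_title}</title>\n"
        f'<meta name="title" content="{safe_title}">\n'
        "</head>\n"
        f"<body>\n{escape(text)}\n</body></html>\n"
    )


# Sample values: the title comes from the post, the body text is made up.
print(wrap_as_html("Jan Feb 06", "Text extracted from the PDF..."))
```

If the indexer receives a document like this and parses it as HTML, the
title would be expected to come from the <title> element and the
description from <body>.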

I put the old pdftotext FileFilter back in and this runs OK, just
without the title attribute working.


Thanks again for your help.

Luke
Received on Fri Jul 14 08:36:17 2006