Hi Bill,
I got the config to alias the meta name swishtitle to title:
[root@tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | grep title
<title>Jan Feb 06</title>
<meta name="title" content="Jan Feb 06">
But without a filter, the HTML output generated from the PDF does not
appear to be parsed into the index. After indexing, nothing from the
PDF shows up in the search (CGI), including the title.
Do I need to add Pdf2HTML as a FileFilter in the config, and also
make the changes that Peter Karman suggested? (Thanks, Peter.)
# Does this or anything need to go here?
FileFilter .pdf /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm
DefaultContents HTML*
StoreDescription HTML* <body> 200000
# MetaNames swishdefault
Am I right to believe that, when indexing, the process pulls the PDF
apart and tags each part as HTML (i.e. the title goes into
<title></title> and the text into <body></body>)?
If so, is that HTML then not being put into the index?
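To make sure I am describing the same thing, here is a rough sketch of what I assume the conversion step does with the pieces of the PDF. It is only an illustration in Python, not the actual Pdf2HTML.pm code; the function name pdf_parts_to_html is made up, and the <pre> wrapper around the body text is my guess.

```python
from html import escape


def pdf_parts_to_html(title: str, text: str) -> str:
    """Hypothetical helper: wrap an extracted PDF title and body text in minimal HTML."""
    # Escape the values so characters such as < and & cannot break the markup.
    safe_title = escape(title, quote=True)
    safe_text = escape(text)
    return (
        "<html><head>\n"
        f"<title>{safe_title}</title>\n"
        f'<meta name="title" content="{safe_title}">\n'
        "</head>\n"
        f"<body><pre>{safe_text}</pre></body></html>\n"
    )


if __name__ == "__main__":
    # The title is the one from the DirTree.pl output above; the body is a placeholder.
    print(pdf_parts_to_html("Jan Feb 06", "text extracted from the PDF"))
```

If that picture is right, the indexer would need to treat this output as HTML for the <title> and <body> tags to mean anything.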
I added the old pdftotext FileFilter back in, and it runs OK, just
without the title attribute working.
Thanks again for your help
Luke
Received on Fri Jul 14 08:36:17 2006