Hi Bill,
I am emailing you in reference to a problem I am having with Swish-e.
I have found and followed this reply in your discussion list:
http://www.swish-e.org/archive/2005-02/9062.html
The person you helped had a problem with PDF files being indexed
without the title metadata being used as swishtitle. The swishtitle
showed the filename of the PDF instead, and that filename then
appeared in the search results. HTML results were fine, though.
I too am experiencing this problem and, despite carefully following
the instructions in your reply, I still cannot get the PDFs to index
correctly and present the document title as the swishtitle. My
version of xpdf is up to date.
After completing the section on how DirTree.pl deals with the file
(i.e. outputting the metadata contents of the PDF), I get:
[root@tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | head -30
<head>
<meta name="author" content="A person">
<meta name="creationdate" content="Tue Jan 3 11:10:41 2006">
<meta name="creator" content="QuarkXPress: pictwpstops filter 1.0">
<meta name="encrypted" content="no">
<meta name="file_size" content="2711063 bytes">
<meta name="moddate" content="Thu Jul 13 10:42:32 2006">
<meta name="optimized" content="no">
<meta name="page_size" content="595 x 842 pts (A4)">
<meta name="pages" content="36">
<meta name="pdf_version" content="1.5">
<meta name="producer" content="Acrobat Distiller 6.0.1 for Macintosh">
<meta name="tagged" content="no">
<meta name="title" content="Jan Feb 06">
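For anyone reproducing this, the title can be isolated from output like the above with a small shell filter. This is only a sketch: `extract_title` is a name made up for illustration, and it assumes the meta tag sits on a single line in exactly the lowercase form shown above.

```shell
# Hypothetical helper (not part of Swish-e): reads DirTree.pl-style HTML
# on stdin and prints the content of the <meta name="title"> tag, if any.
# Assumes the tag is on one line, formatted as in the output above.
extract_title() {
    sed -n 's/.*<meta name="title" content="\([^"]*\)">.*/\1/p'
}
```

It would be used as `/usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | extract_title`, which prints nothing when no title tag is present.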
I can get as far as the step in your instructions where you ask for
the following to be run:
[root@tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | swish-e -S prog -i stdin -c ../../cgi-bin/archswish/swish.conf -v0 -T properties
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
swishdocpath: 6 ( 16) S: "./edjanfeb06.pdf"
swishdocsize: 8 ( 4) N: "140528"
swishlastmodified: 9 ( 4) D: "2006-07-13 10:51:56 BST"
Warning: Unknown header line: '38' from program stdin
Everything up to this point works correctly, and I have put the
"PropertyNameAlias swishtitle title" into swish.conf (my swish config
file). Is there a specific place this should sit?
(contents of config file - swish.conf -
IndexDir / -- server path -- /archive
IndexOnly .htm .pdf .php
MaxWordLimit 15
PropertyNameAlias swishtitle title
DefaultContents HTML*
StoreDescription HTML* <body> 200000
MetaNames swishdocpath swishtitle
#MetaNames swishdefault
ReplaceRules remove / -- server path -- /archive/
FileFilter .pdf /usr/local/bin/pdftotext "'%p' -"
)
Also, although the output suggests the file may not be a PDF, the
same errors occur for every PDF the command is run on. The file is
not damaged either: it was a fresh document (I also ran this on a
newly created, clean PDF and got the same errors).
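One way to double-check, independently of xpdf, that a file really starts as a PDF is to look at its first bytes, since PDF files normally begin with the marker `%PDF-`. A minimal sketch follows. `looks_like_pdf` is a hypothetical helper name, and `head -c` is assumed to be available (it is on GNU and BSD systems).

```shell
# Hypothetical helper: succeeds (exit status 0) if the file's first five
# bytes are the usual PDF header marker "%PDF-", and fails otherwise.
# This checks the header only; it says nothing about the xref table.
looks_like_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}
```

For example, `looks_like_pdf edjanfeb06.pdf && echo "header OK"` prints the message only when the header marker is found.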
I found the bit about PDF titles in Filters.pm, at the end where the
comments suggest the inclusion of
my %user_data;
$user_data{pdf}{title_tag} = 'title';
$was_filtered = $filter->filter(
    document  => $filename,
    user_data => \%user_data,
);
into Pdf2HTML.pm. However, I was unsure a) where to put this and
b) whether it is required at all if the PropertyNameAlias directive
is working.
I would therefore be grateful for your help in working out what I am
doing wrong.
Thanks
Luke
Received on Thu Jul 13 06:45:37 2006