Re: Swish-e PDF titles in search results

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 14 2006 - 16:44:57 GMT
On Fri, Jul 14, 2006 at 04:32:14PM +0100, Luke Simmons wrote:
> [root (at) tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | grep title
> 
> <title>Jan Feb 06</title>
> <meta name="title" content="Jan Feb 06">
> 
> But without a filter it appears to not be parsing the html output  
> from the pdf to the index. So after an index it doesn't show anything  
> up in the search (cgi) including the title.
> 
> Do I need to add pdf2HTML as a file filter in the config? And also  
> make the changes that Peter Karman suggested?  (thanks Peter)
> 
> FileFilter .pdf /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm    # Does this or anything need to go here?

No, again, as you can see from the output of DirTree.pl it's
producing *html*, so you don't want to tell swish to convert it to
html again.  It's already been converted.
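
For what it's worth, a minimal config sketch for this kind of setup might
look like the following.  The file name swish.conf and the /path/to/pdfs
directory are placeholders, not taken from your setup.  The only point is
that there is no FileFilter line for .pdf, because DirTree.pl has already
done the conversion.

```
# swish.conf (hypothetical file name)
# Run DirTree.pl as the external program; it converts PDFs via SWISH::Filter.
IndexDir /usr/local/lib/swish-e/DirTree.pl
# Directory handed to DirTree.pl (placeholder path).
SwishProgParameters /path/to/pdfs
# Treat documents as HTML by default and store the body as the description.
DefaultContents HTML*
StoreDescription HTML* <body> 200000
# Deliberately no "FileFilter .pdf ..." line here.
```

With a config like that you would index with "swish-e -c swish.conf -S prog".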


> DefaultContents HTML*
> StoreDescription HTML* <body> 200000

So, if you run swish-e from the command line, is the body stored in
the swishdescription property?


> Am I right to believe that when indexing the process pulls the PDF  
> apart and each part is HTML tagged up (i.e. title > <title></title>  
> and the text snippet to <body></body>)?

Not indexing (swish-e), but DirTree.pl does that (by using
SWISH::Filter).  Look at DirTree.pl's output.


> Is the process then not putting the HTML into the index?

Should be.  You can use -T indexed_words properties to see what's
ending up in the index while indexing.
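
A sketch of that, assuming the config file is called swish.conf (a
placeholder name) and that you are indexing with -S prog:

```shell
# Placeholder config file name; substitute your own.
conf="swish.conf"

# Index with debugging on: dump the words and properties as they are indexed.
debug_cmd="swish-e -c $conf -S prog -T indexed_words properties"

if command -v swish-e >/dev/null 2>&1; then
    $debug_cmd
else
    # swish-e is not on the PATH here, so just show the command.
    echo "would run: $debug_cmd"
fi
```

The output is long, so it is probably easiest to pipe it through a pager
or grep for the title you expect.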

> I added the old FileFilter of pdftotext in and this runs ok just  
> without the title attribute working.

That shouldn't work.  pdftotext isn't very good at converting html to
text.

-- 
Bill Moseley
moseley@hank.org

Received on Fri Jul 14 09:44:59 2006