Re: [swish-e] Index Doc , excel , pdf Titles Only

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Sep 05 2007 - 22:17:55 GMT
On Wed, Sep 05, 2007 at 01:06:43PM -0600, rmspamfilter@gmail.com wrote:
> I am trying to index Microsoft Document , Excel and PDF's. I do not want to
> index the content but just the titles.
> I have the following config
> 
>   # Example Swish-e Configuration file
> FileFilter .doc       /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"
> FileFilter .pdf       pdftotext   "%p -"
> 
>     # Define *what* to index
>     # IndexDir can point to a directories and/or a files
>     # Here it's pointing to the current directory
>     # Swish-e will also recurse into sub-directories.
>     IndexDir /opt/samba/CNR
> 
>     # But only index the .html files
>     IndexOnly .doc .pdf
> 
>     # Show basic info while indexing
>     IndexReport 1
> 
> 
> Now i know the index the content inside the files but i do not want to index
> the content,

I haven't used this in a while, but you might try NoContents:


    IndexContents HTML* .doc .pdf
    NoContents .doc .pdf

That probably won't work for the .doc files, because catdoc emits plain
text rather than HTML, so there is no <title> for swish-e to find. The
same goes for the .pdf files, since pdftotext also emits plain text.

What I'd do is find tools to extract the titles from .doc and .pdf
(pdfinfo for .pdf comes to mind) and either generate a simple HTML
file containing the title or have the filter output just the title.
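
As a rough illustration of the pdfinfo idea, here is a minimal sketch of
a filter that prints a tiny HTML page containing only the PDF's title.
It is untested against swish-e itself. The script name
pdf_title_filter.py and the helpers parse_title and title_html are
hypothetical, and it assumes pdfinfo is on the PATH and prints its usual
"Title: ..." metadata line. When the PDF has no title it falls back to
the file name, which is my own choice and not anything swish-e requires.

```python
#!/usr/bin/env python3
"""Sketch: print a minimal HTML page holding only a PDF's title."""
import html
import os
import subprocess
import sys


def parse_title(pdfinfo_output):
    """Return the value of the 'Title:' line in pdfinfo output, or None."""
    for line in pdfinfo_output.splitlines():
        # Split on the first colon only, so titles may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def title_html(title):
    """Wrap a title in a minimal HTML document, escaping markup characters."""
    return "<html><head><title>{}</title></head><body></body></html>\n".format(
        html.escape(title)
    )


def main(path):
    # pdfinfo prints "Key: value" metadata lines for the given PDF.
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, errors="replace"
    )
    title = parse_title(result.stdout) if result.returncode == 0 else None
    # Assumption: fall back to the file name when no Title metadata exists.
    sys.stdout.write(title_html(title or os.path.basename(path)))
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumes the document path is passed as the only argument.
    sys.exit(main(sys.argv[1]))
```

It could presumably be hooked in with a line such as
FileFilter .pdf /path/to/pdf_title_filter.py "%p" (the path is a
placeholder), keeping .pdf in the IndexContents HTML* line so the output
goes through the HTML parser. Something similar would be needed for .doc
files, using whatever tool can report their title.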



-- 
Bill Moseley
moseley@hank.org

Received on Wed Sep 5 18:17:55 2007