On Wed, Sep 05, 2007 at 01:06:43PM -0600, rmspamfilter@gmail.com wrote:
> I am trying to index Microsoft Word documents, Excel files, and PDFs. I do
> not want to index the content, just the titles.
> I have the following config:
>
> # Example Swish-e Configuration file
> FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"
> FileFilter .pdf pdftotext "%p -"
>
> # Define *what* to index
> # IndexDir can point to directories and/or files
> # Here it's pointing to the current directory
> # Swish-e will also recurse into sub-directories.
> IndexDir /opt/samba/CNR
>
> # But only index the .doc and .pdf files
> IndexOnly .doc .pdf
>
> # Show basic info while indexing
> IndexReport 1
>
>
> Now I know this indexes the content inside the files, but I do not want to
> index the content,
I haven't used this in a while, but you might try NoContents:
IndexContents HTML* .doc .pdf
NoContents .doc .pdf
That probably won't work for the .doc file because catdoc doesn't spit
out HTML (so there is no <title> to look for). Same for the .pdf file,
since pdftotext outputs plain text too.
What I'd do is find tools to extract the titles from .doc and .pdf
(pdfinfo for .pdf comes to mind) and either generate a simple HTML
file or have the filter output just the title.
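
Here is a rough sketch of the "generate a simple HTML file" idea for PDFs,
in Python. It is untested against a real index and rests on a few
assumptions: that pdfinfo is on the PATH and prints a "Title:" line when the
PDF has a title set, and that falling back to the file name is acceptable
when it does not. The function names and the script name pdf_title_filter.py
are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical Swish-e filter: print a title-only HTML page for one PDF."""
import html
import os
import subprocess
import sys


def parse_pdfinfo_title(pdfinfo_output):
    """Return the value of the 'Title:' line in pdfinfo output, or ''."""
    for line in pdfinfo_output.splitlines():
        # Split on the first colon only, so titles containing ':' survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip()
    return ""


def title_to_html(title):
    """Wrap a title in a minimal HTML document with an empty body."""
    return (
        "<html><head><title>{}</title></head>"
        "<body></body></html>\n".format(html.escape(title))
    )


def pdf_title(path):
    """Run pdfinfo on path; fall back to the file name if no title is found."""
    # Assumption: pdfinfo is installed and on the PATH.
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=False
    )
    return parse_pdfinfo_title(result.stdout) or os.path.basename(path)


# Only act when called with exactly one argument (the file path).
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(title_to_html(pdf_title(sys.argv[1])))
```

If that works, the config would presumably point the .pdf filter at the
script instead of pdftotext, something like
FileFilter .pdf /usr/local/bin/pdf_title_filter.py "%p"
together with the IndexContents HTML* line, so the HTML parser sees the
<title>. The install path is a guess, and .doc files would need a
different tool for the title.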
--
Bill Moseley
moseley@hank.org
Received on Wed Sep 5 18:17:55 2007