On Fri, Aug 10, 2007 at 12:29:56AM -0700, Udaya Gajanayake wrote:
> Please give me a solution for indexing .pdf files so that meaningful words are indexed.
> I have included my swish.conf file for .pdf and .html. I got reasonable results for .html but not for .pdf. I put the swish.conf file into the swish-e installation folder and ran it using "swish-e -c swish.conf" at the command prompt. I listed all the indexed words using the
> "swish-e -k*" command.
>
> The document folder contains 5 files (4 PDF files and 1 HTML file):
> (C:/Inetpub/wwwroot/cepa/docs/pdf)
>
> But the indexed words are:
> PDF: 289 unique words indexed, not meaningful
> HTML: 616 unique words indexed, meaningful
>
> ---------------For PDF---------------------------------------------------------
> IndexDir C:/Inetpub/wwwroot/cepa/docs/pdf
> SwishProgParameters C:/Inetpub/wwwroot/cepa/docs/pdf
> IndexOnly .pdf
> IndexContents TXT .pdf
> WordCharacters abcdefghijklmnopqrstuvwxyz
> IgnoreFirstChar .-
> IgnoreLastChar .-
> BeginCharacters abcdefghijklmnopqrstuvwxyz
> EndCharacters abcdefghijklmnopqrstuvwxyz
> TranslateCharacters :ascii7:
> FollowSymLinks yes
> BumpPositionCounterCharacters |.
> IndexReport 4
> DefaultContents TXT*
> StoreDescription TXT * <body> 200000
> UndefinedMetaTags auto
> ReplaceRules remove C:/Inetpub/wwwroot/cepa/docs/pdf

Don't add items to the configuration unless you have a specific reason
to do so.

I don't see anything above that would result in *extracting* the text
from PDF files.
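
For illustration, here is a minimal sketch of a configuration that does
extract the text before indexing. It assumes the pdftotext utility (part
of Xpdf) is installed and on the PATH; that tool is not mentioned in the
configuration above. The quoting around %p is the Unix form and may need
adjusting on Windows.

    # Directory to index (taken from the configuration above)
    IndexDir C:/Inetpub/wwwroot/cepa/docs/pdf
    IndexOnly .pdf

    # Assumption: pdftotext is available. %p is replaced with the path of
    # each file, and the trailing "-" sends the extracted text to stdout.
    FileFilter .pdf pdftotext "'%p' -"

    # Parse the filter's output as plain text
    IndexContents TXT* .pdf

Start with only these lines, confirm that "swish-e -k*" lists real words,
and then add other directives one at a time as you need them.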
Received on Fri Aug 10 10:32:01 2007