Re: [swish-e] Want solution for indexing the .pdf files with meaningful

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Aug 10 2007 - 14:31:58 GMT
On Fri, Aug 10, 2007 at 12:29:56AM -0700, Udaya Gajanayake wrote:
> Please give me a solution for indexing .pdf files with meaningful words.
> I have included my swish.conf file for .pdf and .html below. I got reasonable results for .html but not for .pdf. I put the swish.conf file into the swish-e installation folder and ran it using “swish-e -c swish.conf” at the command prompt. I listed all the indexed words using the
> “swish-e -k*” command.
>  
> The document folder contains 5 files (4 PDF files and 1 HTML file):
> (C:/Inetpub/wwwroot/cepa/docs/pdf)
>  
> But the indexed words are:
> PDF: 289 unique words indexed, not meaningful
> HTML: 616 unique words indexed, meaningful
>  
> ---------------For PDF---------------------------------------------------------
> IndexDir C:/Inetpub/wwwroot/cepa/docs/pdf
> SwishProgParameters C:/Inetpub/wwwroot/cepa/docs/pdf
> IndexOnly .pdf
> IndexContents TXT .pdf
> WordCharacters abcdefghijklmnopqrstuvwxyz
> IgnoreFirstChar .-
> IgnoreLastChar  .-
> BeginCharacters abcdefghijklmnopqrstuvwxyz 
> EndCharacters   abcdefghijklmnopqrstuvwxyz 
> TranslateCharacters :ascii7:
> FollowSymLinks yes
> BumpPositionCounterCharacters |.
> IndexReport 4
> DefaultContents TXT*
> StoreDescription TXT * <body> 200000
> UndefinedMetaTags auto
> ReplaceRules remove C:/Inetpub/wwwroot/cepa/docs/pdf

Don't add items to the configuration unless you have a specific reason
to do so.

I don't see anything above that would result in *extracting* the text
from pdf files.
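
As a rough illustration of what is missing, here is a minimal sketch of a
configuration that runs each PDF through an external text extractor before
indexing. It assumes the pdftotext program (from Xpdf or Poppler) is
installed and on the PATH, which nothing in the original message confirms,
and the quoting of the filter options is an assumption that may need
adjusting on Windows. Treat it as a starting point, not a tested setup.

```
# Sketch only: assumes pdftotext is installed and on the PATH.
IndexDir C:/Inetpub/wwwroot/cepa/docs/pdf
IndexOnly .pdf

# Pass each .pdf file through pdftotext before indexing.
# %p is replaced with the path of the file; "-" sends the text to stdout.
# On Windows the inner quotes may need to be double quotes
# (and the outer ones single quotes) for the path to be passed correctly.
FileFilter .pdf pdftotext "'%p' -"

# Parse the filter's output as plain text.
IndexContents TXT* .pdf
```

Without a filter like this, the raw bytes of the PDF are handed to the text
parser, which would be consistent with the short list of meaningless words
reported above.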

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Fri Aug 10 10:32:01 2007