Re: [swish-e] Problems indexing PDFs not in root web directory and raw numbers

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Sep 26 2007 - 15:47:56 GMT
On 09/24/2007 02:20 PM, Parker, Peter A CONTRACTOR WRAIR-Wash DC wrote:

> 1) I have noticed that indexing of PDF files seems to be limited to the
> root directory. I have PDFs in the root directory and ones in a sub
> directory. Only PDFs in the root directory ever appear in search
> results. It is my understanding that swish-e automatically recurses
> subdirectories when indexing. Is this not also the case with indexing of
> PDFs?
> 

Indexing should be recursive for PDFs as well. All swish-e does is fork/exec the
FileFilter command when it encounters a file matching the file extension.
Are you running with the -v option? That will help you debug.

> 2) I have also noticed that Swish-e does not seem to be indexing numbers
> inside of Excel or other Office files very well. When I search for a
> number I know to be in an indexed file, for example 22469, the search
> often yields no results.
> 
> Here is the contents of my configuration file:
> 
> IndexFile index.swish-e
> IndexDir /var/www/html
> IndexDir /var/www/twiki/data
> FollowSymLinks yes
> WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-
> IgnoreFirstChar .-
> IgnoreLastChar  .-
> BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789
> EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789
> ReplaceRules remove /var/www/html
> FollowSymLinks yes
> IndexReport 2
> IgnoreWords file: /var/www/swish-e/share/doc/swish-e/examples/conf/stopwords/english.txt
> TranslateCharacters :ascii7:
> BumpPositionCounterCharacters |.
> IndexOnly .html .htm .doc .ppt .xls .pdf .rtf .txt .jpg .bmp .png
> NoContents .jpg .gif .bmp .png .ico
> FileFilter .pdf share/doc/swish-e/examples/filter-bin/_pdf2html.pl
> IndexContents HTML .pdf
> 

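I don't see a FileFilter for .xls, .rtf, .ppt, or .doc. Without one, swish-e
indexes the raw file contents, and numbers in a binary Office file are generally
not stored as plain text, which would likely explain the failed searches.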
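
As a rough sketch of what such filter lines might look like, assuming the
catdoc package (which provides catdoc and xls2csv) is installed and on your
PATH. The tool choice and options here are assumptions for illustration, not
something taken from your setup; .ppt and .rtf would need converters of their
own along the same lines:

```
# Sketch only: assumes catdoc and xls2csv (both from the catdoc package)
# are installed. %p is replaced with the full path of the file to filter.
FileFilter .doc catdoc "'%p'"
FileFilter .xls xls2csv "'%p'"

# The filters emit plain text, so parse their output as text
IndexContents TXT* .doc .xls
```

With something like this in place, a number such as 22469 in a spreadsheet
cell should reach the indexer as ordinary text.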

All your problems might magically go away if you used the DirTree.pl script
instead. It handles all the filtering and recursion for you with lots of sane
defaults.
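
For what it's worth, a minimal sketch of a DirTree.pl setup, assuming the
script has been copied to the current directory and the same two directories
are being indexed. The paths are assumptions; the documentation embedded in
the script itself is the authoritative reference for its usage:

```
# swish.conf (sketch; paths are assumptions)
IndexFile index.swish-e

# With -S prog, IndexDir names the program to run rather than a directory
IndexDir ./DirTree.pl

# Passed to DirTree.pl as its arguments: the directories to walk
SwishProgParameters /var/www/html /var/www/twiki/data

# Then index with:
#   swish-e -c swish.conf -S prog -v 2
```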

-- 
Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Wed Sep 26 11:47:57 2007