RE: Indexing pdf files

From: <Rainer.Scherg(at)not-real.rexroth.de>
Date: Tue Jul 31 2001 - 17:13:24 GMT
Please check whether "pdf-filter.sh" has the correct $PATH available
(use  set >/tmp/debug-env  or something like that).

In the worst case, hard-code the full path to pdftotext in the filter
script.
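
Both suggestions can be combined in the filter itself. The following is only a sketch under assumptions, not the pdf-filter.sh shipped with swish-e: it assumes that swish-e passes the file to convert as the first argument and reads the converted text from standard output, and that pdftotext is the executable /usr/local/pdftotext (guessed from the directory listing in the quoted mail). The function names and the DEBUG_ENV variable are made up for illustration.

```shell
#!/bin/sh
# Sketch of a pdf-filter.sh that does not depend on $PATH.
# Assumption: the file to convert arrives as $1 and the converted
# text is expected on standard output.

# Assumed location of pdftotext; adjust to the real install path.
PDFTOTEXT="${PDFTOTEXT:-/usr/local/pdftotext}"
# Hypothetical debug file for inspecting the filter's environment.
DEBUG_ENV="${DEBUG_ENV:-/tmp/debug-env}"

# Record the environment the filter actually runs with (including PATH).
dump_env() {
    env > "$DEBUG_ENV"
}

# Convert one PDF to plain text; "-" tells pdftotext to write to stdout.
pdf_to_text() {
    "$PDFTOTEXT" "$1" -
}

# Only do work when a file name was passed in.
if [ "$#" -ge 1 ]; then
    dump_env
    pdf_to_text "$1"
fi
```

After an indexing run, compare the PATH line in /tmp/debug-env with the PATH of the shell in which pdftotext works by hand. Once the problem is found, the dump_env call can be removed.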


cu - rainer

> -----Original Message-----
> From: Kemp Randy-W18971 [mailto:Randy.L.Kemp@motorola.com]
> Sent: Tuesday, July 31, 2001 6:58 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Indexing pdf files
> 
> 
> I can't for the life of me get my PDF files to index.  My executables are
> in /usr/local/ and they work individually, including pdftotext.  Could
> someone please help me with the filters on the Solaris SPARC 5.6 platform
> with swish-e 2.0.5?
> 
> -----------------------> executables for pdftotext and swish-e <-----------------------
> ee110:/usr/local> ls
> bin           conf          lib           perl          swish-e
> cmold.d       ecg           pdftotext     psionic       swish-search
> 
> I am running version 2.0.5 of swish-e
> 
> -----------------------------> pdf file <-----------------------------
> My pdf file is located in ee110:/usr2/apache/htdocs/pdffiles> ls
> requirements.pdf
> 
> -----------------------------> config file at
> ee110:/usr2/ecadtesting/swishe-index> ls
> index.swish      search.log       swisheconf.conf
> 
> My config file is
> 
> # Sample SWISH configuration file
> 
> # Global Networks Technical Support, support@gobalnetworks.com, 5/10/96
> 
> 
> 
> #IndexDir /usr/home/globalne/usr/local/etc/httpd/htdocs/
> IndexDir /usr2/apache/htdocs/pdffiles/
> 
> 
> 
> # This is a space-separated list of files and
> 
> # directories you want indexed. You can specify
> 
> # more than one of these directives.
> 
> # Be sure to change globalne to be your Server login name.
> 
> 
> 
> IndexFile /usr2/ecadtesting/swishe-index/index.swish
> 
> # This is what the generated index file will be.
> 
> 
> 
> IndexName "PCS Web Page Index"
> 
> IndexDescription "This is a full index of the PCS web site."
> 
> IndexPointer "http://ee110.ecg.csg.mot.com:8000/cgi-bin/search.cgi"
> 
> IndexAdmin "PCS Technical Support (Randy.L.Kemp@motorola.com)"
> 
> # Extra information you can include in the index file.
> 
> # You probably want to change the Global Networks references.
> 
> 
> 
> IndexOnly .html .htm .txt .gif .xbm .au .mov .mpg
> 
> # Only files with these suffixes will be indexed.
> 
> 
> 
> IndexReport 3
> 
> # This is how detailed you want reporting. You can specify numbers
> 
> # 0 to 3 - 0 is totally silent, 3 is the most verbose.
> 
> 
> 
> FollowSymLinks yes
> 
> # Put "yes" to follow symbolic links in indexing, else "no".
> 
> 
> 
> NoContents .gif .xbm .au .mov .mpg
> 
> # Files with these suffixes will not have their contents indexed -
> 
> # only their file names will be indexed.
> 
> 
> 
> #ReplaceRules replace "/usr/home/globalne/usr/local/etc/httpd/htdocs" "http://www.globalnetworks.com"
> ReplaceRules replace "/usr2/apache/htdocs" "http://ee110.ecg.csg.mot.com:8000"
> 
> 
> # ReplaceRules allow you to make changes to file pathnames
> 
> # before they're indexed.
> 
> # Be sure to change globalne to be your Server login name.
> 
> 
> 
> FileRules pathname contains admin testing demo trash construction confidential
> 
> FileRules filename is index.html
> 
> FileRules filename contains # % ~ .bak .orig .old old.
> 
> FileRules title contains construction example pointers
> 
> FileRules directory contains .htaccess
> 
> # Files matching the above criteria will *not* be indexed.
> 
> 
> 
> IgnoreLimit 50 100
> 
> # This automatically omits words that appear too often in the files
> 
> # (these words are called stopwords). Specify a whole percentage
> 
> # and a number, such as "80 256". This omits words that occur in
> 
> # over 80% of the files and appear in over 256 files. Comment out
> 
> # to turn off auto-stopwording.
> 
> 
> 
> IgnoreWords SwishDefault
> 
> # The IgnoreWords option allows you to specify words to ignore.
> 
> # Comment out for no stopwords; the word "SwishDefault" will
> 
> # include a list of default stopwords. Words should be separated by spaces
> 
> # and may span multiple directives.
> 
> FilterDir /usr2/ecadtesting/shellscripts/
> FileFilter .pdf pdf-filter.sh
> 
> ------------------> Test results with pdf (html docs in directory htdocs work ok) <------------------
> My test results are:
> 
> ee110:/usr2/ecadtesting/shellscripts> ls
> dailystats.sh    ncftpput.sh      rkgraph001.sh    webalizer.sh
> http-analyze.sh  pdf-filter.sh    swishe.sh
> Checking dir "/usr2/apache/htdocs/pdffiles/"...
> 
> Removing very common words...
> 336 words removed.
> 0 words removed not in common words array:
> 
> Writing main index...
> Computing hash table ...
> Writing header ...
> Writing index entries ...
> Writing stopwords ...
> no unique words indexed.
> Writing file index...
> Writing file list ...
> Writing file offsets ...
> Writing MetaNames ...
> Writing offsets (2)...
> no files indexed.
> Running time: Less than a second.
> Indexing done!
> ee110:/usr2/ecadtesting/shellscripts> 
> 
> ee110:/usr2/ecadtesting/shellscripts> more swishe.sh
> /usr/local/swish-e -c /usr2/ecadtesting/swishe-index/swisheconf.conf
Received on Tue Jul 31 17:13:58 2001