On Sat, Jul 28, 2007 at 01:43:07PM +1000, Dr Michael Daly wrote:
> Dear list
> If anyone can solve this mystery, it would be great! Swish-e 2.4.5 (on
> centos) is failing to index some pdf documents. Here is the index file:
> IndexDir /home/server_dir/Resources/Research/2007
> ReplaceRules remove /home/server_dir/
> IndexOnly .htm .html .txt .doc .pdf
> IndexContents TXT* .txt
> DefaultContents HTML*
> ParserWarnLevel 9
> IndexFile /home/indices/for_index4.index
>
> 4. this seems to work:
> swish-e -i /home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf -T indexed_words | less
> e.g.
> Warning: Substituted 86 embedded null character(s) in file
> '/home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf'
> with a newline
Your config does not tell Swish-e how to convert the PDF file to text,
so it is reading the raw PDF bytes, which is probably why you see the
embedded-null warning. You need to specify a filter, or use a script
like spider.pl or DirTree.pl that knows to run pdftotext on the PDF to
convert it to text.
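
For the filter route, something along these lines in the config file
might be enough. This is only a sketch: it assumes pdftotext (from
xpdf) is installed and on your PATH, and I have not tested it against
your setup.

```
# Assumption: pdftotext is installed and found on the PATH.
# %p is replaced by the full path of the file being indexed;
# the trailing "-" makes pdftotext write the text to stdout.
FileFilter .pdf pdftotext "'%p' -"

# The filter output is plain text, so parse .pdf with the TXT parser
# instead of falling through to DefaultContents HTML*.
IndexContents TXT* .pdf
```

With that in place, re-running the same swish-e -T indexed_words test
on one PDF should show real words rather than the null-character
warning.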
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Sun Jul 29 01:19:25 2007