Hi all -
I'm a newbie user of swish-e, using version 2.4.5 on a 32-bit Linux
machine running RHEL5.3, Perl 5.8.8. I've installed catdoc
(catdoc-0.94.2-1.el5.rf), xpdf Linux binaries
(xpdf-3.02pl2-linux.tar.gz), all the possible CPAN Perl module
prerequisites for swish-e, and built the SWISH API.
I'm setting up swish-e as a search engine for our intranet site,
replacing htdig. I tried to build an index using the following
config file, swish.conf:
# IndexReport controls how detailed the report is while indexing.
IndexReport 3
# Directory to index
IndexDir /http/intranet
# Filters - these are extra programs used to decode content
FileFilter .doc /usr/bin/catdoc
FileFilter .pdf /usr/local/bin/pdftotext
FileFilter .ppt /usr/bin/catppt
FileFilter .xls /usr/bin/xls2csv
...and using the following command:
swish-e -c swish.conf -S fs
When I run this to build the index, I get a lot of strange characters
in the output (I should mention that the filesystem I'm trying to
index contains the full gamut of Unix and Windows file types), and one
very odd phenomenon: after the index is built, all the PDF files have
been reduced to 1-3 bytes in size. When the swish-e documentation
talks about using filters to convert binary files for indexing, I
assumed "convert" meant "parse", leaving the original untouched; there
is no point indexing PDF files if the originals are lost after indexing!
Does this seem like user error, or a bug in Xpdf's pdftotext, or a bug
in swish-e?
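
To help narrow that down, here is a rough sketch of how a filter could be tried by hand against a throwaway copy of a file, so no more originals get damaged. The function name check_filter and the example.pdf path are made up for illustration. The second usage line, which passes the same path as both input and output, is only my guess at one way a file could end up overwritten; I don't know how swish-e actually invokes the filter.

```shell
# check_filter SRC 'COMMAND STRING'
# Copies SRC into a temporary directory, runs the command string with the
# copy's path available as $f, and prints whether the copy was changed.
check_filter() {
    src="$1"
    cmd="$2"
    tmpdir=$(mktemp -d) || return 1
    copy="$tmpdir/$(basename "$src")"
    cp "$src" "$copy" || { rm -rf "$tmpdir"; return 1; }
    # Run the filter; whatever it prints is discarded, only side effects matter.
    f="$copy" sh -c "$cmd" >/dev/null 2>&1
    if cmp -s "$src" "$copy"; then
        result="unchanged"
    else
        result="modified"
    fi
    rm -rf "$tmpdir"
    echo "$result"
}

# Usage (example.pdf is a placeholder):
# check_filter /http/intranet/example.pdf '/usr/local/bin/pdftotext "$f" -'
# check_filter /http/intranet/example.pdf '/usr/local/bin/pdftotext "$f" "$f"'
```

If a call reports "modified", the filter as invoked is writing over its input, which would at least separate the pdftotext behaviour from whatever swish-e does around it.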
I know I don't get this problem with a more fully-featured config
file, like the following:
# IndexReport controls how detailed the report is while indexing
IndexReport 2
# Directory to index
IndexDir /http/intranet
# What to index
IndexOnly .htm .html .ppt .doc .pdf .ppt .shtml .txt
# Exclude all files of these types
FileRules filename contains \.(gif|jpg|pro|asp|php|png|css|js)$
# Filters - these are extra programs used to decode content
FileFilter .doc /usr/bin/catdoc
FileFilter .pdf /usr/local/bin/pdftotext
FileFilter .ppt /usr/bin/catppt
FileFilter .xls /usr/bin/xls2csv
Also, to get the output of the index build recorded, do I have to
explicitly redirect the output of the index command to a file, or is
there a log of problems encountered during indexing? I haven't seen
this detail in the documentation.
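
In case redirection is the only option, this is the kind of wrapper I would fall back on. It is plain shell redirection and does not rely on any swish-e logging feature; the function name run_logged and the log file name swish-index.log are placeholders of my own.

```shell
# Placeholder log location; override by setting LOG_FILE beforehand.
LOG_FILE="${LOG_FILE:-swish-index.log}"

# Run any command, merging stderr into stdout, and write everything to
# $LOG_FILE while still showing it on the terminal.
run_logged() {
    "$@" 2>&1 | tee "$LOG_FILE"
}

# Usage with the index command from above:
# run_logged swish-e -c swish.conf -S fs
```

Note that because the output goes through tee, the exit status seen by the shell is tee's, not that of the indexing command.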
Thanks for any help with this issue!
Greg
Received on Thu Feb 26 17:04:05 2009