[swish-e] After swish-e index built, PDF file sizes changed to 1-3 bytes

From: Greg Keith <Greg.Keith(at)not-real.noaa.gov>
Date: Thu Feb 26 2009 - 22:04:00 GMT
Hi all -

I'm a newbie user of swish-e, using version 2.4.5 on a 32-bit Linux
machine running RHEL5.3, Perl 5.8.8. I've installed catdoc
(catdoc-0.94.2-1.el5.rf), xpdf Linux binaries
(xpdf-3.02pl2-linux.tar.gz), all the possible CPAN Perl module
prerequisites for swish-e, and built the SWISH API.

I'm setting up swish-e as a search engine for our intranet site,
replacing htdig. I had attempted to build an index using the following
config file as swish.conf:

# IndexReport controls how detailed the report is while indexing.
IndexReport 3

# Directory to index
IndexDir /http/intranet

# Filters - these are extra programs used to decode content
FileFilter .doc       /usr/bin/catdoc
FileFilter .pdf       /usr/local/bin/pdftotext
FileFilter .ppt       /usr/bin/catppt
FileFilter .xls       /usr/bin/xls2csv

...and using the following command:

swish-e -c swish.conf -S fs

When I execute this to build the index, I get a lot of strange
characters in the output (I should mention that the filesystem I'm trying
to index contains the gamut of Unix & Windows file types), and one
very odd phenomenon: after the index is built, all the PDF files have
been reduced to 1-3 bytes in size. When the swish-e documentation
talks about using filters to convert binary files for indexing, I
assumed "convert" meant "parse", since I don't want to index PDF
files if the original files are lost after indexing!

Does this seem like user error, or a bug in Xpdf's pdftotext, or a bug
in swish-e?

I know I don't get this problem with a more fully-featured config
file, like the following:

# IndexReport controls how detailed the report is while indexing
IndexReport 2

# Directory to index
IndexDir /http/intranet

# What to index
IndexOnly .htm .html .ppt .doc .pdf .ppt .shtml .txt

# Exclude all files of these types
FileRules filename contains \.(gif|jpg|pro|asp|php|png|css|js)$

# Filters - these are extra programs used to decode content
FileFilter .doc       /usr/bin/catdoc
FileFilter .pdf       /usr/local/bin/pdftotext
FileFilter .ppt       /usr/bin/catppt
FileFilter .xls       /usr/bin/xls2csv
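
One possible cause worth checking, offered as an assumption rather than a
confirmed diagnosis: both configs above give FileFilter no options string,
so swish-e decides which arguments the filter receives. pdftotext treats a
second file argument as its output file, so if the filter ends up being
called with the PDF's own path in that position, the PDF would be
overwritten with the extracted text. This would not by itself explain why
the second config appears unaffected. The swish-e configuration
documentation shows an explicit options string for pdftotext, where '%p'
is the path of the file being filtered and "-" tells pdftotext to write to
standard output:

```
# Sketch only: pass the input path explicitly and send the text to stdout,
# so pdftotext never receives an output filename it could overwrite.
FileFilter .pdf       /usr/local/bin/pdftotext   "'%p' -"
```

Trying this against a scratch copy of the directory tree first would show
whether the PDFs survive indexing before it is pointed at /http/intranet.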

Also, to get the index build output recorded, do I have to explicitly
redirect the output of the index command to a file, or does swish-e
keep its own log of problems encountered during indexing? I haven't
seen this covered in the documentation.
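
Whatever the answer on a built-in log, ordinary shell redirection is one
way to keep a record. The following is a minimal sketch, not a swish-e
feature: run_logged and index.log are hypothetical names introduced here,
and it assumes the indexing report and any warnings arrive on standard
output or standard error.

```shell
#!/bin/sh
# Hypothetical helper: run a command and save both stdout and stderr
# to the given log file.
run_logged() {
    log_file=$1
    shift
    "$@" > "$log_file" 2>&1
}

# Example use (commented out so the sketch runs without swish-e installed):
# run_logged index.log swish-e -c swish.conf -S fs
```

Capturing stderr together with stdout keeps filter error messages in the
same file as the IndexReport output, in the order they were written.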

Thanks for any help with this issue!

Greg

Received on Thu Feb 26 17:04:05 2009