On Wed, Jun 22, 2005 at 05:05:53PM -0400, Revillini, James wrote:
> RTF's are killing it now. As soon as it runs into one, the output file
> from dirtree.pl goes like this:
By the way, this is all in the docs, but here's a quick executive
summary:
DirTree.pl finds files and then passes each file name to the
SWISH::Filter module.
SWISH::Filter uses MIME::Types to look up the MIME type of the file.
Then all the available SWISH::Filter modules are scanned for a regular
expression that matches the file's MIME type. When a match is found,
that filter is used, and the filter changes the content type to
something else (like text/plain or text/html).
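
The lookup-and-dispatch steps above can be sketched roughly as follows. This is only an illustration of the idea, not the actual SWISH::Filter code: the extension map stands in for MIME::Types, and the filter table, filter names, and helper subs (mime_type_for, filter_for, resulting_type) are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical extension-to-MIME map, standing in for MIME::Types.
my %mime_for_ext = (
    doc  => 'application/msword',
    pdf  => 'application/pdf',
    rtf  => 'application/rtf',
    html => 'text/html',
    txt  => 'text/plain',
);

# Hypothetical filter table: each entry pairs a regular expression over
# MIME types with the content type that filter would produce.
my @filters = (
    { name => 'word_to_text', match => qr{^application/msword$}, output => 'text/plain' },
    { name => 'pdf_to_html',  match => qr{^application/pdf$},    output => 'text/html'  },
);

# Guess a MIME type from the file extension.
sub mime_type_for {
    my ($path) = @_;
    my ($ext) = $path =~ /\.([^.\/\\]+)$/;
    return 'application/octet-stream' unless defined $ext;
    return $mime_for_ext{ lc $ext } // 'application/octet-stream';
}

# Scan the filters for the first regex that matches the MIME type.
sub filter_for {
    my ($type) = @_;
    for my $f (@filters) {
        return $f if $type =~ $f->{match};
    }
    return;
}

# Content type after filtering; unchanged when no filter matches.
sub resulting_type {
    my ($path) = @_;
    my $type   = mime_type_for($path);
    my $filter = filter_for($type);
    return $filter ? $filter->{output} : $type;
}

print "$_ => ", resulting_type($_), "\n" for qw(report.doc notes.txt memo.rtf);
```

In this made-up table nothing matches application/rtf, so memo.rtf keeps its original type; a file like that would still look "binary" to whatever runs after the filter step.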
The individual filters normally need helper programs, like catdoc, to
be installed before they will work. The Swish-e distribution for Windows
includes catdoc, IIRC.
When SWISH::Filter is done, DirTree.pl then skips any files that are
"binary", which only means they are not of some kind of text/* type.
Really, it should index a file only if its type is text/xml,
text/plain, or text/html, as that's all swish can index. After all,
there are a lot of other text types:
    $ fgrep 'text/' /etc/mime.types | wc -l
    62
You might want to add that test into DirTree.pl -- check for only
those three mime types:
    # Skip anything that is not one of the three types swish can index.
    unless ( $doc->content_type =~ m!^text/(?:plain|xml|html)$! ) {
        warn "Can't index $path because it's " . $doc->content_type . "\n";
        return;
    }
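
To try the pattern on its own, outside DirTree.pl, something like the following should do. The is_indexable sub is a hypothetical helper for illustration, and it assumes the content type is a bare type/subtype string with no parameters such as "; charset=..." appended.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for the three types swish can index.
# Assumes a bare "type/subtype" string with no trailing parameters.
sub is_indexable {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m!^text/(?:plain|xml|html)$! ? 1 : 0;
}

for my $type (qw(text/plain text/html text/xml text/css application/rtf)) {
    printf "%-16s %s\n", $type, is_indexable($type) ? 'index' : 'skip';
}
```

Note that text/css is a text/* type yet is still skipped, which is the point of testing for the three types explicitly.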
Anyway, that's how it all works.
--
Bill Moseley
moseley@hank.org
Received on Wed Jun 22 14:46:38 2005