Re: NoContents

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Feb 25 2005 - 00:11:54 GMT

On Thu, Feb 24, 2005 at 12:21:07PM -0800, Thomas Angst wrote:
> With your information about the HTML* parser, I changed the 
> DefaultContents to TXT. swish-e is now several times faster. But I get 
> now for each Image a warning. Do you know how I can suppress these 
> warnings and is there any limitation if I'm using TXT for the 
> DefaultContent-Scanner, when I will set all other scanable files to 
> another scan engine?

I guess I'd use locate(1) to index and search for file names.

I'm not sure, but I think NoContents was for indexing only the <title>
of HTML docs, not so much for indexing the names of binary files --
because it doesn't make much sense to read in and try to parse
binary data.

For that to work right, swish would need to look at the file name
before fetching, and if the file isn't HTML*, skip reading it and just
index the name.
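
As a rough illustration of that "check the name first" idea (this is
not how swish-e itself is implemented; the extension list and function
names below are made up for the example), a sketch in Python:

```python
import os

# Assumption: which extensions count as parseable text. Adjust as needed.
TEXT_EXTENSIONS = {".html", ".htm", ".txt"}

def should_read_contents(path):
    """Decide from the file name alone whether the contents are worth parsing."""
    ext = os.path.splitext(path)[1].lower()
    return ext in TEXT_EXTENSIONS

def text_to_index(path):
    """Return the file contents for text files, or just the base name otherwise.

    Binary files are never opened, so no binary data reaches the parser.
    """
    if should_read_contents(path):
        with open(path, "r", errors="replace") as fh:
            return fh.read()
    return os.path.basename(path)
```

With this, a path like the pfeile.bmp file from the warning would
contribute only its file name, and the file itself would never be read.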

And if I was indexing images I think I'd index a description file and
then use ReplaceRules to change the description file to the actual
image name when indexing -- that might make searching for images more
useful.

And if I really only wanted to index the file names, I would use
DirTree.pl and create a text document on the fly for each image
file.
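
DirTree.pl is a Perl program, but the same approach can be sketched in
Python for use with the -S prog input method. This is only a sketch
under assumptions: the header names (Path-Name, Content-Length,
Last-Mtime, Document-Type) are as I recall them from the swish-e prog
documentation and should be checked against your version, and the
extension list and function names are invented for the example.

```python
import os
import sys

# Assumption: extensions to treat as images. Adjust as needed.
IMAGE_EXTENSIONS = {".bmp", ".gif", ".jpg", ".jpeg", ".png"}

def make_record(path, mtime):
    """Build one document for swish-e: headers, a blank line, then the body.

    The body is a tiny text document holding only the file name, so the
    binary image data is never sent to the indexer.
    """
    body = (os.path.basename(path) + "\n").encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"   # length of the body in bytes
        f"Last-Mtime: {int(mtime)}\n"
        "Document-Type: TXT*\n"            # ask for the plain-text parser
        "\n"
    ).encode("utf-8")
    return header + body

def main(root):
    """Walk root and write one record per image file to standard output."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                full = os.path.join(dirpath, name)
                record = make_record(full, os.path.getmtime(full))
                sys.stdout.buffer.write(record)
```

To try it, call main() with the directory to walk at the end of the
script and point swish-e at the script with -S prog. Non-image files
would still be indexed separately in the usual way.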

> Warning: Substituted 677 embedded null character(s) in file 
> '/var/samba/daten/dokumentationen/mozillamailer/pfeile.bmp' with a newline

Swish indexes text -- so sending it a binary file will confuse it.

-- 
Bill Moseley
moseley@hank.org

Received on Thu Feb 24 16:11:56 2005