Re: Getting the right files indexed the right way

From: Bill Moseley <moseley(at)>
Date: Wed Jan 28 2004 - 07:52:00 GMT
On Tue, Jan 27, 2004 at 11:30:06PM -0800, Rob de Santos AFANA wrote:

> IndexDir
> NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql
> .csv .dir .idx .dat

> When the indexing runs, swish-e attempts to read and interpret the jpeg
> files rather than simply adding the file path and name to the index as
> indicated in the NoContent directive.

Well, I was going to say that NoContents was not supported when using -S 
prog, but then I remembered I added it.

(This is with a quickly hacked spider that doesn't skip binary files by default.)

$ ./ default http://localhost/apache/finger.jpg > x
./ Reading parameters from 'default'

Summary for: http://localhost/apache/finger.jpg
Total Bytes: 19,645  (19645.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)

$ swish-e -S prog -i stdin < x | grep 'words indexed'
2,217 unique words indexed.

$ cat c
NoContents .jpg

$ swish-e -S prog -i stdin -c c < x | grep 'words indexed'
5 unique words indexed.

$ swish-e -w not dkdk
# SWISH format: 2.4.1
# Search words: not dkdk
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.042 seconds
1000 http://localhost/apache/finger.jpg "finger.jpg" 19645

So it does work.  But what I would recommend is that, in the 
filter_content() function of your spider config file, you do this when 
you see those extensions:

     $$content_ref = $uri;
     return 1;

So you are just replacing the content with the path.  That also avoids 
sending all that data on to swish, where it would just be discarded.
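
For what it's worth, here is a rough sketch of how that might look as a whole function.  It assumes the callback is passed ($uri, $server, $response, $content_ref) and that a true return value means "go ahead and index"; check both against your copy of the spider.  The extension list and the $no_content_re name are only for illustration, borrowed from the NoContents line quoted above.

```perl
use strict;
use warnings;

# Hypothetical list of extensions whose content should be replaced by the
# path.  Taken from the quoted NoContents line; adjust to suit your site.
my $no_content_re = qr/\.(?:gif|jpg|png|ico|jar|class)$/i;

# Assumed callback signature: ($uri, $server, $response, $content_ref).
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # "$uri" stringifies a URI object; a plain string passes through as-is.
    if ( "$uri" =~ $no_content_re ) {
        $$content_ref = "$uri";    # index only the path, not the binary data
    }

    return 1;    # true: still hand the document on to swish
}
```

Note that the pattern only matches when the extension is at the very end of the URL, so something like finger.jpg?size=small would slip through unchanged.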

Hope that helps.

Bill Moseley
Received on Tue Jan 27 23:52:00 2004