Re: Getting the right files indexed the right way

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jan 28 2004 - 07:52:00 GMT
On Tue, Jan 27, 2004 at 11:30:06PM -0800, Rob de Santos AFANA wrote:

> IndexDir spider.pl
> 
> NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql
> .csv .dir .idx .dat

> When the indexing runs, swish-e attempts to read and interpret the jpeg
> files rather than simply adding the file path and name to the index as
> indicated in the NoContents directive.

Well, I was going to say that NoContents was not supported when using -S 
prog, but then I remembered I added it.

(This uses a quickly hacked spider.pl that doesn't skip binary files by 
default):

$ ./spider.pl default http://localhost/apache/finger.jpg > x
./spider.pl: Reading parameters from 'default'

Summary for: http://localhost/apache/finger.jpg
Total Bytes: 19,645  (19645.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)

$ swish-e -S prog -i stdin < x | grep 'words indexed'
2,217 unique words indexed.

$ cat c
NoContents .jpg

$ swish-e -S prog -i stdin -c c < x | grep 'words indexed'
5 unique words indexed.

$ swish-e -w not dkdk
# SWISH format: 2.4.1
# Search words: not dkdk
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.042 seconds
1000 http://localhost/apache/finger.jpg "finger.jpg" 19645
.

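So it does work: with NoContents the index drops from 2,217 unique words 
to 5, and the file is still found by its path.  But what I would 
recommend instead is that, in the filter_content() function of your 
spider config file, when you see those extensions you do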

     $$content_ref = $uri;
     return 1;

So you are just replacing the content with the path.  That also avoids 
sending all that data on to swish-e, which would just discard it.
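
For illustration, here is a minimal sketch of how that might look in a 
spider config file.  The extension list, base_url and email values are 
placeholders, and the callback arguments (URI object, server hash, 
response object, content reference) are assumed from the spider.pl 
documentation, so check them against your version before relying on it.

```perl
# Sketch of a spider.pl config file; the values below are placeholders.

# Extensions whose bodies should not be indexed (an illustrative subset
# of the NoContents list).
my %skip_body = map { $_ => 1 } qw( gif jpg png ico jar class );

@servers = (
    {
        base_url       => 'http://localhost/apache/',   # placeholder
        email          => 'webmaster@localhost',        # placeholder
        filter_content => \&filter_content,
    },
);

# Assumed callback arguments: URI object, server hash, HTTP::Response
# object, and a reference to the fetched content.
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Take the extension from the URL path, ignoring any query string.
    my ($ext) = $uri->path =~ /\.([^.\/]+)$/;

    if ( defined $ext && $skip_body{ lc $ext } ) {
        $$content_ref = "$uri";   # index only the URL, not the binary data
    }

    return 1;   # true keeps the document; false would skip it
}

1;
```

With something like this in place the NoContents line in the swish-e 
config should no longer be needed for those extensions, since only the 
URL ever reaches the indexer.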

Hope that helps.


-- 
Bill Moseley
moseley@hank.org
Received on Tue Jan 27 23:52:00 2004