FileMatch problem with latest version

From: Phil Glatz <phil(at)not-real.glatz.com>
Date: Mon Oct 06 2003 - 21:52:06 GMT
I'm running 2.4.0-pr3 on a FreeBSD 5 box.  I'm new to SWISH-E (but really 
like what I've seen so far), and am having trouble filtering the files I 
want to index.

The directory I'm indexing contains the files I want included, named like 
99999.html: a series of digits followed by ".html".  It also contains 
files with numeric names ending in different "extension" strings, and 
files whose names start with alphabetic characters.

My config file consists of one directive:
FileMatch filename contains ^[0-9]+\.html$


I'm calling the indexer with the command:
swish-e -c site.conf -f abilene.index -e -v 2 -i /www/abilene -T regex

If I grep the output with the strings 37154 and Real_Estate, I see the 
following:
File[Rules|Match] filename match 37154.password =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.email =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.html =~ m[^[0-9]+\.html$] : matched
File[Rules|Match] filename match 37154.msgs =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.old =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.renew =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.views =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match Real_Estate.html =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match Real_Estate.links =~ m[^[0-9]+\.html$] : nope

This leads me to think that the only file from this group added to the 
index is 37154.html.
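
As a sanity check outside swish-e, the same pattern can be run against the file names from the debug output. This is only a sketch: it assumes Python's `re` engine treats this simple pattern the same way as the regex library swish-e is built with, and the helper name `matching` is made up for illustration.

```python
import re

# Same pattern as in the FileMatch directive above.
PATTERN = re.compile(r"^[0-9]+\.html$")

# File names copied from the debug output above.
NAMES = [
    "37154.password", "37154.email", "37154.html", "37154.msgs",
    "37154.old", "37154.renew", "37154.views",
    "Real_Estate.html", "Real_Estate.links",
]

def matching(names):
    """Return the names that the pattern matches (hypothetical helper)."""
    return [n for n in names if PATTERN.search(n)]

print(matching(NAMES))  # prints ['37154.html']
```

This agrees with the "matched"/"nope" lines in the debug output, so the pattern itself appears to behave as intended.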

But when I run the search command
swish-e -c /usr/local/swish/site.conf -f /usr/local/swish/index/abilene.index -w "house"

my output is
# SWISH format: 2.4.0-pr3
# Search words: house
# Removed stopwords:
# Number of hits: 7
# Search time: 0.000 seconds
# Run time: 0.016 seconds
1000 /www/abilene/Real_Estate.html "Abilene, TX Real Estate Free Classifieds" 2271
431 /www/abilene/41094.html "Turn Your Yearly Income into Your Monthly Income!" 2453
431 /www/abilene/41341.html "Chile Land ForSale" 2649
431 /www/abilene/31133.html "Scotland for Great Family Vacations" 3047
431 /www/abilene/40605.html "antique iron pot" 2118
431 /www/abilene/37985.html "Clerical, word processors needed!" 2070
431 /www/abilene/40812.html "Scotland National Park" 2823

I'm wondering why Real_Estate.html, which shouldn't be in the index, is 
coming up as a hit.

I also tried -w "a" as the search query, and got:
1000 /www/abilene/41089.gif "41089.gif" 46096
852 /www/abilene/41378.gif "41378.gif" 48146
841 /www/abilene/41377.jpg "41377.jpg" 81223
824 /www/abilene/41128.jpg "41128.jpg" 27751
818 /www/abilene/39308.jpg "39308.jpg" 54406
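
To list every hit that should not be in the index, the search output can be filtered against the same pattern. The sketch below assumes each result line has the layout shown above (rank, path, quoted title, size) and that paths contain no spaces; `unexpected_hits` and `SAMPLE` are illustrative names, not part of swish-e.

```python
import posixpath
import re

# Same pattern as in the FileMatch directive.
PATTERN = re.compile(r"^[0-9]+\.html$")

def unexpected_hits(result_lines):
    """Return paths from result lines whose file name fails the pattern."""
    bad = []
    for line in result_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and "# ..." header lines
        # Assumed layout: rank, path, quoted title, size (no spaces in path).
        path = line.split()[1]
        if not PATTERN.search(posixpath.basename(path)):
            bad.append(path)
    return bad

# A few lines taken from the two search outputs above.
SAMPLE = [
    '# Number of hits: 7',
    '1000 /www/abilene/Real_Estate.html "Abilene, TX Real Estate Free Classifieds" 2271',
    '431 /www/abilene/41341.html "Chile Land ForSale" 2649',
    '1000 /www/abilene/41089.gif "41089.gif" 46096',
]

print(unexpected_hits(SAMPLE))
```

On these sample lines it flags Real_Estate.html and 41089.gif, the two kinds of hit the pattern was expected to keep out.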


After trying all sorts of configuration options, I'm wondering if it's just 
my ignorance, or perhaps there is something in 2.4.0-pr3 that is causing 
the problem.  Any suggestions would be greatly appreciated.
Received on Mon Oct 6 21:52:09 2003