SWISH-E Version 1.2 Released

From: Roy Tennant <rtennant(at)not-real.library.berkeley.edu>
Date: Wed Oct 07 1998 - 18:01:33 GMT
No doubt by now, thanks to my delay in getting the word out, everyone is
aware that there is a new version of SWISH-Enhanced, version 1.2. But you
may not be aware of the new features. I am appending the release notes
below (they are also on the Web site), but here are the highlights first.
Everything is available at:

http://sunsite.Berkeley.EDU/SWISH-E/

* Thanks to Ron Klatchko at UCSF, SWISH-E 1.2 has crawling support. This
will allow users to index a group of servers, for example. A hint: if you
don't want it to take a long time, pay attention to the fetching interval.
* Thanks to Mark Gaulin, the source files are now both Unix and Windows NT
ready. A small addition at compile time is all that is required. 
* Many more variables can now be set for individual indexes, since they
have been moved to the configuration file rather than config.h. However,
if you choose not to set them individually, the defaults will apply. The
parameters in question include MinWordLimit,
MaxWordLimit, WordCharacters, BeginCharacters, EndCharacters,
IgnoreLastChar, and IgnoreFirstChar.
* All the strings for the "ReplaceRule [replace|remove]" and for all the
"FileRules" settings can now use regular expressions.
* We brought the source files into compliance with the GNU General Public
License.

Also, thanks to Giulia Hill, our programmer, the SWISH-E Manual has had a
makeover. It is now easier to use and understand, and has also been
brought up-to-date.

Finally, we will no longer be distributing compiled binaries of the base
version (we will, however, accept contributions of compiled binaries for
ports). Since there are a couple of decisions that must be made at compile
time (such as whether it will be a crawling or filesystem version), we
feel this to be a better distribution method.
Roy Tennant
Giulia Hill
SWISH-E Managers

What's new in SWISH-E 1.2

Use of "ReplaceRule remove" to remove part of a
string in the filename. 
       Previously it was not possible to replace part of a string
       in the path of a file name with nothing. The solution has
       been to add a new rule that removes the part of the string
       matched by the regex used in the rule.
Use of regular expressions for "ReplaceRule
replace" and all the "FileRules" 
       All the strings for the "ReplaceRule [replace|remove]"
       and for all the "FileRules" settings can now use
       regular expressions as defined in the regex.h C
       library.
Switch for comment indexing 
       In the user configuration file there is a new variable
       IndexComments; the default is 1, and it can be set to 0
       for skipping comments during indexing. 
More config variables in the user configuration file
from the config.h 
       Several configuration variables are now available in
       the user config files so that indexing can be better
       customized by the end user of SWISH-E. This change
       should be particularly useful for sites that have one
       centralized version of SWISH-E which is accessed by
       several users. It is not necessary for the user to define
       all the available variables since commenting out the
       respective line will automatically retrieve the default
       from the config.h file. The config variables newly
       available to the user are: MinWordLimit,
       MaxWordLimit, WordCharacters, BeginCharacters,
       EndCharacters, IgnoreLastChar, and IgnoreFirstChar.
       For directions on how to use them, see the user
       configuration file example at the end.
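
As a rough illustration of the "ReplaceRule remove" and "ReplaceRule
replace" behaviour described above, here is a minimal Python sketch. It
is not the SWISH-E source (which is C and uses regex.h); the function
name apply_replace_rules and the tuple format for rules are made up for
this example, Python's re syntax differs in places from POSIX regex, and
whether SWISH-E rewrites every match or only the first is an assumption
here (this sketch rewrites every match).

```python
import re

def apply_replace_rules(path, rules):
    """Apply hypothetical ReplaceRules-style rewrites to a file path.

    Each rule is ("remove", pattern) or ("replace", pattern, replacement).
    """
    for rule in rules:
        kind, pattern = rule[0], rule[1]
        if kind == "remove":
            # "remove" deletes whatever the regex matches.
            path = re.sub(pattern, "", path)
        elif kind == "replace":
            replacement = rule[2]
            # A function replacement keeps the replacement text literal.
            path = re.sub(pattern, lambda _match: replacement, path)
        else:
            raise ValueError("unknown rule type: %r" % kind)
    return path

# The two patterns are taken from the sample configuration file;
# the input path is invented for the example.
rules = [
    ("remove", r"ghill/"),
    ("replace", r"[a-z_0-9]*_m.*\.html", "index.html"),
]
print(apply_replace_rules("/home/ghill/docs/page_m1.html", rules))
# prints /home/docs/index.html
```

Keeping "remove" as its own rule type mirrors the release note: an empty
replacement string never has to be passed through the replace path.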

          From our contributors

Spidering or HTTP vs. FILESYSTEM methods by
Ron Klatchko 
       You are no longer limited to indexing files on the
       same machine, but can follow links through web
       pages. The method is chosen at compile time, but the
       indexes can be searched by a binary compiled for
       either method. See the documentation for more
       details. A few configuration variables have been added
       to the user's conf file, of which a sample follows
       below.
Win32 Compatibility by Mark Gaulin 
       Like spidering, this is another much-awaited feature.
       No more porting to be done, just add #define _WIN32 1
Skipping the rest of the words in the index if word
not found - after a suggestion by Jeremy Ellman 
       In previous versions of SWISH-E, a search for a word
       would look at every indexed word starting with the same
       character as the search word. Since the words are in
       alphabetical order, the search can stop as soon as it
       passes the point where the word would have been. This
       change should, on average, speed up searches quite a
       bit.
Ignore Stopwords in Query - by Jeff Morrow 
       There is a new configuration variable in the config.h
       file which allows stopwords in a query to be ignored.
       The problem was that a query containing a stopword,
       when the default rule is AND, returned no results.
       In config.h:
       #define IGNORE_STOPWORDS_IN_QUERY 1 
       /* Added JM 1/10/98. Setting this to 0 (default) causes
       a stopword in an AND_RULE search to create an
       empty result. Setting it to 1 simply ignores the
       stopwords and does a search on the remaining words.
       */ 
Full ISO 8859 character set support - by Lars L.
Madsen 
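
Two of the search-side changes above (stopping early in the
alphabetical word list, and ignoring stopwords in an AND query) can be
sketched in a few lines of Python. This is only an illustration of the
ideas, not the SWISH-E code: the function names, the in-memory list and
the dict of postings are all assumptions made for the example.

```python
def find_word(sorted_words, target):
    """Scan an alphabetically sorted word list, stopping early.

    Returns (found, number_of_words_checked).
    """
    checked = 0
    for word in sorted_words:
        checked += 1
        if word == target:
            return True, checked
        if word > target:
            # Every later word also sorts after target, so stop here.
            return False, checked
    return False, checked


def and_search(query_words, postings, stopwords, ignore_stopwords):
    """Return the set of files containing every query word (AND rule).

    postings maps word -> set of file names; stopwords are assumed to be
    absent from the index, so they match no files.
    """
    if ignore_stopwords:
        query_words = [w for w in query_words if w not in stopwords]
    if not query_words:
        return set()
    result = None
    for word in query_words:
        files = set(postings.get(word, ()))
        result = files if result is None else result & files
    return result


words = ["apple", "banana", "cherry", "date", "fig"]
print(find_word(words, "blueberry"))  # (False, 3): stops at "cherry"

postings = {"swish": {"a.html", "b.html"}, "index": {"b.html", "c.html"}}
stop = {"the"}
print(and_search(["the", "swish", "index"], postings, stop, False))  # set()
print(and_search(["the", "swish", "index"], postings, stop, True))   # {'b.html'}
```

The last two calls correspond to IGNORE_STOPWORDS_IN_QUERY set to 0 and
to 1 respectively.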

USER CONFIGURATION FILE EXAMPLE
###################################################
# DIRECTIVES COMMON to  HTTP and FILESYSTEM METHODS
###################################################

IndexDir /data/_b/safeweb/swish/dir1/records
# For the FileSystem Method:
#   This is a space-separated list of files and
#   directories you want indexed. You can specify
#   more than one of these directives.
# For the HTTP Method:
#   Use the URLs from which you want the spidering
#   to begin.

IndexFile /data/_b/safeweb/swish/dir1/myindex1
# This is what the generated index file will be.

IndexName "Improvement index"
IndexDescription "This is an index to test bug fixes in swish." 
IndexPointer "http://sunsite/~ghill/swish/index.html"
IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
# Extra information you can include in the index file.

MetaNames first author
# List of all the META names used in the file to index, must be on one
# line.
# If no metanames DO NOT delete the line.

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".

ReplaceRules remove "ghill/"
ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
#ReplaceRules replace "/ghill" "moreghillmore"
# ReplaceRules allows you to make changes to file pathnames
# before they're indexed. This directive uses C library
# regex.h regular expressions.
# NOTE: do not use 'replace <string> ""' to remove a string,
# use 'remove <string>' instead - you might get a core dump otherwise.

#MinWordLimit 5
# Set the minimum length of an indexable word. Every shorter word
# will not be indexed.
# Commenting out the line will give the defaults

#MaxWordLimit 5
# Set the maximum length of an indexable word. Every longer word
# will not be indexed.
# Commenting out the line will give the defaults

#WordCharacters abcdefghijklmnopqrstuvwxyz\?0123456789.@|,-'"[](~!@$%^{}_+?
# WORDCHARS is a string of characters which SWISH permits to
# be in words. Any strings which do not include these characters
# will not be indexed. You can choose from any character in
# the following string:
#
# abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
#
# Note that if you omit "0123456789?" you will not be able to
# index HTML entities. DO NOT use the asterisk (*), less-than
# and greater-than signs (<), (>), or colon (:).
#
# Including any of these four characters may cause funny things to happen.
# NOTE: Do not escape \ nor " and they cannot be the first letter in the
# string. Commenting out the line will give the defaults

#BeginCharacters m"
# Of the characters that you decide can go into words, this is
# a list of characters that words can begin with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters

#EndCharacters \"\
# Of the characters that you decide can go into words, this is
# a list of characters that words can end with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters

IgnoreLastChar 
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the end. It is important to also
# set the given char's in the ENDCHARS array, otherwise the word will not
# be indexed because it will be considered invalid.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise

IgnoreFirstChar 
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the beginning. This was to solve
# the problem of parentheses when there is no space between ( and the
# beginning of the word.
# Remember to add the char's to the BEGINCHARS list also.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise

IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.

#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.

IndexComments 0
# This option lets the user decide whether to index the comments in the
# files. The default is 1. Set to 0 if comment indexing is not required.

##################################
# DIRECTIVES for FILESYSTEMS ONLY 
# Comment out if using HTTP
###################################

#IndexOnly .html .q
# Only files with these suffixes will be indexed.

NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

#FileRules pathname contains .*dir1
#FileRules filename contains # % ~ .bak .orig .old old. 1
#FileRules title contains construction example pointers
#FileRules directory contains .htaccess
#FileRules filename is index
# Files matching the above criteria will *not* be indexed.
# The pattern matching uses the C library regex.h 

################################
# DIRECTIVES for HTTP METHOD ONLY
# Comment out if using FILESYSTEM
##################################

#MaxDepth 5
#(default 5)  This defines how many links the spider should
#follow before stopping.  A value of 0 configures the spider to
#traverse all links

#Delay 60
#(default 60)  The number of seconds to wait between issuing
#requests to a server.

#TmpDir /home/ghill/swishRon/
#(default /var/tmp)  The location of a writeable temp directory
#on your system.  The HTTP access method tells the Perl helper to place
#its files there.

#SpiderDirectory /home/ghill/swishRon/src/
#(default ./)  The location of the Perl helper
#script.  Remember, if you use a relative directory, it is relative to
#your directory when you run SWISH-E, not to the directory that SWISH-E
#is in.

#EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
#(default nothing)  This allows you to deal with
#servers that respond to multiple DNS names.  Each line should have
#a list of all the method/names that should be considered equivalent. 
#If you have multiple directives, each one defines its own set of 
#equivalent servers.
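
For readers unsure how the two numbers in the IgnoreLimit directive
combine, here is a small Python sketch of the rule as the comment in the
sample file describes it: a word becomes a stopword only if it occurs in
more than the given percentage of files AND in more than the given
number of files. The function name, the dict of per-word file counts and
the strict "greater than" comparisons are assumptions for illustration,
not taken from the SWISH-E source.

```python
def auto_stopwords(file_counts, total_files, percent, min_files):
    """Return words that exceed both IgnoreLimit-style thresholds.

    file_counts maps word -> number of files containing that word.
    """
    return {
        word
        for word, count in file_counts.items()
        # Integer arithmetic avoids rounding issues with percentages.
        if count * 100 > percent * total_files and count > min_files
    }

# Invented counts for a 2000-file collection, using "IgnoreLimit 50 1000".
counts = {"the": 1900, "swish": 300, "rare": 5}
print(auto_stopwords(counts, 2000, 50, 1000))  # {'the'}
```

With a small collection the second number dominates: in a 100-file index
no word can appear in over 1000 files, so nothing would be dropped.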
Received on Wed Oct 7 11:08:42 1998