David,
I had a similar situation. Because some of our sites are dynamic in
nature, we chose to go with spidering. However, I found the
documentation on setting up spidering a little confusing (there is
a lot of it, but it is ordered a little oddly). I think what the
documentation could use is a Spidering Getting Started Guide. The way
the documentation is right now, it's kind of like piecing together a
puzzle.
Here's what I did to spider all the sites I needed:
First, create a swish.conf file:
# Example for spidering
# Use the "spider.pl" program included with Swish-e
IndexDir spider.pl
# Allow extra searching by title, path
Metanames swishtitle swishdocpath
# Only index .html, .htm, and .txt files
IndexOnly .html .htm .txt
# Set StoreDescription for each parser
# to display context with search results
StoreDescription TXT* 10000
StoreDescription HTML* <body> 10000
# Define what site to index
SwishProgParameters ./spider.conf
Second, create a spider.conf file. See the attached file (spider.conf)
for a sample that contains some sane defaults.
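For anyone reading without the attachment, a minimal spider.conf might look roughly like the sketch below. This is not the attached file. The URL, email address, agent string, and limits are placeholders, and the option names are written as I recall them from the spider.pl documentation, so check them against `perldoc spider.pl` for your Swish-e version.

```perl
# spider.conf -- a hypothetical sketch, not the file attached to this post.
# spider.pl loads this file as Perl and reads the @servers array from it.
@servers = (
    {
        # Starting point for the crawl (placeholder URL).
        base_url  => 'http://www.example.edu/',

        # Contact address and agent string sent with requests (placeholders).
        email     => 'webmaster@example.edu',
        agent     => 'swish-e spider',

        # Illustrative limits, not recommendations.
        delay_sec => 1,        # seconds to wait between requests
        max_time  => 60,       # minutes before the spider gives up
        max_files => 10000,    # stop after fetching this many files

        # Skip common image files. $uri is a URI object.
        test_url  => sub {
            my ($uri, $server) = @_;
            return $uri->path !~ /\.(?:gif|jpe?g|png)$/i;
        },
    },
);

# The file must return a true value when spider.pl loads it.
1;
```

Since the spider follows links from base_url, pages such as www.example.edu/dept1 should be picked up as long as they are linked from the main site and live on the same host.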
Now, run the command: swish-e -S prog -c swish.conf -v2
Swish-e reads swish.conf, runs spider.pl, and passes it spider.conf as
its parameter. The way I've got everything set up assumes you've got the
proper filters installed for doc, xls, and pdf files.
I hope this helps.
Chris Shaffer
-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of David Nickel
Sent: Friday, August 27, 2004 1:16 PM
To: Multiple recipients of list
Subject: [SWISH-E] Indexing University Site
We are trying to set up swish-e to index our university's web server. I
am having trouble creating a config file that indexes all of our sites. We
have a main page and underneath we have pages for official departments.
example: www.example.edu
www.example.edu/dept1
www.example.edu/dept2
Should the IndexDir be set to www.example.edu or /path/to/web/root?
Any help would be much appreciated. Thanks
David
*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Sun Aug 29 13:17:32 2004