Bill,
I tried your suggestion and ran this:
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>perl swishspider . http://www.lib.berkeley.edu/~ghill/spider.html
----------------------------------------------------------------------------------------------------------------------------------------
This created three files (contents, links, response), and they contain the
same data you show below.
I then ran swish-e with the simple config file, like this:
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>swish-e -S http -c c
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...
Removing very common words... no words removed.
Writing main index... no unique words indexed.
Writing file index... no files indexed.
Running time: 1 second.
Indexing done!
----------------------------------------------------------------------------------------------------------------------------------------
That is not the result I was looking for.
I even tried this, to create an index from the specified URL:
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>swish-e -S http -i http://www.lib.berkeley.edu/~ghill/spider.html
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...
Removing very common words... no words removed.
Writing main index... no unique words indexed.
Writing file index... no files indexed.
Running time: 1 minute.
Indexing done!
----------------------------------------------------------------------------------------------------------------------------------------
These results have me extremely frustrated. I'm about ready to switch to
a commercial product or hosting service.
You guys have been a great support, but the product is not easy to
configure.
Bill Moseley <moseley@hank.org>
Sent by: swish-e@sunsite.berkeley.edu
04/26/01 03:59 PM
Please respond to moseley
To: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
cc:
Subject: [SWISH-E] Re: no files being indexed using http,,,,,
At 12:30 PM 04/26/01 -0700, Kevin.Fay@CommerceQuest.com wrote:
> I changed the declaration path in swishspider.pl and have it pointed to
> my perl executable.
> Just tried to run swish-e again and...
> C:\jakarta-tomcat-3.2\bin>swish-e -S http -c ../src/user.config
> Indexing Data Source: "HTTP-Crawler"
> retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...
> Removing very common words... no words removed.
> Writing main index... no unique words indexed.
> Writing file index... no files indexed.
> Running time: 1 minute.
> Indexing done!
Break the problem down into smaller parts.
~/swish-e/src > rm ..contents ..links ..response
~/swish-e/src > perl swishspider .
http://www.lib.berkeley.edu/~ghill/spider.html
(the above is all one line)
~/swish-e/src > cat ..contents
<html>
<HEAD>
InHead
<TITLE>InTitleAndHead</Title>
<BODY>
InBody
<H1>InHeaderAndBody</H1>
<a href="spider2.html">
link to spider2 on library
</BODY>
</HTML>
~/swish-e/src > cat ..response
200
text/html
Then on Windows:
C:\swish204\src>perl swishspider .
http://www.lib.berkeley.edu/~ghill/spider.html
C:\swish204\src>type ..contents
<html>
<HEAD>
InHead
<TITLE>InTitleAndHead</Title>
<BODY>
InBody
<H1>InHeaderAndBody</H1>
<a href="spider2.html">
link to spider2 on library
</BODY>
</HTML>
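The manual check above (run swishspider, then look at the files it writes) could also be scripted. What follows is only a rough sketch in Python, not part of swish-e. It assumes the file layout shown in the transcripts: the spider writes "<prefix>.response" with the HTTP status on the first line and the content type on the second, and "<prefix>.contents" with the page body. The function name check_spider_output and the rule that the content type should start with "text/" are made up for illustration.

```python
from pathlib import Path


def check_spider_output(prefix: str = ".") -> list[str]:
    """Return a list of problems found in swishspider's output files.

    Hypothetical helper: assumes the spider wrote <prefix>.response
    (status line, then content type) and <prefix>.contents, as in the
    transcripts above. An empty list means nothing looked wrong.
    """
    problems = []
    response = Path(prefix + ".response")
    contents = Path(prefix + ".contents")

    if not response.is_file():
        problems.append(f"missing response file: {response}")
    else:
        lines = response.read_text().splitlines()
        status = lines[0].strip() if lines else ""
        ctype = lines[1].strip() if len(lines) > 1 else ""
        if status != "200":
            problems.append(f"unexpected status {status!r} in {response}")
        # Assumption for this sketch: only text/* documents are of interest.
        if not ctype.startswith("text/"):
            problems.append(f"unexpected content type {ctype!r} in {response}")

    if not contents.is_file() or contents.stat().st_size == 0:
        problems.append(f"missing or empty contents file: {contents}")

    return problems
```

After running "perl swishspider . <url>" in the current directory, calling check_spider_output(".") would look at ..response and ..contents. If it returns an empty list, the spider side is probably fine and the problem is more likely in how swish-e is invoked or configured.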
Does that work for you? If so, then the problem may be your config file.
Try a simple config file:
~/swish-e/src > cat c
IndexDir http://www.lib.berkeley.edu/~ghill/spider.html
MaxDepth 10
Delay 1
TmpDir .
~/swish-e/src > ./swish-e -S http -c c
Indexing Data Source: "HTTP-Crawler"
Indexing http://www.lib.berkeley.edu/~ghill/spider.html..
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...
- Using DEFAULT filter - (9 words)
retrieving http://www.lib.berkeley.edu/~ghill/spider2.html (1)...
- Using DEFAULT filter - (4 words)
10 unique words indexed.
2 files indexed.
Running time: 2 seconds.
Indexing done!
Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 26 21:42:20 2001