
Re: no files being indexed using http,,,,,

From: <Kevin.Fay(at)not-real.CommerceQuest.com>
Date: Thu Apr 26 2001 - 21:41:13 GMT
Bill,

I tried your suggestion and ran this:
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>perl swishspider . http://www.lib.berkeley.edu/~ghill/spider.html
----------------------------------------------------------------------------------------------------------------------------------------

Doing this created 3 files (contents, links, response). I am getting the 
same data you mention below.
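
As a side note, the go/no-go decision that follows from those three files can be sketched like this. This is a hypothetical re-creation, not swish-e's actual code; the two-line layout of the response file (status on line 1, content type on line 2) is taken from Bill's `cat ..response` output below, and `response.sample` is just a stand-in filename:

```shell
# Write a sample of what swishspider leaves in its response file after a
# good fetch: HTTP status on line 1, content type on line 2.
printf '200\ntext/html\n' > response.sample

# Read both lines back and apply the checks a crawler would need to pass
# before it bothers parsing the page contents: a 200 status and a text/*
# content type.
{ read -r status; read -r ctype; } < response.sample
if [ "$status" = "200" ] && [ "${ctype%%/*}" = "text" ]; then
    echo "response OK: would go on to parse the page contents"
else
    echo "response bad: status=$status type=$ctype (nothing indexed)"
fi
```

If the manual swishspider run produces a 200 and text/html (as yours did), the failure is happening somewhere after the fetch.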

I tried running swish-e with the simple config file like this:
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>swish-e -S http -c c
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...

Removing very common words... no words removed.
Writing main index... no unique words indexed.
Writing file index... no files indexed.
Running time: 1 second.
Indexing done!
----------------------------------------------------------------------------------------------------------------------------------------

Not the result I was looking for.

I even tried this (creating an index directly from the specified URL):
----------------------------------------------------------------------------------------------------------------------------------------
C:\jakarta-tomcat-3.2\bin>swish-e -S http -i http://www.lib.berkeley.edu/~ghill/spider.html
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...

Removing very common words... no words removed.
Writing main index... no unique words indexed.
Writing file index... no files indexed.
Running time: 1 minute.
Indexing done!
----------------------------------------------------------------------------------------------------------------------------------------

These results have me extremely frustrated. I'm about ready to give up and use 
a commercial product or a hosting service.
You have been a great support, but the product is not easy to 
configure.


Bill Moseley <moseley@hank.org>
Sent by: swish-e@sunsite.berkeley.edu
04/26/01 03:59 PM
Please respond to moseley

 
        To:     Multiple recipients of list <swish-e@sunsite.berkeley.edu>
        cc: 
        Subject:        [SWISH-E] Re: no files being indexed using http,,,,,


At 12:30 PM 04/26/01 -0700, Kevin.Fay@CommerceQuest.com wrote: 

>>>>
I changed the declaration path in swishspider.pl and have it pointed to
my perl executable.

Just tried to run swish-e again and.............

C:\jakarta-tomcat-3.2\bin>swish-e -S http -c ../src/user.config
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...

Removing very common words... no words removed.
Writing main index... no unique words indexed.
Writing file index... no files indexed.
Running time: 1 minute.
Indexing done!
<<<<


Break the problem down into smaller parts.


~/swish-e/src > rm ..contents ..links ..response

~/swish-e/src > perl swishspider . http://www.lib.berkeley.edu/~ghill/spider.html


~/swish-e/src > cat ..contents

<html>
<HEAD>
InHead
<TITLE>InTitleAndHead</Title>

<BODY>
InBody
<H1>InHeaderAndBody</H1>
   <a href="spider2.html">
   link to spider2 on library
</BODY>
</HTML>

~/swish-e/src > cat ..response
200
text/html


Then on Windows:


C:\swish204\src>perl swishspider . http://www.lib.berkeley.edu/~ghill/spider.html

C:\swish204\src>type ..contents
<html>
<HEAD>
InHead
<TITLE>InTitleAndHead</Title>

<BODY>
InBody
<H1>InHeaderAndBody</H1>
   <a href="spider2.html">
   link to spider2 on library
</BODY>
</HTML>


Does that work for you?  If so, then maybe it's your config file.  Try a
simple config file:


~/swish-e/src > cat c
IndexDir http://www.lib.berkeley.edu/~ghill/spider.html
MaxDepth 10
Delay 1
TmpDir .
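
For reference, the four directives above annotated with what I understand them to do with the `-S http` method (my reading of the swish-e 2.x docs; check the documentation shipped with your distribution to confirm):

```
# Starting URL for the crawl (with -S http, IndexDir takes URLs, not paths)
IndexDir http://www.lib.berkeley.edu/~ghill/spider.html
# Follow links at most 10 levels deep from the starting page
MaxDepth 10
# Pause between successive page requests
Delay 1
# Directory where swishspider writes its temporary contents/links/response files
TmpDir .
```

On Windows, `TmpDir` must be a directory the process can write to, so `.` is a reasonable choice while debugging.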



~/swish-e/src > ./swish-e -S http -c c
Indexing Data Source: "HTTP-Crawler"
Indexing http://www.lib.berkeley.edu/~ghill/spider.html..
retrieving http://www.lib.berkeley.edu/~ghill/spider.html (0)...
 - Using DEFAULT filter -  (9 words)
retrieving http://www.lib.berkeley.edu/~ghill/spider2.html (1)...
 - Using DEFAULT filter -  (4 words)

10 unique words indexed.
2 files indexed.
Running time: 2 seconds.
Indexing done!


Bill Moseley

mailto:moseley@hank.org
Received on Thu Apr 26 21:42:20 2001