
RE: HTTP Crawler

From: Hsiao Ketung Contr 61 CS/SCBN <KETUNG.HSIAO(at)not-real.LOSANGELES.AF.MIL>
Date: Wed May 01 2002 - 23:37:59 GMT
Bill,

Thanks for the response.

I've just run ./swishspider . http://swish-e.org/index.html from the src
directory.
It ran correctly, because I get:
drwxr-xr-x   3 root     staff       1536 May  1 16:21 .
drwxr-xr-x   8 root     staff        512 Apr 30 09:42 ..
-rw-r--r--   1 root     other       5321 May  1 16:21 ..contents
-rw-r--r--   1 root     other        638 May  1 16:21 ..links
-rw-r--r--   1 root     other         14 May  1 16:21 ..response
in the src directory.

But if I run the following (from the src directory):
./swishspider . http://my-intranet-server-name/tmp.html

the content in ..links is unchanged.
So the run for the intranet URL is not working.
How do I get swishspider to run against the intranet as well?
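One way to narrow this down (a sketch, with an assumption: judging by its small size, ..response appears to hold the numeric HTTP status code of the fetch). After a swishspider run, a status other than 200 means the fetch itself failed, e.g. a DNS, proxy, or firewall problem with the intranet host. The file content below is simulated just to show the check:

```shell
# Hypothetical check: simulate reading the status that swishspider
# leaves in ..response (assumption: it stores the numeric HTTP code).
printf '200' > ..response
status=$(cat ..response)
if [ "$status" = "200" ]; then
    echo "fetch OK"
else
    echo "fetch failed with HTTP status: $status"
fi
```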

I've just searched the discussion group for "swishspider intranet" and
found two links, but neither describes my problem.
I've also tried swishspider with our intranet IP address in the URL, and
..links is still unchanged.

Can anyone please shed some light on this one?

>$url =~ s/http\:\/\/www\.losangeles\.af\.mil\///;
>	into  the while loop in
>	sub search_parse.
Yes, the above is Perl code.  It strips the www.losangeles.af.mil prefix
from the $url variable.
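For illustration, the same substitution can be sketched from the command line with sed (the sample URL below is hypothetical, not one from the actual index):

```shell
# Strip the http://www.losangeles.af.mil/ prefix, as the Perl s/// above does.
url="http://www.losangeles.af.mil/docs/index.html"
echo "$url" | sed 's|http://www\.losangeles\.af\.mil/||'
# prints: docs/index.html
```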



-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Wednesday, May 01, 2002 3:56 PM
To: KETUNG.HSIAO@LOSANGELES.AF.MIL; Multiple recipients of list
Subject: Re: [SWISH-E] HTTP Crawler


At 03:43 PM 05/01/02 -0700, Hsiao Ketung Contr 61 CS/SCBN wrote:
>I've been trying to get swish-e HTTP crawler working for the last 2 days.
>The HTTP crawler works if the IndexDir  is set to a URL on my own server 
>where I'm running the swish-e.
>
>It's when I set the IndexDir to URL other than my own server that I get
>"no word indexes"  type of output.

If you are using the -S http method, then swish is using a Perl helper
program called swishspider.  You can run this program on its own to see
whether it's fetching documents.

~/swish-e/src > ./swishspider
Usage: SwishSpider localpath url

~/swish-e/src > ./swishspider . http://swish-e.org/index.html

~/swish-e/src > ll -t | head
total 52672
-rw-r--r--   1 lii      users        5321 May  1 15:52 ..contents
-rw-r--r--   1 lii      users         638 May  1 15:52 ..links
-rw-r--r--   1 lii      users          14 May  1 15:52 ..response

That will tell you whether it can fetch the remote document.



>Also,  I have to modify the Perl script in cgi-bin to make the HTTP crawler
>result show up correctly. I have to add this line:
>$url =~ s/http\:\/\/www\.losangeles\.af\.mil\///;
>	into  the while loop in
>	sub search_parse.

I don't really follow that.  You may be describing a CGI script I'm not
familiar with.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed May 1 23:38:10 2002