HTTP indexing: internal site spidering

From: <arajamani(at)not-real.excite.com>
Date: Sat May 27 2000 - 00:13:14 GMT
Hello everyone and Mr. Klatchko,
  I agree with you, Mr. Klatchko, that the HTTP method does not look at the
file system at all. When I said "NOT visible to the WWW", I meant that the
web pages I want indexed are part of a company INTRANET: the main page (the
page to start spidering from) links only to web sites within the company
intranet and not to any WWW sites. I want these internal links to be
spidered. I really appreciate your taking the time to answer my questions.
Sincerely,
Ashok 


On Fri, 26 May 2000 15:56:32 -0700 (PDT), rsk@corpmail.brightmail.com wrote:

>  arajamani@excite.com wrote:
>  >   Thanks for pointing out the errors. I have gone ahead and changed the
>  > config file and the HTTP indexing works just fine! (I have enclosed the
>  > modified config file.) However, it is unable to spider down the links
>  > and index them too. All the links are part of the intranet and are NOT
>  > visible to the WWW. Is this what's preventing the spider from spidering
>  > down? Thanks once again for your help.
>  
>  The spider works by indexing the first page (depth 1).  It then finds
>  all links on that page that are on the same server (or an equivalent
>  server, as defined in the config file).  It then indexes each of those
>  pages (depth 2) and follows their links.  It does this until it reaches
>  its max depth or all files on the server are indexed.
>  
>  The most important thing is that it can only find pages that you tell it
>  to index or whose URL appears on one of the pages it indexes.  If
>  your comment that they are "NOT visible to the WWW" means there are no
>  links to the pages, then no, they won't be indexed.  How would the
>  spider know they exist?  (And don't suggest that it look at the file
>  system; the HTTP method was built to index foreign sites where it has no
>  access to the fs.)
>  
>  moo
>  ------------------------------------------------------------
>          Ron Samuel Klatchko - Senior Software Jester
>              Brightmail Inc - rsk@brightmail.com
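
The crawl described in the quoted reply can be sketched roughly as below. This
is only an illustration in Python, not the indexer's actual code: the names
spider, LinkExtractor, fetch, and equivalent_hosts are hypothetical, and the
SITE dictionary is a made-up in-memory intranet standing in for real HTTP
requests. It shows the two points made above: links are followed only on the
same (or an equivalent) server up to a max depth, and a page that nothing
links to is never discovered.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch, max_depth, equivalent_hosts=()):
    """Breadth-first crawl; returns {url: depth} for every page indexed."""
    allowed = {urlparse(start_url).netloc, *equivalent_hosts}
    indexed = {start_url: 1}              # the start page is depth 1
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = indexed[url]
        html = fetch(url)                 # "index" the page (here: just fetch it)
        if depth >= max_depth:
            continue                      # at max depth: do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if urlparse(target).netloc not in allowed:
                continue                  # skip links to other servers
            if target not in indexed:
                indexed[target] = depth + 1
                queue.append(target)
    return indexed


# Hypothetical intranet: orphan.html exists but no page links to it.
SITE = {
    "http://intranet.example/":
        '<a href="/hr.html">HR</a> <a href="http://www.example.com/">WWW</a>',
    "http://intranet.example/hr.html": '<a href="policies.html">Policies</a>',
    "http://intranet.example/policies.html": '<a href="/deep.html">Deep</a>',
    "http://intranet.example/deep.html": "",
    "http://intranet.example/orphan.html": "nothing links here",
}

result = spider("http://intranet.example/", lambda u: SITE.get(u, ""), max_depth=3)
for url, depth in result.items():
    print(depth, url)
```

With max_depth=3 this sketch indexes the start page, hr.html, and
policies.html. It skips the external www.example.com link, stops before
deep.html, and never sees orphan.html, because no indexed page carries a URL
for it.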





Received on Fri May 26 20:15:37 2000