HTTP indexing: clarification of 'intranet'

From: <arajamani(at)not-real.excite.com>
Date: Wed May 31 2000 - 16:22:38 GMT

Hello all and Mr. Klatchko,
   Let me clarify my statements to dispel any confusion. I used the word
"intranet" to mean a collection of sites (let me use the word 'sites' for
now) that are only visible/accessible from within the company premises and are
NOT visible to the external world. In other words, these sites are not a part
of the World Wide Web (for reasons of security, of course).
   What I would like SWISH-E to do, therefore, is to index these internal
sites. Most of the links on the main page of this company 'intranet' lead to
other sites/pages within the 'intranet', and VERY FEW of the links lead to
pages that are part of the World Wide Web (and that are not a part of the
company 'intranet'). I would like SWISH-E to access the 'intranet'
sites/pages and ignore the WWW sites.
   I must mention at this point that, as a part of testing, when I ran the
HTTP spidering on my own web site (which IS a part of the WWW and NOT a part
of the company intranet), it worked like a charm. At the company, we would
like SWISH-E to do exactly the opposite: index the intranet pages and skip
the WWW ones.
Thanking one and all,
Sincerely,
Ashok Rajamani





On Tue, 30 May 2000 16:54:43 -0700 (PDT), rsk@corpmail.brightmail.com wrote:

>  
>  > I meant that the web pages I want indexed are part of a company
>  > INTRANET, in that there definitely are links on the main page (the page to
>  > start spidering from) only to web sites that are within the company intranet
>  
>  Okay, in this sentence you talk about pages and sites.  I need to know
>  whether you are describing your environment exactly or whether you are
>  being inexact in your phrasing.  What is your definition of "web site"
>  as used in the above sentence?  For that matter, what is your definition
>  of an "intranet"?
>  
>  moo
>  
>  arajamani@excite.com wrote:
>  > 
>  > Hello everyone and Mr. Klatchko,
>  >   I agree with you, Mr. Klatchko, when you say that in the HTTP method we
>  > don't look at the file system at all. Also, when I said "NOT visible to the
>  > WWW" I meant that the web pages I want indexed are part of a company
>  > INTRANET, in that there definitely are links on the main page (the page to
>  > start spidering from) only to web sites that are within the company intranet
>  > and not to any WWW sites. I want these internal links to be spidered. I
>  > really appreciate your taking time out to answer my questions.
>  > Sincerely,
>  > Ashok
>  > 
>  > On Fri, 26 May 2000 15:56:32 -0700 (PDT), rsk@corpmail.brightmail.com wrote:
>  > 
>  > >  arajamani@excite.com wrote:
>  > >  >   Thanks for pointing out the errors. I have gone ahead and changed the
>  > >  > config file and the HTTP indexing works just fine! (I have enclosed the
>  > >  > modified config file.) However, it is unable to spider down the links and
>  > >  > index them too. All the links are a part of the intranet and are NOT
>  > >  > visible to the WWW. Is this what's preventing the spider from spidering
>  > >  > down? Thanks once again for your help.
>  > >
>  > >  The spider works by indexing the first page (depth 1).  It then finds
>  > >  all links on that page that are on the same (or equivalent, as defined in
>  > >  the config file) server.  It then indexes each of those pages (depth 2)
>  > >  and follows those links.  It does this until it reaches its max depth
>  > >  or all files on a server are indexed.
>  > >
>  > >  The most important thing is that it can only find pages that you tell it
>  > >  to index or that it can find a URL for on one of the pages it indexes.  If
>  > >  your comment that they are "NOT visible to the WWW" means there are no
>  > >  links to the pages, then no, they won't be indexed.  How would the
>  > >  spider know they exist?  (And don't suggest that it look at the file
>  > >  system; the HTTP method was built to index foreign sites where it has no
>  > >  access to the fs.)
>  > >
>  > >  moo
>  > >  ------------------------------------------------------------
>  > >          Ron Samuel Klatchko - Senior Software Jester
>  > >              Brightmail Inc - rsk@brightmail.com
>  > 
>  > _______________________________________________________
>  > Get 100% FREE Internet Access powered by Excite
>  > Visit http://freelane.excite.com/freeisp
>  
>  -- 
>  ------------------------------------------------------------
>          Ron Samuel Klatchko - Senior Software Jester
>              Brightmail Inc - rsk@brightmail.com





Received on Wed May 31 12:36:52 2000
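
The spidering behaviour described in the quoted message of 26 May (index the start page at depth 1, follow only links on the same or an equivalent server, stop at a maximum depth) can be illustrated with a short Python sketch. This is not SWISH-E's own code. The names `crawl`, `LinkParser`, `fetch`, `allowed_hosts` and `FAKE_SITE`, and all the example.* URLs, are hypothetical and chosen only for illustration. The pages come from an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, allowed_hosts, max_depth):
    """Breadth-first crawl; returns {url: depth} for each page indexed.

    fetch(url) is a hypothetical callable that returns the page's HTML
    as a string, or None if the page cannot be retrieved.
    allowed_hosts plays the role of "same or equivalent server".
    """
    indexed = {}
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the start page is depth 1
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable pages are not indexed
        indexed[url] = depth
        if depth >= max_depth:
            continue  # index this page but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).hostname in allowed_hosts and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return indexed


# Made-up intranet: two internal hosts and one external (WWW) link.
FAKE_SITE = {
    "http://intranet.example/": (
        '<a href="/hr.html">HR</a> '
        '<a href="http://wiki.intranet.example/">Wiki</a> '
        '<a href="http://www.example.com/">External</a>'
    ),
    "http://intranet.example/hr.html": '<a href="/forms.html">Forms</a>',
    "http://intranet.example/forms.html": "<p>No links here.</p>",
    "http://wiki.intranet.example/": '<a href="http://intranet.example/">Home</a>',
    "http://www.example.com/": "<p>Public page.</p>",
}

result = crawl(
    "http://intranet.example/",
    FAKE_SITE.get,
    {"intranet.example", "wiki.intranet.example"},
    max_depth=2,
)
for page, depth in sorted(result.items()):
    print(depth, page)
```

With `max_depth=2` the sketch indexes the start page and the two internal pages it links to, and never requests the external page. A page that no indexed page links to (such as `forms.html` at this depth) is never discovered, which is the point made in the quoted message.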