Re: HTTP indexing: config file corrected.

From: Ron Samuel Klatchko <rsk(at)not-real.corpmail.brightmail.com>
Date: Fri May 26 2000 - 22:56:33 GMT
arajamani@excite.com wrote:
>   Thanks for pointing out the errors. I have gone ahead and changed the
> config file and the HTTP indexing works just fine! (I have enclosed the
> modified config file.) However, it is unable to spider down the links and
> index them too. All the links are part of an intranet and are NOT visible to
> the WWW. Is this what's preventing the spider from spidering down?
> Thanks once again for your help.

The spider works by indexing the first page (depth 1).  It then finds
all links on that page that are on the same (or equivalent as defined in
the config file) server.  It then indexes each of those pages (depth 2)
and follows those links.  It does this until it reaches its max depth
or all files on a server are indexed.
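
As a rough illustration of the traversal just described, here is a
minimal Python sketch.  It is not the spider's actual code: the names
crawl, fetch, max_depth, equivalent_hosts and LinkParser are all made up
for this example, and fetch is assumed to be any function that returns a
page's HTML as a string, or None if the page cannot be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth, equivalent_hosts=()):
    """Return {url: depth} for every page reached from start_url.

    fetch(url) is a caller-supplied (hypothetical) function returning
    the page's HTML as a string, or None if the page is unavailable.
    """
    # Only links on the starting server, or on one declared equivalent,
    # are followed.
    allowed = {urlsplit(start_url).netloc.lower()}
    allowed.update(host.lower() for host in equivalent_hosts)

    indexed = {}
    queued = {start_url}
    queue = deque([(start_url, 1)])  # the first page is depth 1

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # nothing to index for an unreachable page
        indexed[url] = depth
        if depth >= max_depth:
            continue  # index this page, but do not follow its links

        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(link).netloc.lower() in allowed and link not in queued:
                queued.add(link)
                queue.append((link, depth + 1))

    return indexed
```

Passing a plain dictionary's get method as fetch is enough to try it
without touching the network.  A page that no crawled page links to is
never queued, so it never shows up in the result.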

The most important thing is that it can only find pages that you tell it
to index or whose URLs appear on one of the pages it indexes.  If
your comment that they are "NOT visible to the WWW" means there are no
links to the pages, then no, they won't be indexed.  How would the
spider know they exist?  (And don't suggest that it look at the file
system; the HTTP method was built to index foreign sites where it has no
access to the fs.)

moo
------------------------------------------------------------
        Ron Samuel Klatchko - Senior Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Fri May 26 18:58:55 2000