RE: AutoSwish - How index non-linked pages

From: Chris Humphries <ChrisJMH(at)not-real.vermilion99.freeserve.co.uk>
Date: Sat Feb 26 2000 - 12:09:32 GMT
Bob,

I have been indexing just such a site recently: one where there are 
hundreds of files, no filesystem access and no HTML links to allow 
spidering. This site was created by my current employers. At my request 
they gave me an ASCII file with all the URLs of the site's documents. I 
have been using my own Perl programs to read those URLs, create custom 
user.config files on the fly, and index them like that. So far this has 
worked pretty well.
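The approach described above can be sketched roughly as follows. Chris's version was in Perl; this is an equivalent outline in Python. The file names (`urls.txt`, `user.config`) and the exact swish-e directives shown (`IndexFile`, `IndexDir`) are assumptions based on common swish-e usage, not a copy of his actual scripts.

```python
"""Sketch: read a flat file of URLs and emit a swish-e config
for the HTTP retrieval method, then run the indexer over it.
Directive names and file paths are assumptions, not Chris's code."""

import subprocess


def write_config(urls, config_path="user.config", index_path="index.swish"):
    # One IndexDir line per URL; swish-e's HTTP method fetches each one.
    lines = [f"IndexFile {index_path}"]
    lines += [f"IndexDir {url}" for url in urls]
    with open(config_path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return config_path


def index_site(url_list_path="urls.txt"):
    # The ASCII file lists one document URL per line.
    with open(url_list_path) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    config = write_config(urls)
    # Invoke the indexer with the HTTP retrieval method (-S http).
    subprocess.run(["swish-e", "-c", config, "-S", "http"], check=True)
```

Since the URL list enumerates every document directly, no spidering depth is needed: each page is fetched and indexed whether or not anything links to it.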

I don't know if there is a general solution to your problem. Unless you 
actually know the names of the files on the site, it is very hard to get at 
them, even if free access is allowed. I admit that I am too new to the 
world of Web servers, HTML and Perl to offer any really good advice, so 
hopefully someone else out there might have something useful to say about 
this.

Chris Humphries

-----Original Message-----
From:	PropheZine Owner [SMTP:bob@prophezine.com]
Sent:	Saturday, February 26, 2000 11:56 AM
To:	Multiple recipients of list
Subject:	[SWISH-E] AutoSwish - How index non-linked pages

Hi:

Let's say I want to index a site that has 1,000 pages with no links to
them: an archive that is currently indexed with the swish file system method.

How would these pages be indexed using the HTTP method?  What would be a
good method?  If I did not have an index.html page, then the web server's
generated index would link to all the pages and they would be reachable.
But I want an index.html page there so people cannot get the list of
files in the directory.

Is there a good method that someone is already using?

Thanks.

Bob

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Chris Humphries
Sent: Saturday, February 26, 2000 6:42 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Swish-E and HTML documents with frames


Dear Ron,

You wrote: "But if you find one of the pages that is indirectly referenced,
you get the page only."

This is very true, and if one were spidering indiscriminately, it would be
a problem because there is probably no way of knowing that the page you had
found *was* indirectly referenced. However, most of my indexing so far has
been just the first page of a Web site, which means that my approach to
reading through the frames is probably safe. Each Web site will already
have been looked at by a human being and its basic structure understood.

If you can think of a case you would like to see handled that isn't handled
by the approach I am using, I would really appreciate it if you could
supply a URL for me to try out.

Many thanks,

Chris Humphries

-----Original Message-----
From:	Ron Samuel Klatchko [SMTP:rsk@brightmail.com]
Sent:	Saturday, February 26, 2000 1:29 AM
To:	ChrisJMH@vermilion99.freeserve.co.uk
Cc:	Multiple recipients of list
Subject:	Re: [SWISH-E] Re: Swish-E and HTML documents with frames

Chris Humphries wrote:
> The way my system works, all the "frame src" links are read to create one
> big file, and *any* "a href" links found in any of those files are
> returned as if they were from that one big file. This means that to get
> at any <A> tags in the HTML pages you describe, one would need to set the
> spider to read to a depth of 2.

That's not what I'm worried about.  If you find one of the pages
directly referenced from the frameset, then you get the entire
frameset.  But if you find one of the pages that is indirectly
referenced, you get the page only.  Is that behavior acceptable?  The
first one is nicer but the second one will be more common.
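The frame-flattening behaviour under discussion can be illustrated with a short sketch: collect every `frame src` from the frameset page, then report every `a href` found in any of the fetched frame documents as if it came from one combined page. This is an illustrative Python outline, not Chris's actual Perl implementation; the `fetch` callback stands in for a real HTTP GET.

```python
"""Sketch of the frame-flattening idea: merge all frame documents
into one logical page and return the <a href> links found in any
of them. The fetch step is a stub; a real spider would do an HTTP
GET per frame URL."""

from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects frame src and anchor href attributes as it parses."""

    def __init__(self):
        super().__init__()
        self.frame_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and "src" in attrs:
            self.frame_srcs.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])


def flatten_frameset(frameset_html, fetch):
    """fetch(url) -> HTML text. Returns every <a href> found in the
    frameset's frames, as if all frames were one big page."""
    top = LinkCollector()
    top.feed(frameset_html)
    combined = LinkCollector()
    for src in top.frame_srcs:
        combined.feed(fetch(src))
    return combined.hrefs
```

Note that this only helps when the spider starts at the frameset page itself, which is exactly Ron's point: a page reached indirectly carries no back-reference to its frameset, so there is no way to reconstruct the combined view from that direction.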

moo
------------------------------------------------------------
           Ron Samuel Klatchko - Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Sat Feb 26 07:13:16 2000