I have been indexing just such a site recently: one where there are
hundreds of files, no filesystem access and no HTML links to allow
spidering. This site was created by my current employers. At my request
they gave me an ASCII file with all the URLs of the site's documents. I
have been using my own Perl programs to read those URLs, create custom
user.config files on the fly, and index the documents that way. So far this has
worked pretty well.
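
As a rough illustration of that workflow (a Python sketch, not the Perl
programs mentioned above), the following reads a plain-text URL list and
builds the text of a config file that names every document explicitly. The
function names, the sample URLs and the index file name are hypothetical, and
the use of the `IndexFile` and `IndexDir` directives is an assumption that
should be checked against the SWISH-E version in use.

```python
def read_urls(text):
    """Return the unique URLs from a plain-text list, one per line, in order."""
    seen = set()
    urls = []
    for line in text.splitlines():
        url = line.strip()
        # Skip blank lines and anything that is not an http(s) URL.
        if not url.startswith(("http://", "https://")):
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def build_config(urls, index_file):
    """Build config text that lists every URL explicitly (no spidering needed)."""
    # Assumed directive: the file the index is written to.
    lines = ["IndexFile " + index_file]
    # Assumed directive: one line per document to fetch and index.
    lines += ["IndexDir " + url for url in urls]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample list; in practice this would be read from the ASCII file.
    sample = (
        "http://www.example.com/docs/a.html\n"
        "\n"
        "http://www.example.com/docs/b.html\n"
        "http://www.example.com/docs/a.html\n"
    )
    print(build_config(read_urls(sample), "site.index"), end="")
```

The generated text would then be saved as the config file and passed to the
indexer in place of a hand-written one.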
I don't know if there is a general solution to your problem. Unless you
actually know the names of the files on the site, it is very hard to get at
them, even if free access is allowed. I admit that I am too new to the
world of Web servers, HTML and Perl to offer any really good advice, so
hopefully someone else out there has something useful to say about this.
From: PropheZine Owner [SMTP:email@example.com]
Sent: Saturday, February 26, 2000 11:56 AM
To: Multiple recipients of list
Subject: [SWISH-E] AutoSwish - How index non-linked pages
Let's say I want to index a site that has 1,000 pages that there are no
links to: an archive that is currently indexed with the swish file system
method.
How would these pages be indexed using the HTTP method? What would be a
good method? If I do not have an index.html page, the web server
generated index would contain links to all the pages, so they would be
linked. But I want an index.html page there so that people cannot get the
list of files in the directory.
Is there a good method that someone is already using?
[mailto:firstname.lastname@example.org]On Behalf Of Chris Humphries
Sent: Saturday, February 26, 2000 6:42 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Swish-E and HTML documents with frames
You wrote: "But if you find one of the pages that is indirectly referenced,
you get the page only."
This is very true, and if one were spidering indiscriminately, it would be
a problem because there is probably no way of knowing that the page you had
found *was* indirectly referenced. However, most of my indexing so far has
been just the first page of a Web site, which means that my approach to
reading through the frames is probably safe. Each Web site will already
have been looked at by a human being and its basic structure understood.
If you can think of a case you would like to see handled that isn't handled
by the approach I am using, I would really appreciate it if you could
supply a url for me to try out.
From: Ron Samuel Klatchko [SMTP:email@example.com]
Sent: Saturday, February 26, 2000 1:29 AM
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Re: Swish-E and HTML documents with frames
Chris Humphries wrote:
> The way my system works, all the "frame src" links are read to create one
> big file, and *any* "a href" links found in any of those files are treated
> as if they were from that one big file. This means that to get at any <A>
> tags in the HTML pages you describe, one would need to set the spider to
> read to a depth of 2.
That's not what I'm worried about. If you find one of the pages
directly referenced from the frameset, then you get the entire
frameset. But if you find one of the pages that is indirectly
referenced, you get the page only. Is that behavior acceptable? The
first one is nicer but the second one will be more common.
Ron Samuel Klatchko - Software Jester
Brightmail Inc - firstname.lastname@example.org
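
The frame handling quoted in this message (read every "frame src" page and
treat all "a href" links found there as if they came from one big file) can
be sketched roughly as follows. This is a minimal Python illustration under
stated assumptions, not the code used by either poster: `merge_frameset` and
`LinkCollector` are hypothetical names, and `fetch` is an assumed callable
that takes a URL and returns the page's HTML text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <frame src> and <a href> values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.frames = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and attrs.get("src"):
            self.frames.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])


def collect(html):
    """Return (frame sources, anchor hrefs) found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.frames, parser.anchors


def merge_frameset(frameset_url, fetch):
    """Return all links of a frameset as if its frames were one big file.

    `fetch` is a hypothetical callable: URL in, HTML text out.
    """
    frame_srcs, anchors = collect(fetch(frameset_url))
    merged = [urljoin(frameset_url, href) for href in anchors]
    for src in frame_srcs:
        frame_url = urljoin(frameset_url, src)
        _, frame_anchors = collect(fetch(frame_url))
        # Resolve each link against the frame page it was found in.
        merged.extend(urljoin(frame_url, href) for href in frame_anchors)
    return merged
```

Note that this only covers the case where the spider starts from the
frameset. A page reached indirectly, as discussed above, carries no
information about the frameset it belongs to, so the sketch does nothing for
that case.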
Received on Sat Feb 26 07:13:16 2000