Re: Can I just get a list of html docs

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jan 15 2003 - 15:14:57 GMT
On Wed, 15 Jan 2003, Vendetti, Jeff wrote:

> I'm looking to go thru a web site, and produce a list of all the html
> documents on the site, rather than an index.  Is there an option to do that?
> I didn't know if there was a way to electronically read the index created
> and just produce a list of all the html documents referred to in the index.

With swish?

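Once you have the index built, search for "not" plus a nonsense word
(which matches every document) and print only the path of each hit: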

   ./swish-e -w not sksksks -x 'swishdocpath\n'

You may need to use double quotes instead if you are on Windows.
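
Since the question asked for html documents only, the listing can be
filtered afterwards. This is a minimal sketch, not part of swish-e: the
function name html_only and the output file name are made up for
illustration, and it assumes header lines start with '#' and that the
results end with a line containing a single '.'.

```shell
# Hypothetical helper: keep only paths that look like HTML documents.
# Assumes swish-e header lines start with '#' and that the result list
# ends with a line containing only '.'; both are dropped.
html_only() {
    grep -v -e '^#' -e '^\.$' | grep -i -E '\.html?$' | sort -u
}

# Usage (assumes an index already exists in the current directory):
#   ./swish-e -w not sksksks -x 'swishdocpath\n' | html_only > html-docs.txt
```

Paths that carry a query string will not match the pattern, so adjust
the expression if your site uses them.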

If you just want to spider a site, you can run spider.pl:

   SPIDER_DEBUG=url SPIDER_QUIET=1 \
   ./spider.pl default http://localhost/ >/dev/null

The output is not just the URLs, but it would be easy to modify.

I would think something like wget might also be able to do that.
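
A rough, untested sketch of the wget route: crawl in spider mode with a
log file, then pull anything URL-like out of the log. The helper name
urls_from_log and the file names are made up for illustration, and the
log format is not assumed beyond "URLs appear somewhere in it".

```shell
# Hypothetical helper: extract URL-like strings from text on stdin and
# keep those ending in .html or .htm, with duplicates removed.
urls_from_log() {
    grep -o -E 'https?://[^[:space:]"]+' | grep -i -E '\.html?$' | sort -u
}

# Usage (crawls without saving pages, logging to wget.log):
#   wget --spider --recursive --no-verbose --output-file=wget.log http://localhost/
#   urls_from_log < wget.log > html-docs.txt
```

Check the result against the site by hand, since this only sees what the
crawl actually reached.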

> 
> Thanks
> 

-- 
Bill Moseley moseley@hank.org
Received on Wed Jan 15 15:15:18 2003