HTTP spider in C - works on NT.

From: Mark Gaulin <gaulin(at)not-real.globalspec.com>
Date: Fri Apr 23 1999 - 18:29:22 GMT
Hi
My first real attempt to use the HTTP indexing method was taking too long,
so I hacked together something to replace the perl "swishspider.pl" method
of fetching pages and finding links. The program is written in C (MSVC). I
don't expect it to compile without modification under any form of unix, but I
figure someone might be motivated to port it. Let me know if you want to
give it a shot and I'll email the code.

It is definitely faster... I indexed 500 files in 1.5 minutes, which was a
big improvement over the perl version.

BIG DISCLAIMER:
The code is a quick hack... it looks for <A ... href="<url>" ...> and
<FRAME ... src="<url>" ...> tags and expects to see double quotes around
the url. It does try to skip comments, but I would not bet on a clean parse
of incorrect or unconventional HTML. All I can say for sure is that it
worked on my pages.
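
For anyone curious what that kind of scan looks like, here is a rough
sketch. It is not the GetPage source, just an illustration of the approach
described above: skip <!-- ... --> comments, look for <A> and <FRAME> tags,
and pull out a double-quoted href or src value. The names extract_links,
MAX_LINKS and MAX_URL are made up for this example, and it shares the same
limitations (no single quotes, no unquoted values, no spaces around the '=').

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 64   /* hypothetical limits for this sketch */
#define MAX_URL   256

/* Case-insensitive test that the string s begins with prefix. */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Look between p and end (the tag's closing '>') for  attr="value"
 * preceded by whitespace.  Copies value into out; returns 1 if found. */
static int find_quoted_attr(const char *p, const char *end,
                            const char *attr, char *out, size_t outsz)
{
    size_t alen = strlen(attr);

    for (; p < end; p++) {
        if (isspace((unsigned char)*p) && starts_with_ci(p + 1, attr) &&
            p[1 + alen] == '=' && p[2 + alen] == '"') {
            const char *val = p + alen + 3;
            const char *q = memchr(val, '"', (size_t)(end - val));
            size_t len;

            if (q == NULL)
                return 0;               /* no closing quote inside the tag */
            len = (size_t)(q - val);
            if (len >= outsz)
                return 0;               /* url too long for the buffer */
            memcpy(out, val, len);
            out[len] = '\0';
            return 1;
        }
    }
    return 0;
}

/* Scan a NUL-terminated HTML buffer and collect link urls.
 * Returns the number of urls stored in links. */
size_t extract_links(const char *html, char links[][MAX_URL], size_t max_links)
{
    size_t n = 0;
    const char *p = html;

    while (n < max_links && (p = strchr(p, '<')) != NULL) {
        const char *attr = NULL;
        const char *end;
        size_t name_len = 0;

        if (strncmp(p, "<!--", 4) == 0) {       /* skip a comment whole */
            const char *close = strstr(p + 4, "-->");
            if (close == NULL)
                break;
            p = close + 3;
            continue;
        }
        if (starts_with_ci(p, "<a") && isspace((unsigned char)p[2])) {
            attr = "href";
            name_len = 2;
        } else if (starts_with_ci(p, "<frame") && isspace((unsigned char)p[6])) {
            attr = "src";
            name_len = 6;
        }
        if (attr == NULL) {                     /* some other tag */
            p++;
            continue;
        }
        end = strchr(p, '>');
        if (end == NULL)
            break;
        if (find_quoted_attr(p + name_len, end, attr, links[n], MAX_URL))
            n++;
        p = end + 1;
    }
    return n;
}
```

Call it with the page contents and an array such as
char links[MAX_LINKS][MAX_URL]; a url containing '>' inside the quotes will
confuse it, which is the sort of thing the disclaimer is about.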

If you are running NT (or maybe even Win98 or Win95) you can just use the
binary, which is at ftp://ftp.designinfo.com/GetPage.exe.

To use it change your swish.config file to include the following line:
PerlPath <path-to-the-program>\GetPage.exe

GetPage is designed to notice that it is being called like swishspider.pl
and does the same thing (and creates a couple of extra temp files that
ought to be cleaned up too).

	Mark
Received on Fri Apr 23 11:29:37 1999