spidering with swish

From: Lance Perry <lp_swish(at)not-real.lanceandgenoa.net>
Date: Wed Jan 05 2005 - 20:16:01 GMT
I am spidering a site (the spider is run by swish-e as part of indexing).

The site contains .exe and .zip files. I DO NOT want those files to be
indexed (or even downloaded). Here is my command line for swish indexing:

    swish-e -S prog -c swish.config

How do I keep it from indexing .exe and .zip files? (My config files are
listed below.) I even have some entries in my robots.txt file that I thought
would keep the files from being spidered, but that isn't working either.

lance

--swish.config--
#--- swish.config:

#--- where is the spider proggie?
IndexDir /home/perry/Soft/swish/lib/swish-e/spider.pl

#--- configuration for the spider
SwishProgParameters ccenter.config

#--- swish index file
IndexFile info.index

PropertyNames title

#--- grab the body to store (to be searchable)
StoreDescription HTML2 <body> 20000

#--- index all these guys
IndexContents HTML2 .html .htm .php .pdf .doc .xls .ppt

#--- File Filter for pdf files
FileFilter .pdf /home/perry/WebTools/bin/pdftohtml "'%p' -stdout -q -noframes"

#--- File Filter for doc files
FileFilter .doc /home/perry/WebTools/bin/catdoc "-s8859-1 -d8859-1 '%p'"

#--- File Filter for xls files
FileFilter .xls /home/perry/WebTools/bin/xlhtml "'%p'"

#--- File Filter for ppt files
FileFilter .ppt /home/perry/WebTools/bin/ppthtml "'%p'"
--end of swish.config--



--ccenter.config--
    my %ccenter = (
            email       => 'Lance.Perry@ourdomain.com',
            base_url    => 'http://our.domain.com/ccenter/',
            delay_sec   => '0',
            max_depth   => '1',
            credentials => 'username:password'
    );

    @servers = ( \%ccenter );
--end of ccenter.config--


--robots.txt--
User-agent: *
Disallow: /downloads/cisco-vpn/*.exe$

User-agent: *
Disallow: /downloads/cisco-vpn/*.zip$

User-agent: *
Disallow: /downloads/cisco-vpn/*.tar$

User-agent: *
Disallow: /downloads/cisco-vpn/*.gz$

User-agent: *
Disallow: /downloads/cisco-vpn/*.dmg$
--end of robots.txt--
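
One possible way to get this, sketched on the assumption that the spider.pl
in use supports the test_url callback described in its documentation: the
spider calls it with the URI object before requesting a URL, and a false
return skips that URL, so the file is never downloaded. The regular
expression and the extension list here are illustrative, not taken from a
tested setup.

```perl
# ccenter.config with a test_url callback added (a sketch, not a tested config)
my %ccenter = (
    email       => 'Lance.Perry@ourdomain.com',
    base_url    => 'http://our.domain.com/ccenter/',
    delay_sec   => '0',
    max_depth   => '1',
    credentials => 'username:password',

    # Assumed to be called before each request; returning false skips the URL.
    test_url    => sub {
        my ( $uri, $server ) = @_;
        # Skip anything whose path ends in .exe or .zip (case-insensitive).
        return $uri->path !~ /\.(?:exe|zip)$/i;
    },
);

@servers = ( \%ccenter );
```

As for robots.txt, the original robots exclusion convention matches Disallow
values as plain path prefixes. The `*` and `$` patterns are later extensions
that not every parser honors, which may be why those entries have no effect
here.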
Received on Wed Jan 5 12:16:02 2005