Re: Adding files from external site - suggestions?

From: Rob de Santos AFANA <rdesantos(at)not-real.afana.com>
Date: Wed Apr 14 2004 - 16:06:09 GMT
Hi All,

I'm still working on this issue since my last post on 3/9/04. I have
made some progress, but I now need to get Swish-e to index the files,
which it is not doing. To recap:

> I want to include data [html] from another site in my
> index.  My commission sales of the other site's products go
> thru the other site but data on available products doesn't
> show up in my index.
> 
> My current plan is as follows:
> 
> Use wget to mirror the section of the other site over to 
> mine.  This will give a set of files under 
> http://www.afana.com/www.othersite.com/afl/

This is done.  All the files are .asp files but saved as .asp.html to
make them visible to Swish-e.

> Then run Swish-e against that.  Then on display of the index
> I will need to transform the URLs, presumably with
> ReplaceRules??  e.g.:
>
> I will have a URL such as:
> http://www.afana.com/www.othersite.com/afl/video_detail.asp?vid_id=338

> and have to transform it to:
> http://www.othersite.com/cgi-bin/at.pl?a=195711&e=/afl/video_detail.asp?vid_id=338

I have these regex rules in place in my swconfig.conf:

ReplaceRules regex !afana.com/www.sportsdelivered.com/(.+).html!sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/$1!
ReplaceRules regex !http: //www.sportsdelivered.com/afl/(.+)!http: //www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/$1!

(deliberate space inserted after http: above to avoid e-mail program
converting this to a URL)
To the extent I can test them, they do the job.
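
One way to check the first rule outside of Swish-e is to replay it in a
few lines of Python. This is only a rough sketch: it assumes Python's re
module treats this pattern the same way as the regex engine Swish-e uses,
and the function name rewrite is made up for illustration.

```python
import re

# First ReplaceRules pattern from swconfig.conf, copied as written.
# The dots are unescaped, so each one matches any single character.
PATTERN = r"afana.com/www.sportsdelivered.com/(.+).html"
# The $1 back-reference in the rule is written as \1 in Python.
REPLACEMENT = r"sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/\1"


def rewrite(url: str) -> str:
    """Apply the first rule once, as a rough stand-in for Swish-e (assumption)."""
    return re.sub(PATTERN, REPLACEMENT, url, count=1)


if __name__ == "__main__":
    mirrored = "http://www.afana.com/www.sportsdelivered.com/afl/index.asp.html"
    print(rewrite(mirrored))
```

Running this on the mirrored index file path should print the same
at.pl URL that the search results show for that file.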

The problem now is that Swish-e does not appear to be indexing the
whole of the necessary directory:
http://www.afana.com/www.othersite.com/afl/

When I do a search to look for files with sportsdelivered.com in the URL
the only thing it finds is the index file:
http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/afl/index.asp
(which is the correct regex transform of:
http://www.afana.com/www.sportsdelivered.com/afl/index.asp.html )

For example, an actual search using
http://www.afana.com/swish-e/lib/swish-e/swish.cgi finds only this:
2 Sports Delivered -- rank: 940 
Australian Football Video AFL Name a Game HOME MORE VIDEOS CONTACT
Player Profiles Compilations Ansett Cup WEG Grand Finals Seasons
Highlights Team Highlights Double Packs Triple Packs DVD Club Histories
Club Gift Packs Adelaide  Brisbane  Carlton  Collingwood  Essendon
Fremantle  Geelong  Hawthorn  Kangaroos  Melbourne  Port Adelaide
Richmond  St.Kilda  Sydney  W. Bulldogs  W. Coast Search Restricted by
category 
Last Modified Date:  
Document Size: 72087 
Document Path:
http://www.sportsdelivered.com/cgi-bin/at.pl?a=195711&e=/afl/index.asp 

Apparently, the other 600 files in my directory are skipped.  Because
they are extracted from the dynamically generated pages at the other
site, they aren't necessarily linked in a "spiderable" chain from the
index file, but all of them need to be indexed.
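
One possible workaround, sketched here rather than tested against
Swish-e, is to generate a plain HTML page that links to every mirrored
file, so that a spider has something to follow. The local path, the
output file name all_files.html and the function name build_link_page
are all hypothetical.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_link_page(mirror_dir: Path, base_url: str,
                    output_name: str = "all_files.html") -> str:
    """Return an HTML page with one link per mirrored .html file."""
    items = []
    for path in sorted(mirror_dir.rglob("*.html")):
        if path.name == output_name:
            continue  # do not list the generated page itself
        rel = path.relative_to(mirror_dir).as_posix()
        # Percent-encode characters such as '?' that wget may leave in names.
        href = base_url.rstrip("/") + "/" + quote(rel)
        items.append(
            f'<li><a href="{html.escape(href)}">{html.escape(rel)}</a></li>'
        )
    return (
        "<html><head><title>Mirror file list</title></head><body><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical local location of the wget mirror.
    mirror = Path("public_html/www.othersite.com/afl")
    page = build_link_page(mirror, "http://www.afana.com/www.othersite.com/afl")
    (mirror / "all_files.html").write_text(page, encoding="utf-8")
```

The generated page would still have to be reachable by the spider, for
instance by linking to it from a page that is already being indexed.
Any percent-encoded characters in the links would also carry through
into the indexed paths, so the rewrite rules may need adjusting.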

So, any thoughts on what the best way to go about this is?  Do I run
another index job and then merge the indexes or can I do something to
get these included?  Here's my index cron job at present:

$HOME/public_html/swish-e/bin/swish-e -S prog -c $HOME/public_html/swish-e/bin/swconfig.conf

and swconfig.conf contains this:
---
IndexDir spider.pl

NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql .csv .dir .idx .dat
IndexContents HTML* .htm .html .shtm .shtml .css
IndexContents TXT* .txt .text
IndexContents XML* .xml .wml .rdf .rss
DefaultContents HTML

SwishProgParameters /home/afana/public_html/swish-e/lib/swish-e/SwishSpiderConfig.pl http://www.afana.com

IndexReport 1

ParserWarnLevel 1
IndexFile /home/afana/public_html/swish-e/website.index
obeyRobotsNoIndex yes
---

Any ideas?

-Rob de Santos
-Columbus, Ohio USA
Chairman of the Board,
Australian Football Association of North America (AFANA)
ph: 1-888-4AFANA1 (North America) (1-888-423-2621)
ph: 1-614-338-0002 (outside NA)  
e-mail: rdesantos(at)not-real.afana.com   web: <http://www.afana.com>
Contents of this message may not be posted
to the web or "blogged" without prior permission.
Received on Wed Apr 14 09:16:47 2004