Re: following xml relative links with spider.pl

From: Cas Tuyn <cas.tuyn(at)not-real.gmail.com>
Date: Tue Jan 16 2007 - 16:20:23 GMT
Brian,

OK, I overlooked that you are indexing XML. We index HTML, PDF and DOC
files, no XML, and they all contain real hyperlinks.

Our workflow converts everything to HTML before it is offered to the
indexer. That might mean your style sheet also turns the links into HTML
links that then get followed, but I doubt it, because spidering and
content indexing are two separate processes. Maybe the usual experts can
confirm this?

You may need to convert the XML to HTML first and then index the
resulting HTML site.
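
A minimal sketch of that conversion step, assuming Python is available on the indexing machine. It reads a Subversion XML directory listing like the one in your spider output and writes a plain HTML page in which every <dir> and <file> entry becomes an ordinary <a href> link. The function name svn_index_to_html and the shortened SAMPLE listing are made up for illustration; nothing here is part of swish-e or spider.pl, and how you serve or save the generated HTML for the spider is left open.

```python
import xml.etree.ElementTree as ET
from html import escape
from urllib.parse import urljoin


def svn_index_to_html(xml_text, base_url):
    """Turn a Subversion XML index into an HTML page of real hyperlinks.

    Only <dir> and <file> elements are used, so the href on the root
    <svn> element (the subversion.tigris.org link) is ignored.
    """
    root = ET.fromstring(xml_text)
    links = []
    for elem in root.iter():
        if elem.tag in ("dir", "file"):
            href = elem.get("href")
            if href:
                # Resolve relative hrefs such as "docs/" against the listing URL.
                url = urljoin(base_url, href)
                links.append((url, elem.get("name", href)))

    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(name))
        for url, name in links
    )
    return "<html><body><ul>\n{}\n</ul></body></html>".format(items)


# Shortened, hypothetical copy of the listing from the spider output.
SAMPLE = """<?xml version="1.0"?>
<svn version="1.3.0 (r17949)" href="http://subversion.tigris.org/">
  <index rev="170" path="/">
    <dir name="docs" href="docs/" />
    <dir name="tools" href="tools/" />
  </index>
</svn>"""

if __name__ == "__main__":
    print(svn_index_to_html(SAMPLE, "http://localhost/svn/"))
```

Resolving each href against the listing URL keeps the generated links absolute, so they stay valid wherever the HTML ends up being stored.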

Regards,

Cas


On 1/16/07, Brian Ling <brian_ling_gandj@yahoo.com> wrote:
> Hi Cas,
>
> I don't want to follow the subversion.tigris.org link, just the
> ones from the subversion dirs that are in the form:
>
> <dir name="dtupdates" href="dtupdates/" />
>
> As shown in the spider.pl output at the end of the
> email ;-)
>
> Cheers,
>
> Brian
>
> --- Cas Tuyn <cas.tuyn@gmail.com> wrote:
>
> > Brian,
> >
> > The spider stays within the start domain, localhost/svn; otherwise
> > it could go on and index the whole Internet. There is a setting
> > (follow-hosts or something) that allows you to say that links to
> > subversion.tigris.org may be followed. Also look at same-hosts if
> > these two hosts are actually equal but have a different domain
> > (like www.tigris.org and tigris.org).
> >
> > Regards,
> >
> > Cas
> >
> >
> > On 1/16/07, Brian Ling <brian_ling_gandj@yahoo.com> wrote:
> > > Hi all,
> > >
> > > I've just started using swish-e, so sorry if this is a bit newbie.
> > >
> > > I want to index a subversion repository via its web/apache front
> > > end, but I can't seem to get spider.pl to follow the links in the
> > > default subversion output.
> > >
> > > I'm calling the spider directly with:
> > > /usr/local/lib/swish-e/spider.pl ./spider.conf
> > > It finds and outputs the main subversion page (output at end of
> > > mail) but doesn't follow any of the links. Everything appeared to
> > > install OK. I'm on OS X 10.4.8.
> > > What am I missing?
> > >
> > > spider.conf:
> > >     @servers = (
> > >         {
> > >             email              => 'test@test.co.uk',
> > >             base_url           => 'http://localhost/svn/',
> > >             same_hosts         => [ '127.0.0.1' ],
> > >             use_default_config => 1,
> > >             link_tags          => [qw/ a frame dir /],
> > >         },
> > >     );
> > >     1;
> > >
> > > output from spider.pl:
> > >
> > > /usr/local/lib/swish-e/spider.pl: Reading parameters from './spider.conf'
> > > Path-Name: http://localhost/svn/
> > > Content-Length: 1232
> > > Document-Type: xml*
> > >
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="/xslt/svnindex.xsl"?>
> > > <!DOCTYPE svn [
> > >   <!ELEMENT svn   (index)>
> > >   <!ATTLIST svn   version CDATA #REQUIRED
> > >                   href    CDATA #REQUIRED>
> > >   <!ELEMENT index (updir?, (file | dir)*)>
> > >   <!ATTLIST index name    CDATA #IMPLIED
> > >                   path    CDATA #IMPLIED
> > >                   rev     CDATA #IMPLIED>
> > >   <!ELEMENT updir EMPTY>
> > >   <!ELEMENT file  EMPTY>
> > >   <!ATTLIST file  name    CDATA #REQUIRED
> > >                   href    CDATA #REQUIRED>
> > >   <!ELEMENT dir   EMPTY>
> > >   <!ATTLIST dir   name    CDATA #REQUIRED
> > >                   href    CDATA #REQUIRED>
> > > ]>
> > > <svn version="1.3.0 (r17949)"
> > >      href="http://subversion.tigris.org/">
> > >   <index rev="170" path="/">
> > >     <dir name="SubversionNotes" href="SubversionNotes/" />
> > >     <dir name="altirsCustomInventory" href="altirsCustomInventory/" />
> > >     <dir name="appsMan" href="appsMan/" />
> > >     <dir name="artwork" href="artwork/" />
> > >     <dir name="bootDVD-CD" href="bootDVD-CD/" />
> > >     <dir name="docs" href="docs/" />
> > >     <dir name="dtupdates" href="dtupdates/" />
> > >     <dir name="localMachine" href="localMachine/" />
> > >     <dir name="netlogon" href="netlogon/" />
> > >     <dir name="tools" href="tools/" />
> > >   </index>
> > > </svn>
> > >
> > > Summary for: http://localhost/svn/
> > > Connection: Close:     1  (1.0/sec)
> > >       Total Bytes: 1,232  (1232.0/sec)
> > >        Total Docs:     1  (1.0/sec)
> > >       Unique URLs:     1  (1.0/sec)
> > >
> > > Thanks for any pointer,
> > >
> > > Brian


-- 
Bookmark http://kayakfun.info/salsagids/ for the best salsa parties!
Received on Tue Jan 16 08:20:26 2007