Re: following xml relative links with spider.pl

From: dennis lastor <dennis.lastor(at)not-real.gmail.com>
Date: Tue Jan 16 2007 - 21:08:50 GMT
Can someone please remove me from this list?  I have tried to unsubscribe
several times without any luck.

Thanks!
Dennis



On 1/16/07, Cas Tuyn <cas.tuyn@gmail.com> wrote:
>
> Brian,
>
> OK, I overlooked that you are using XML. We index HTML, PDF and DOC, no
> XML, and they all contain real hyperlinks.
>
> Our approach is to convert everything to HTML before
> offering it to the indexer. Maybe this means that any links are also
> converted to HTML links via your style sheet and then followed, but I
> doubt it, as spidering and content indexing are two separate processes.
> Maybe the usual experts can confirm this?
>
> You may need to convert the XML to HTML first, and index that HTML
> website.
>
> Regards,
>
> Cas
>
>
> On 1/16/07, Brian Ling <brian_ling_gandj@yahoo.com> wrote:
> > Hi Cas,
> >
> > I don't want to follow the subversion.tigris.org link
> > just the ones from the subversion dirs that are in the
> > form:
> >
> > <dir name="dtupdates" href="dtupdates/" />
> >
> > As shown in the spider.pl output at the end of the
> > email ;-)
> >
> > Cheers,
> >
> > Brian
> >
> >
> >
> >
> >
> > --- Cas Tuyn <cas.tuyn@gmail.com> wrote:
> >
> > > Brian,
> > >
> > > The spider stays within the start domain localhost/svn, otherwise it
> > > could go on and index the whole Internet. There is a setting
> > > (follow-hosts or something) that allows you to say that links to
> > > subversion.tigris.org may be followed. Also look at same_hosts if
> > > these two hosts are actually equal but have a different domain (like
> > > www.tigris.org and tigris.org).
> > >
> > > Regards,
> > >
> > > Cas
> > >
> > >
> > > > On 1/16/07, Brian Ling <brian_ling_gandj@yahoo.com> wrote:
> > > > Hi all,
> > > >
> > > > I've just started using swish-e, so sorry if this is a
> > > > bit newbie.
> > > >
> > > > I want to index a subversion repository via its
> > > > web/apache front end, but I can't seem to get
> > > > spider.pl to follow the links in the default
> > > > subversion output.
> > > >
> > > > I'm calling the spider directly with:
> > > > /usr/local/lib/swish-e/spider.pl ./spider.conf
> > > > It finds and outputs the main subversion page (output at
> > > > end of mail) but doesn't follow any of the links.
> > > > Everything appeared to install OK. I'm on OS X 10.4.8.
> > > > What am I missing?
> > > >
> > > > spider.conf:
> > > >     @servers = (
> > > >         {
> > > >             email              => 'test@test.co.uk',
> > > >             base_url           => 'http://localhost/svn/',
> > > >             same_hosts         => [ '127.0.0.1' ],
> > > >             use_default_config => 1,
> > > >             link_tags          => [qw/ a frame dir /],
> > > >         },
> > > >     );
> > > >     1;
> > > >
> > > > output from spider.pl:
> > > >
> > > > /usr/local/lib/swish-e/spider.pl: Reading parameters from './spider.conf'
> > > > Path-Name: http://localhost/svn/
> > > > Content-Length: 1232
> > > > Document-Type: xml*
> > > >
> > > > <?xml version="1.0"?>
> > > > <?xml-stylesheet type="text/xsl" href="/xslt/svnindex.xsl"?>
> > > > <!DOCTYPE svn [
> > > >   <!ELEMENT svn   (index)>
> > > >   <!ATTLIST svn   version CDATA #REQUIRED
> > > >                   href    CDATA #REQUIRED>
> > > >   <!ELEMENT index (updir?, (file | dir)*)>
> > > >   <!ATTLIST index name    CDATA #IMPLIED
> > > >                   path    CDATA #IMPLIED
> > > >                   rev     CDATA #IMPLIED>
> > > >   <!ELEMENT updir EMPTY>
> > > >   <!ELEMENT file  EMPTY>
> > > >   <!ATTLIST file  name    CDATA #REQUIRED
> > > >                   href    CDATA #REQUIRED>
> > > >   <!ELEMENT dir   EMPTY>
> > > >   <!ATTLIST dir   name    CDATA #REQUIRED
> > > >                   href    CDATA #REQUIRED>
> > > > ]>
> > > > <svn version="1.3.0 (r17949)"
> > > >      href="http://subversion.tigris.org/">
> > > >   <index rev="170" path="/">
> > > >     <dir name="SubversionNotes" href="SubversionNotes/" />
> > > >     <dir name="altirsCustomInventory" href="altirsCustomInventory/" />
> > > >     <dir name="appsMan" href="appsMan/" />
> > > >     <dir name="artwork" href="artwork/" />
> > > >     <dir name="bootDVD-CD" href="bootDVD-CD/" />
> > > >     <dir name="docs" href="docs/" />
> > > >     <dir name="dtupdates" href="dtupdates/" />
> > > >     <dir name="localMachine" href="localMachine/" />
> > > >     <dir name="netlogon" href="netlogon/" />
> > > >     <dir name="tools" href="tools/" />
> > > >   </index>
> > > > </svn>
> > > >
> > > > Summary for: http://localhost/svn/
> > > > Connection: Close:     1  (1.0/sec)
> > > >       Total Bytes: 1,232  (1232.0/sec)
> > > >        Total Docs:     1  (1.0/sec)
> > > >       Unique URLs:     1  (1.0/sec)
> > > >
> > > > Thanks for any pointer,
> > > >
> > > > Brian
>
>
> --
> Bookmark  http://kayakfun.info/salsagids/  for the best salsa parties!
>
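
The suggestion quoted above, converting the XML to HTML first and indexing that, can be tried outside of spider.pl. What follows is only a minimal Python sketch under stated assumptions, not part of swish-e and not a confirmed fix: it assumes the index looks like the output Brian posted (an <svn> root with one <index> holding <dir> and <file> entries), and the function name svn_index_to_html and the shortened SAMPLE document are made up for illustration. It resolves each relative href against the base URL and writes ordinary <a href> links, which an HTML link extractor would recognise.

```python
from html import escape
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def svn_index_to_html(xml_text: str, base_url: str) -> str:
    """Render the <dir>/<file> entries of an svn XML index as HTML links.

    Hypothetical helper: the element and attribute names are taken from
    the output posted in this thread, nothing else is assumed.
    """
    root = ET.fromstring(xml_text)
    index = root.find("index")
    if index is None:
        raise ValueError("no <index> element found")

    items = []
    for entry in index:
        if entry.tag not in ("dir", "file"):
            continue  # skip <updir/> and anything unexpected
        href = entry.get("href")
        if not href:
            continue
        # Relative hrefs such as "dtupdates/" become absolute URLs.
        url = urljoin(base_url, href)
        label = entry.get("name") or href
        items.append(
            f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        )

    title = escape(index.get("path") or base_url)
    return (
        f"<html><head><title>{title}</title></head><body><ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>"
    )


# Shortened version of the posted index, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<svn version="1.3.0 (r17949)" href="http://subversion.tigris.org/">
  <index rev="170" path="/">
    <dir name="docs" href="docs/" />
    <dir name="dtupdates" href="dtupdates/" />
  </index>
</svn>"""

if __name__ == "__main__":
    print(svn_index_to_html(SAMPLE, "http://localhost/svn/"))
```

Only the entries inside <index> are turned into links, so the href on the <svn> root (subversion.tigris.org) is left out, which matches what Brian asked for. How the generated HTML is then served or handed to the spider is left open here.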



Received on Tue Jan 16 13:08:55 2007