On Thu, Jun 12, 2003 at 01:43:40PM -0700, Jody Cleveland wrote:
> Hello,
>
> I've got a site I spider using swish-e. There are certain portions of
> their pages they don't want spidered. For a site I've got local on that
> machine, I just add a <!-- Swishcommand noindex --> before the chunk I
> don't want indexed. Then I pick up again with <!-- Swishcommand index
> -->. That doesn't seem to work when spidering. This person is putting
> those tags before and after certain links in pages they don't want
> spidered. Is there a different line I should have her put in there?
When in doubt... test!
moseley@bumby:~/apache$ GET http://localhost/apache/noindex.html
<html>
<head><title>noindex</title></head>
<body>
indexthisword
<!-- Swishcommand noindex -->
butnotthisword
<!-- Swishcommand index -->
thisisok
</body>
</html>
moseley@bumby:~/apache$ swish-e -S http -i http://localhost/apache/noindex.html -T indexed_words -v0
Adding:[1:swishdefault(1)] 'noindex' Pos:2 Stuct:0x7 ( HEAD TITLE FILE )
Adding:[1:swishdefault(1)] 'indexthisword' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'thisisok' Pos:6 Stuct:0x9 ( BODY FILE )
moseley@bumby:~/apache$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/noindex.html | swish-e -S prog -i stdin -T indexed_words -v0
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Summary for: http://localhost/apache/noindex.html
Total Bytes: 163 (163.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
Adding:[1:swishdefault(1)] 'noindex' Pos:2 Stuct:0x7 ( HEAD TITLE FILE )
Adding:[1:swishdefault(1)] 'indexthisword' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'thisisok' Pos:6 Stuct:0x9 ( BODY FILE )
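Hmm -- both methods skip the noindex section, so the comments do work
when spidering. But I think the word position needs to be incremented
across the skipped chunk. Above, 'indexthisword' is at Pos:5 and
'thisisok' is at Pos:6, so otherwise you could get a phrase match
across that comment....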
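To make the phrase-match concern concrete, here is a rough Python sketch.
It is not swish-e's code: the names word_positions, phrase_matches and the
bump parameter are made up for illustration, and positions start at 1
rather than at the values swish-e reports. It only models the idea that
adjacent positions allow a phrase match, and that bumping the position
when indexing resumes would prevent one.

```python
import re

# Matches the two comments used in the test page. The capture group keeps
# the command name in the output of split().
SWISH_COMMAND = re.compile(
    r"<!--\s*Swishcommand\s+(noindex|index)\s*-->", re.IGNORECASE
)

BODY = """indexthisword
<!-- Swishcommand noindex -->
butnotthisword
<!-- Swishcommand index -->
thisisok
"""


def word_positions(body, bump=0):
    """Return (word, position) pairs for words outside noindex regions.

    bump is a hypothetical knob: extra positions added when indexing
    resumes after a noindex region.
    """
    positions = []
    pos = 1
    indexing = True
    # With one capture group, split() alternates text and command names.
    for i, part in enumerate(SWISH_COMMAND.split(body)):
        if i % 2 == 1:
            resume = part.lower() == "index"
            if resume and not indexing:
                pos += bump
            indexing = resume
            continue
        if indexing:
            for word in part.split():
                positions.append((word, pos))
                pos += 1
    return positions


def phrase_matches(positions, first, second):
    """True if second sits at the position right after first."""
    lookup = dict(positions)
    return (
        first in lookup
        and second in lookup
        and lookup[second] == lookup[first] + 1
    )


if __name__ == "__main__":
    for bump in (0, 1):
        found = phrase_matches(
            word_positions(BODY, bump), "indexthisword", "thisisok"
        )
        print(f"bump={bump}: phrase match across comment = {found}")
```

With bump=0 the two words end up on adjacent positions, as in the
swish-e output above, and the phrase check succeeds. With bump=1 there
is a gap and it fails.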
--
Bill Moseley
moseley@hank.org
Received on Thu Jun 12 23:35:28 2003