On Wed, Jun 18, 2003 at 01:27:58PM -0700, Cleveland@mail.winnefox.org wrote:
> > When in doubt... test!
> >
> > moseley(at)not-real.bumby:~/apache$ GET http://localhost/apache/noindex.html
>
> > http://localhost/apache/noindex.html -T indexed_words -v0
>
> It all works right. Until you add in a <a href> tag pointing to another
> file. What it does is, it'll skip the word, but still follow that link
> and also index the words in the file it's linked to. I remove the link,
> and it ignores that other file. Is there something I'm missing?

Yes, you are confusing the function of swish-e with that of the spider.
The spider decides which files to send to swish-e (and thus which links
to follow). The "noindex" tag only tells swish-e not to *index* the
content between the tags; there's no way for swish-e to tell the spider
to ignore links found in that noindex section.
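
A rough sketch of that split, in Python rather than the spider's own code. The comment markers, function names, and sample page are made up for illustration and are not swish-e's actual syntax or API. The point is only that link extraction and word indexing are separate passes, and the noindex markers affect just the second one:

```python
import re
from html.parser import HTMLParser

# Hypothetical markers for this sketch; check the swish-e docs for the real syntax.
NOINDEX_BLOCK = re.compile(r"<!--\s*noindex\s*-->.*?<!--\s*index\s*-->", re.DOTALL)


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, wherever they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class WordCollector(HTMLParser):
    """Collects whitespace-separated words from text nodes."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())


def spider_links(html):
    # The "spider" pass: it sees the raw page and knows nothing about noindex.
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def indexed_words(html):
    # The "indexer" pass: drop the noindex sections, then collect words.
    parser = WordCollector()
    parser.feed(NOINDEX_BLOCK.sub(" ", html))
    parser.close()
    return parser.words


PAGE = """<html><body>
<p>apple</p>
<!-- noindex -->
<p>secret <a href="other.html">banana</a></p>
<!-- index -->
<p>cherry</p>
</body></html>"""

print(spider_links(PAGE))   # the link inside the noindex section is still found
print(indexed_words(PAGE))  # the words inside the noindex section are skipped
```

So "secret" and "banana" never reach the index, but other.html still gets fetched and indexed as its own document.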

If you don't want a page indexed, exclude it with robots.txt, or use a
meta robots tag on the linking page to tell the spider not to follow
its links.
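
For the robots.txt route, here is a small sketch of how a well-behaved spider would apply the rules, using Python's standard urllib.robotparser. The Disallow path, the URLs, and the agent name are invented for the example. For the meta tag route, the usual form is <meta name="robots" content="nofollow"> in the head of the page whose links should be ignored.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is an example only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /apache/private/
"""


def allowed(url, agent="example-spider"):
    """Return True if the rules above let the named agent fetch the URL."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return rules.can_fetch(agent, url)


print(allowed("http://localhost/apache/noindex.html"))
print(allowed("http://localhost/apache/private/other.html"))
```

Anything under the disallowed path is never requested, so it never reaches swish-e at all.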

The spider extracts links *before* calling the filter content function,
so you can't use that function to remove links. Perhaps the order of
processing could be changed so that you could simply modify the content
(e.g. remove all content between two comments) before links are
extracted. I'll have to look when I get back. My wireless connection
isn't working well here:
http://www.forwolves.org/ralph/wpages/graphics/little-redfish-lk2.jpg
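
To make the ordering point about link extraction and content filtering concrete, here is a sketch of the two orders side by side. It is Python, not the spider's code, and the skip markers and function names are hypothetical:

```python
import re
from html.parser import HTMLParser

# Hypothetical pair of comments marking content to remove.
SKIP_BLOCK = re.compile(r"<!--\s*skip\s*-->.*?<!--\s*/skip\s*-->", re.DOTALL)


class HrefParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def extract_links(html):
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.links


def filter_content(html):
    # Stand-in for a user-supplied filter: drop everything between the markers.
    return SKIP_BLOCK.sub("", html)


def links_extract_first(html):
    # Order described above: links are taken from the unfiltered page,
    # so filtering afterwards cannot remove any of them.
    links = extract_links(html)
    filter_content(html)
    return links


def links_filter_first(html):
    # Proposed order: filter first, then extract links from what is left.
    return extract_links(filter_content(html))


PAGE = (
    '<a href="keep.html">keep</a>'
    '<!-- skip --><a href="other.html">other</a><!-- /skip -->'
)

print(links_extract_first(PAGE))
print(links_filter_first(PAGE))
```

With the filter running first, other.html would never be queued, which is the behavior being asked for.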
--
Bill Moseley
moseley@hank.org
Received on Sat Jun 21 14:02:31 2003