Re: Re: an exclusion question

From: Mark Gaulin <gaulin(at)not-real.globalspec.com>
Date: Thu Jan 28 1999 - 20:33:27 GMT
Oh, I thought you did something along the lines of the FileRules,
which can test the entire path. My mistake.

This gets me thinking about what we have now and what we might
want for the file system and http methods. Here is a summary of what 
swish can do now:
* IndexOnly can be used right now to limit the scope using regexp but
it currently works only with file system jobs.
* FileRules is used to exclude files by pathname, filename, directory
(somehow that is different from a path), and title using "contains", and
there is one more option for "filename is".  These only apply to file
system jobs.
* NoContents tests for file and URL suffixes. [There is a minor flaw in
the way it is implemented now: if FileSystem is not compiled into swish-e
then NoContents will not have any effect for HTTP. Fix: "NoContents" should
be parsed along with the common directives.]

I guess we could just try to implement IndexOnly and FileRules for HTTP
but I can imagine some things that would be hard to specify for the spider.
Ex: I might want to exclude a URL from the index but still want
to have it crawled, so I can get to the URLs that it points to.

I'm thinking that there is a way to use regex to neatly describe URLs (and
files) that should be included/excluded in the index and also traversed
or not traversed (recursing into a directory for file system, following links
in a URL for HTTP).
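
To make the idea concrete, here is a rough sketch (in Python, not swish
code) of treating "index it?" and "traverse it?" as two separate answers
chosen by regex. The rule table, the patterns, and the decide() helper are
all made up for illustration; nothing like this exists in swish today.

```python
import re
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    pattern: str   # regex tested against the full path or URL
    index: bool    # should the document itself go into the index?
    follow: bool   # should we recurse into it / follow its links?


# Hypothetical rules; the first matching rule wins.
RULES: List[Rule] = [
    Rule(r"\.wwwstat\.html$", index=False, follow=False),
    Rule(r"/toc/[^/]*\.html$", index=False, follow=True),
    Rule(r"\.(gif|jpg)$", index=False, follow=False),
]

# Anything that matches no rule is both indexed and traversed.
DEFAULT: Tuple[bool, bool] = (True, True)


def decide(target: str) -> Tuple[bool, bool]:
    """Return (index, follow) for a file path or URL."""
    for rule in RULES:
        if re.search(rule.pattern, target):
            return (rule.index, rule.follow)
    return DEFAULT
```

With these made-up rules, a table-of-contents page under /toc/ would be
crawled for its links but left out of the index, which is the case that is
hard to express with IndexOnly or FileRules alone.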

Maybe a little later...

	Mark


At 11:32 AM 1/28/99 -0800, Yann Stettler wrote:
>Mark Gaulin wrote:
>> 
>> Yann Stettler <stettler@cohprog.com> has a patch that
>> does just what you are asking. I do not see it on the ftp site
>> so you may want to contact him directly.
>> 
>> (We should get that HTTP NoContents patch posted, yes?)
>
>Hello,
>The NoContents stuff was included in the latest version...
>
>But I don't think that it will work for what Bruce wants
>to do:
>
>>>Specifically I'd like it to ignore files of the form *.wwwstat.html.  I
>>>tried adding .wwwstat.html to the NoContents directive but that reduced my
>>>contents to 0 since it excluded all .html files :-).  Other than stuffing
>
>I didn't want to do anything fancy so I just put back the
>procedure used for the NoContents that was in the filesystem
>method into the HTTP one. The problem is that there is a
>single function, isoksuffix(), that is used to check if the
>suffix of a file is listed in the argument of the directive.
>This function is sometimes used to check if a suffix is inside
>the list and so the file should be kept (for example, for the
>"IndexOnly" directive) and sometimes to check if the suffix
>is inside the list but the file should be discarded (NoContents).
>
>It may seem similar, but the interpretation in some cases shouldn't
>be the same. For example, if a file doesn't have any suffix,
>it should be discarded when using "IndexOnly" but it should
>be kept when using "NoContents"...
>
>In the same way, in the original function, it was assumed
>that only the characters _after_ the last "." are important.
>So when trying to put "NoContents .wwwstat.html", the result
>is the same as if you had put ".html"...
>Personally, I consider that it's a bug...
>
>Hmm, I also noticed that there are several optimizations that
>could be done in this function... like stopping testing all
>the suffixes as soon as we find a matching one.
>
>I guess that I will write a new function and post a patch
>this weekend..
>
>Cheers,
>Yann Stettler
>
>-- 
>-------------------------------------------------------------------
>TheNet - Internet Services AG              CohProg SaRL
>stettler@thenet.ch                         stettler@cohprog.com
>http://www.thenet.ch/                      http://www.cohprog.com/
>                              ---**---
>Anime and Manga Services                   http://www.animanga.com/
> 
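
The suffix behaviour Yann describes in the quoted message can be modelled
roughly as below. This is a Python sketch, not the actual isoksuffix() code
from swish: old_match() only imitates the "characters after the last dot"
behaviour as described, and new_match() is one possible whole-suffix test
that stops at the first hit. Both function names are invented here.

```python
import os
from typing import Iterable


def last_dot_suffix(name: str) -> str:
    """Return the part of the base name from the last '.' on, or ''."""
    base = os.path.basename(name)
    dot = base.rfind(".")
    return base[dot:] if dot != -1 else ""


def old_match(name: str, suffixes: Iterable[str]) -> bool:
    """Model of the described behaviour: only the text after the last
    '.' is compared, for the file and for each listed suffix alike."""
    file_suffix = last_dot_suffix(name)
    if not file_suffix:
        return False
    return any(file_suffix == last_dot_suffix(s) for s in suffixes)


def new_match(name: str, suffixes: Iterable[str]) -> bool:
    """Whole-suffix test: the name must end with the listed suffix.
    any() stops at the first matching suffix."""
    return any(name.endswith(s) for s in suffixes)


no_contents = [".wwwstat.html"]

# Under the old model every .html file is caught by ".wwwstat.html".
old_hits_plain_html = old_match("index.html", no_contents)

# The whole-suffix test only catches the statistics pages.
new_hits_plain_html = new_match("index.html", no_contents)
new_hits_stats_page = new_match("jan.wwwstat.html", no_contents)
```

A file with no suffix matches nothing in either version, so the caller still
has to interpret "no match" per directive: discard the file for IndexOnly,
keep it for NoContents.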
Received on Thu Jan 28 12:27:28 1999