Re: Index only htm

From: Bill Moseley <moseley(at)>
Date: Sun Dec 07 2003 - 14:18:17 GMT
On Sun, Dec 07, 2003 at 02:43:43AM -0800, John Angel wrote:
> > > Hi, how to index only directories (/) and html extensions?
> >
> > What are (/) directories?
> http://site/something/

Oh, directory listings that the server might return if you don't request 
a document and the server is not configured to automatically return 
index.html (if there is one):

    # directory listings (maybe) - return ok if ends in a slash:
    return 1 if $uri->path =~ m[/$];

    # or only index .html or .htm files
    return 1 if $uri->path =~ m[\.html?$];

    # else skip this document
    return 0;

Now, that makes the assumption

- that .htm and .html are text/html,

- that a path that ends in a slash returns a directory listing and not 
  an audio file or some other non text/html file, and

- that there are links actually pointing to those "directories".

You would likely follow up that test with a test_response that checks 
for text/html or text/plain, of course.
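
Putting the two tests together, a rough sketch might look like the 
following.  It assumes the spider hands the callbacks a URI object as 
the first argument and, for test_response, an HTTP::Response object as 
the third; check your spider's documentation for the exact argument 
list before relying on it.

```perl
use strict;
use warnings;

# Sketch only.  Assumes $uri is a URI object and $response is an
# HTTP::Response object (both are assumptions about the caller).

# Accept only "directory" URLs and .htm/.html files.
sub test_url {
    my ($uri) = @_;
    my $path = $uri->path;
    return 1 if $path =~ m[/$];          # path ends in a slash
    return 1 if $path =~ m[\.html?$];    # .htm or .html
    return 0;                            # skip everything else
}

# Confirm the server really sent text/html or text/plain.
sub test_response {
    my ($uri, $server, $response) = @_;
    # content_type is expected to return the bare media type, lowercased
    return $response->content_type =~ m[^text/(?:html|plain)$] ? 1 : 0;
}
```

The URL test is cheap and runs before any request is made; the 
response test then catches the cases where the assumptions listed 
above turn out to be wrong.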

BTW -- if you return false from a test_response the connection is 
aborted.  This will break a Keep-Alive connection.  This is because all 
fetches are currently GET requests.  It's been on my todo list for a 
while to have an option to do HEAD requests for test_response tests, 
which would allow the connection to remain open.  That would only make a 
difference on web servers that allowed a large number of keep-alive 
requests before closing the connection.

I use "GET" because there are (were?) some servers that were not 
correctly responding to HEAD requests.

> > > I am not familiar with regexp, should be something like this in
> test_url:
> > >
> > > return 0 if $uri->path =
> /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/;
> >
> > return 0 if $uri->path =~ /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/;
> >                        ^^
> >
> > That says to return false if the path part of the URL ends in those file
> > extensions -- meaning NOT to index those documents.
> Ok, then it should be:
> return 0 if $uri->path = /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/;

No.  That's a syntax error.  And my example was wrong (as I cut-n-pasted 
your example):

  return 1 if $uri->path =~ /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/;

Or if that's your last test, simply:

  return $uri->path =~ /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/;

which returns true if it matches, else false.  
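
One caveat with returning the match directly: if the callback happens 
to be called in list context, a match with capture parentheses returns 
the captured text (e.g. "php") rather than 1.  That is still true, so 
it works, but a ternary gives a plain 1 or 0.  A small sketch, using a 
hypothetical helper named wanted_extension that takes a path string 
instead of a URI object:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns 1 if the path ends in
# one of the listed extensions, otherwise 0.
sub wanted_extension {
    my ($path) = @_;
    return $path =~ /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/ ? 1 : 0;
}

# Show the decision for a few sample paths.
for my $path ('/a/index.php', '/a/logo.gif', '/a/') {
    printf "%-14s %s\n", $path, wanted_extension($path) ? 'index' : 'skip';
}
```

Note that a path ending in a slash fails this test, so keep the 
separate slash test if you also want directory URLs.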

> > > Will that work for queries?
> >
> > What do you mean queries?
> http://site/something/index.php?a=b&b=c

$uri->path in that case contains "/something/index.php" only, and not the 
query string.  So, yes it will work for that.
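
If you want to check this for yourself, here is a quick sketch using 
the URI module directly (outside the spider).  The path method returns 
the whole path component, leading directories included, and the query 
string is available separately:

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://site/something/index.php?a=b&b=c');

print $uri->path,  "\n";   # path component only, no query string
print $uri->query, "\n";   # query string, without the leading "?"

# The extension test sees only the path, so the query does not interfere.
my $ok = $uri->path =~ /\.(html|htm|shtml|asp|php|txt|phtml|cfm|jsp)$/ ? 1 : 0;
print "extension test: $ok\n";
```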

Bill Moseley
Received on Sun Dec 7 14:18:20 2003