Re: RE: LWP,HTTP and HTML modules

From: Mark Gaulin <gaulin(at)not-real.globalspec.com>
Date: Wed Jan 20 1999 - 14:09:30 GMT
This seems like too much of a debate for such a simple feature...

Allowing the person who configures swish to specify *some* file
extensions that he does not want indexed/spidered is just a feature.
If it were implemented not as "file extensions to avoid" but as "regex
patterns to avoid", then no one would be arguing... it is simply
a tool to use when the situation calls for it.

We all know that some of the big search engines (altavista, infoseek, 
etc, etc) do not try to index pages that have a certain "look" to them...
some skip "cgi" or ".exe", others anything with a "?" or "&" in them. 
I think I could say fairly certainly that AltaVista is not trying to download
any of the gif files from my site.  It does not know *for sure* if those
are in fact image files, but it doesn't care either.  What to skip is
determined by the people who built the index.

Now that I look at it this way, adding a feature to support regex patterns 
to be avoided sounds like a good idea, just for "completeness". (Ok,
"completeness" may not be a goal of swish-e, but it doesn't hurt to
frame it that way... does it?)
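
To make the "regex patterns to avoid" idea concrete, here is a minimal
sketch in Python. It is not swish-e code or configuration: the pattern
list and the name should_skip are made up for illustration, and the
patterns only mirror the examples mentioned above (image files, "cgi",
".exe", and URLs containing "?" or "&").

```python
import re

# Hypothetical exclusion list; whoever builds the index picks the patterns.
AVOID_PATTERNS = [
    r"\.(gif|jpe?g|exe)$",  # skip by "look": common binary/image extensions
    r"/cgi-bin/",           # skip CGI paths
    r"[?&]",                # skip anything with a query string
]

# Compile once so each URL check is just a series of searches.
_AVOID = [re.compile(p, re.IGNORECASE) for p in AVOID_PATTERNS]


def should_skip(url):
    """Return True if the URL matches any configured pattern to avoid."""
    return any(p.search(url) for p in _AVOID)
```

A spider would call should_skip(url) before fetching. Nothing here claims
to know what the file really is; it only encodes what the person
configuring the index does not want fetched.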

Having said all of that, I am not saying that someone specific "must go
implement this right now, or else!"...  I'm just saying that this feature is
not wrong or a sign of stupidity, and in some cases, is highly desirable.

	Mark

At 04:58 AM 1/20/99 -0800, David Norris wrote:
>Here are a few(?) good reasons why one can't assume things, one way or
>another, based on file extensions:
>
>Apache only knows the MIME type of a file based on what you, the server
>administrator, put in the mime.types, magic, and various other config files.
>If you fail to define a MIME type for a file, Apache doesn't have a clue and
>calls it whatever you defined, in the httpd.conf, as the default MIME type,
>usually text/plain.  So, configure your server correctly.  Incorrect MIME
>types break everything, not just the SWISH-E spider.  Well, Internet
>Explorer for Windows generally ignores MIME types, so it won't break that.
>
>Now, if you don't know this (which apparently someone doesn't), Unix systems
>normally get the file type based on file byte-code headers.  Unix systems
>have a magic file to provide file-type to byte-code mapping.  From a
>terminal on a Unix system, type 'file /usr/sbin/httpd'  You should get a
>detailed description of the type of that file.  On my Linux 2.0 system 'file
>/usr/sbin/httpd' returns "ELF 32-bit LSB executable, Intel 80386, version 1,
>dynamically linked, and stripped."  Hmmm, it doesn't do that based on
>extension.  It reads the byte-code headers embedded at the beginning of
>every file, which form the basis of the various file types.  Apache easily
>does this, as well.  File extensions are exactly squat on Unix.  MacOS works
>the same way.  File extensions still exist to make it easy to share stuff
>with Windows users.  Everyone else on the planet doesn't need them.  Many
>people use them as a quick and dirty way to specify the MIME type of a file
>for which they do not have a byte-code pattern mapping.  Others just don't
>know any better.  The rest are using Windows.
>
>http://www.apache.org/docs/mod/mod_mime_magic.html
>
>You can override MIME, as mentioned, in various locations.  The ForceType
>directive would rarely need to be used on a properly configured system.
>Perhaps, if you wanted to force a script handler to parse a file extension
>it normally wouldn't.  Forcing PHP3, which normally uses .php3, to handle a
>file with a .html extension would be an example of this.
>
>The file extension is almost completely irrelevant unless you are on
>Windows.  On 32-bit Windows it is only relevant because of the way Windows
>HTTP servers are written.  HTTP servers don't have to follow the rules of
>the OS regarding much of anything.  For instance, a not-so-unusual Apache
>configuration might result in this:
>/www/share/index.html.gz.en
>http://localhost/index
>http://localhost/index.html
>http://localhost/index.html.gz
>http://localhost/index.html.gz.en
>
>These URLs all point to the same location in the file system.
>
>This file is the English version of a gzipped HTML file.  This file has the
>MIME type of application/x-compressed-gzip.  However, it might be called as
>/, /index, /index.html, etc over HTTP.  Assuming it is a text/html file,
>based on presence of a .html extension, would be a disaster.
>
>What about a URL that doesn't exist in the filesystem?  For instance:
>http://localhost/sports/football/scores/11-Jan-1999/
>
>Might refer to a handler called sports which is selecting football scores
>for January 11 1999 from an SQL source.  How do you determine the type of
>file by its extension?  I know, then you check the MIME.  That sounds
>perfectly logical on the surface.  But, it is fundamentally flawed in the
>real world.
>
>One has to understand that you can't assume anything with HTTP.  That's why
>we have standard headers and responses defined in the HTTP specs.  I would
>believe the server's Content-Type headers over any guessing based on where
>periods lie in the URL.  If you can't make your server send the correct
>headers, then you should either fix it or hack up the script yourself.  If
>your server doesn't support HTTP correctly, fire it like a bad employee.
>A broken server does more damage than good.
>
>Just some food for thought in the great MIME debate.
>
>People are content with what they have, until they realize what they don't
>have.  Thus exists Windows.
>
>,David Norris
>
>World Wide Web - http://www.geocities.com/CapeCanaveral/Lab/1652/
>Page via mail - 412039@pager.mirabilis.com
>ICQ Universal Internet Number - 412039
>E-Mail - kg9ae@geocities.com
> 
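
The point in the quoted message about believing the server's Content-Type
header over the URL can be sketched as follows. This is only an
illustration in Python, not how the SWISH-E spider works: the names
media_type and should_index and the set of indexable types are
assumptions chosen for the example.

```python
# Hypothetical set of media types the indexer is willing to parse.
INDEXABLE_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header):
    """Return the bare media type: lowercased, parameters such as charset dropped."""
    return content_type_header.split(";", 1)[0].strip().lower()


def should_index(content_type_header):
    """Decide from the server's Content-Type header alone, ignoring the URL."""
    return media_type(content_type_header) in INDEXABLE_TYPES
```

Under this sketch, a response for http://localhost/index.html served as
application/x-compressed-gzip is not indexed even though the URL ends in
.html, while an extensionless URL such as
http://localhost/sports/football/scores/11-Jan-1999/ is indexed if the
server declares it text/html.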
Received on Wed Jan 20 06:06:23 1999