
filtered filenames

From: Bill Conlon <bill(at)not-real.tothept.com>
Date: Tue Nov 30 2004 - 01:25:42 GMT

When filtering, spider.pl extracts the file name from the URI:

         my $doc = $filter->convert(
             document     => $content_ref,
             name         => $response->base,
             content_type => $content_type,
         );

This works fine when the file is served from the file system, but not
when it is served out of a database. There the filename is not present
in the URI; it appears in the Content-Disposition header instead.
Here's an example of the header output from swish-e:

----HEADERS for http://oakhill.tothept.com/viewdoc.taf?_uid1=71 ---
Connection: close
Date: Sun, 28 Nov 2004 12:09:49 GMT
Accept-Ranges: bytes
Server: Apache/2.0.48 (Unix)
Content-Length: 123392
Content-Type: application/msword
Last-Modified: 2004-10-27 13:11:13
Client-Date: Sun, 28 Nov 2004 12:09:49 GMT
Client-Peer: 66.201.42.33:80
Client-Response-Num: 1
Content-Disposition: inline; filename=test.doc
-----END HEADERS----

In this case the document name will be stored in the index as
'viewdoc.taf?_uid1=71' instead of 'test.doc'.  But if the same URL is
viewed in a browser, the file will be downloaded and named test.doc.
Does it make more sense to modify spider.pl to check for a filename in
the Content-Disposition header, or to do this as part of filtering?
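
To make the first option concrete, here is a minimal sketch of how a
name could be picked before calling the filter. The helper
name_for_filter is hypothetical and not part of spider.pl. It assumes
the header value has already been read, presumably with
$response->header('Content-Disposition'), and it only handles the plain
filename= parameter, quoted or unquoted.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of spider.pl: prefer the filename given
# in a Content-Disposition header and fall back to the URI otherwise.
sub name_for_filter {
    my ( $content_disposition, $uri ) = @_;

    if ( defined $content_disposition
        && $content_disposition =~
        /\bfilename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i )
    {
        # $1 is set for a quoted value, $2 for an unquoted one.
        my $name = defined $1 ? $1 : $2;

        # Keep only the last path component, in case the server sent a
        # full path.
        $name =~ s{.*[/\\]}{};

        return $name if length $name;
    }

    # No usable filename in the header: keep the current behaviour.
    return $uri;
}

# Header and URL taken from the example above.
print name_for_filter(
    'inline; filename=test.doc',
    'http://oakhill.tothept.com/viewdoc.taf?_uid1=71',
), "\n";    # prints "test.doc"
```

The returned value could then be passed as the name argument to
$filter->convert in place of $response->base. That wiring is an
assumption and has not been tested against spider.pl.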
Received on Mon Nov 29 17:25:43 2004