Re: New cgi script using metadata

From: John A. Kunze <jak(at)not-real.nlm.nih.gov>
Date: Sat Mar 11 2000 - 05:28:52 GMT
This is very interesting work.  SWISH-E spidering might really benefit
from incorporating support for this.

One application is in allowing sites to "spider themselves" and
leave big metadata files around for other spiders to pick up.
The self-spidering process gives the site a number of advantages with
regard to external spiders (provided certain spider conventions
are developed and observed):

   (a)  the site can make visible only what they want,

   (b)  the site can make stuff visible in the way they want,

   (c)  the site can make stuff visible that's normally invisible, and

   (d)  the site can deflect spider load (for well-behaved spiders).

This would all hinge on establishing conventions for pointing
spiders to the metadata files.  Why the spider/robot community
at large hasn't done this sort of thing yet I don't get (one
problem is incremental updates).  Maybe SWISH-E will do it first.

-John

--------
Date: Thu, 9 Mar 2000 15:51:40 -0800 (PST)
From: Steve Thomas <stephen.thomas@adelaide.edu.au>
To: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
Subject: [SWISH-E] New cgi script using metadata

Others might be interested in what I've been doing with swish-e and
metadata. Primarily, what I wanted to do was to be able to create files
of metadata, index them with swish-e, but have the search results point
users to the actual files described by the metadata, rather than the
metadata files.

Does that make sense? Here's a diagram:


  /web/some_file.xxx  ---- is described by ----> /meta/some_file.html
           ^                                             |
           |                                             |
      search points to                         is indexed by swish-e
           |                                             |
           |                                             v
  /cgi-bin/swish-cgi.pl   ---- searches ---->      index.swish


I've been able to do that using the PropertyNames option, with Dublin
Core metadata. Specifically, the swish-e config file contains the line:

PropertyNames DC.Identifier DC.Description

Using an extensively hacked copy of swish-cgi.pl, I've used the
dc.identifier value to replace the file path as URL for the item
returned. I'm also displaying the dc.description returned as a
description of the item. If there's no dc data returned, the script just
uses the actual file path from the index.
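
The fallback described above can be sketched roughly as follows. This is
Python rather than the Perl of swish-cgi.pl, and it is not the script's
actual code: the layout of the `hit` record and the name `display_fields`
are assumptions made for illustration, standing in for one search result
that has already been parsed.

```python
def display_fields(hit):
    """Pick the link and description to show for one search hit.

    `hit` is a hypothetical, already-parsed result record:
    {"path": ..., "title": ..., "properties": {"DC.Identifier": ..., ...}}
    """
    # Property names are compared case-insensitively.
    props = {k.lower(): v.strip() for k, v in hit.get("properties", {}).items()}
    # Prefer dc.identifier as the link target; fall back to the indexed path.
    url = props.get("dc.identifier") or hit["path"]
    # dc.description is optional; show nothing when it is absent.
    description = props.get("dc.description", "")
    return {"url": url, "title": hit.get("title", url), "description": description}


with_dc = display_fields({
    "path": "/meta/some_file.html",
    "title": "Some file",
    "properties": {"DC.Identifier": "/web/some_file.xxx",
                   "DC.Description": "An example document"},
})
without_dc = display_fields({"path": "/web/plain.html", "title": "Plain page"})
print(with_dc["url"])     # link taken from dc.identifier
print(without_dc["url"])  # link falls back to the file path
```

A hit with no Dublin Core properties is displayed exactly as it would be
without the change, so ordinary pages and metadata files can share one index.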

This all seems to work well enough. If you are interested, there's a
sample index you can search at
http://www.library.adelaide.edu.au/~sthomas/scripts/swish/search.html
and you can see the cgi script at
http://www.library.adelaide.edu.au/~sthomas/scripts/swish/swish-cgi.pl

Sample searches would be "online books" or "electronic resources".

Why?

There are several advantages to this:

1. You can create metadata for non-html files (e.g. pdf, jpeg, exe) to
describe the file, and thus include the file in your index.

2. You can (by indexing only the metadata files) tightly control exactly
which files you index -- the average web site contains lots of pages you
might not want indexed, mixed in with those you do. This makes it easy
to control which are included.

3. You can create metadata files for pages on another web site (e.g. the
online books example will return many references to pages at other
sites). This allows you to present other resources to your user in
addition to your own.
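
As a rough illustration of the first and third points, a metadata file for
a PDF (or for a page on another site) could be generated along these lines.
This is a sketch only: the helper name `metadata_page`, the sample values,
and the exact markup are assumptions, based on the idea that the properties
are carried as meta tags in the HEAD of a small HTML file.

```python
from html import escape


def metadata_page(identifier, title, description):
    """Return a small HTML page whose HEAD describes another resource."""
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        f'<meta name="DC.Identifier" content="{escape(identifier, quote=True)}">\n'
        f'<meta name="DC.Description" content="{escape(description, quote=True)}">\n'
        "</head>\n<body></body>\n</html>\n"
    )


# Hypothetical example: describe a PDF that could not be indexed directly.
page = metadata_page("/web/report.pdf", "Annual report",
                     'The "annual" report, in PDF')
print(page)
```

The identifier can just as well be a full URL on another site, which is all
that is needed to list external resources in the results.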


The script is extensible, in that several things have been added as
options in the form. First, the properties can be defined (in any
order), so you can include other things, e.g. dc.author. (Nor are you
limited to Dublin Core.) Second, you can have multiple indexes
defined in an indexpath field. For example, you could use this to let
users select from a range of possible indexes. Third, there's an option
to sort by title instead of rank. See the script for details.
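
The sort option amounts to something like the following. Again this is a
Python sketch, not the script's own code; the `rank` and `title` keys and
the sample hits are assumed for illustration.

```python
def sort_hits(hits, by="rank"):
    """Order hits by descending rank (default) or alphabetically by title."""
    if by == "title":
        # Case-insensitive so that "archives" sorts before "Electronic".
        return sorted(hits, key=lambda h: h["title"].lower())
    return sorted(hits, key=lambda h: h["rank"], reverse=True)


hits = [
    {"rank": 1000, "title": "Online books"},
    {"rank": 350, "title": "Electronic resources"},
    {"rank": 720, "title": "archives"},
]
by_rank = [h["title"] for h in sort_hits(hits)]
by_title = [h["title"] for h in sort_hits(hits, by="title")]
print(by_rank)
print(by_title)
```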


This is so far all done using the standard swish-e code. What I'd really
like to see are some enhancements to swish-e to make all this a bit
simpler:

1. an option to have dc.title replace title in the index;

2. an option to have dc.identifier replace file name in the index;

3. an option to limit indexing to just the HEAD part of an html file
(i.e. to limit indexing to the metadata);

4. an option to recognise and index RDF data in the same ways.
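
To make the third wish concrete, here is a small Python sketch of what
"HEAD only" extraction might look like. It says nothing about how swish-e
would implement such an option; the class and function names are invented
for the example, and it uses only the standard html.parser module.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect <meta name=... content=...> pairs found inside <head> only."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.done = False
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # everything after </head> is ignored
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            a = dict(attrs)
            if a.get("name") and a.get("content") is not None:
                self.meta[a["name"].lower()] = a["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
            self.done = True


def head_metadata(html_text):
    """Return the meta name/content pairs from the HEAD of an HTML string."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    return parser.meta


sample = """<html><head>
<title>Some file</title>
<meta name="DC.Identifier" content="/web/some_file.xxx">
<meta name="DC.Description" content="An example document">
</head><body><meta name="DC.Description" content="ignored body text"></body></html>"""
meta = head_metadata(sample)
print(meta)
```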


Steve
Received on Sat Mar 11 00:29:41 2000