RE: New cgi script using metadata

From: PropheZine Owner <bob(at)not-real.prophezine.com>
Date: Fri Mar 10 2000 - 16:40:02 GMT
Hi:

I was wondering if you could take a step backwards and explain Metadata.
Not sure what that is.

Bob

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Steve Thomas
Sent: Thursday, March 09, 2000 6:50 PM
To: Multiple recipients of list
Subject: [SWISH-E] New cgi script using metadata


Others might be interested in what I've been doing with swish-e and
metadata. Primarily, what I wanted to do was to be able to create files
of metadata, index them with swish-e, but have the search results point
users to the actual files described by the metadata, rather than the
metadata files.

Does that make sense? Here's a diagram:


  /web/some_file.xxx  ---- is described by ----> /meta/some_file.html
           ^                                             |
           |                                             |
      search points to                         is indexed by swish-e
           |                                             |
           |                                             v
  /cgi-bin/swish-cgi.pl   ---- searches ---->      index.swish


I've been able to do that using the PropertyNames option, with Dublin
Core metadata. Specifically, the swish-e config file contains the line:

PropertyNames DC.Identifier DC.Description

Using an extensively hacked copy of swish-cgi.pl, I've used the
dc.identifier value to replace the file path as URL for the item
returned. I'm also displaying the dc.description returned as a
description of the item. If there's no dc data returned, the script just
uses the actual file path from the index.
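
The fallback described above could be sketched roughly as follows. This is not the code from swish-cgi.pl (which is Perl); it is a minimal Python illustration, and the layout of the `result` dictionary and the name `resolve_link` are assumptions made up for the example.

```python
def resolve_link(result):
    """Pick the URL and description to display for one search hit.

    `result` is a hypothetical dict: the indexed file path is under
    "path", and any properties returned by the search are under
    "properties" (keyed by lower-case property name).
    """
    props = result.get("properties", {})
    # Prefer dc.identifier as the link target; otherwise fall back to
    # the path of the file that was actually indexed.
    url = props.get("dc.identifier") or result["path"]
    # Show dc.description when it is present, else an empty description.
    description = props.get("dc.description") or ""
    return url, description
```

A hit with no dc data simply links to its own indexed path, which matches the behaviour described above.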

This all seems to work well enough. If you are interested, there's a
sample index you can search at
http://www.library.adelaide.edu.au/~sthomas/scripts/swish/search.html
and you can see the cgi script at
http://www.library.adelaide.edu.au/~sthomas/scripts/swish/swish-cgi.pl

Sample searches would be "online books" or "electronic resources".

Why?

There are several advantages to this:

1. You can create metadata for non-html files (e.g. pdf, jpeg, exe) to
describe the file, and thus include the file in your index.

2. You can (by indexing only the metadata files) tightly control exactly
which files you index -- the average web site contains lots of pages you
might not want indexed, mixed in with those you do. This makes it easy
to control which are included.

3. You can create metadata files for pages on another web site (e.g. the
online books example will return many references to pages at other
sites). This allows you to present other resources to your user in
addition to your own.
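
To make the idea of a "metadata file" concrete, here is a small sketch that writes one out. It assumes the properties are carried in ordinary HTML meta tags in the HEAD of the file, named to match the PropertyNames line above; the function name `build_metadata_page` and the sample values are hypothetical.

```python
from html import escape


def build_metadata_page(title, identifier, description):
    """Return a minimal HTML page describing some other file.

    The page carries only metadata: DC.Identifier holds the URL of the
    real file, DC.Description a short summary of it.
    """
    lines = [
        "<html>",
        "<head>",
        f"<title>{escape(title)}</title>",
        # escape() also converts quotes, so values are safe in attributes.
        f'<meta name="DC.Identifier" content="{escape(identifier)}">',
        f'<meta name="DC.Description" content="{escape(description)}">',
        "</head>",
        "<body></body>",
        "</html>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: describe a PDF that cannot be indexed directly.
    print(build_metadata_page("Some file", "/web/some_file.pdf",
                              "A sample PDF document"))
```

One such page per described file would go in the directory that gets indexed (the /meta directory in the diagram), whether the file it describes is local or on another site.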


The script is extensible, in that several things have been added as
options in the form. First, the properties can be defined (in any
order), so you can include other things, e.g. dc.author (and you are
not limited to Dublin Core either). Second, you can have multiple
indexes defined in an indexpath field. For example, you could use this
to let users select from a range of possible indexes. Third, there's an
option to sort by title instead of rank. See the script for details.
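
As an illustration of the third option only, sorting by title rather than rank might look like the sketch below. The `title` and `rank` keys and the function name `sort_results` are assumptions for the example and are not taken from the script.

```python
def sort_results(results, by_title=False):
    """Order search hits for display.

    `results` is a hypothetical list of dicts with "title" and "rank"
    keys. By default the highest rank comes first; with by_title=True
    the hits are ordered alphabetically, ignoring case.
    """
    if by_title:
        return sorted(results, key=lambda r: r["title"].lower())
    return sorted(results, key=lambda r: r["rank"], reverse=True)
```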


This is so far all done using the standard swish-e code. What I'd really
like to see are some enhancements to swish-e to make all this a bit
simpler:

1. an option to have dc.title replace title in the index;

2. an option to have dc.identifier replace file name in the index;

3. an option to limit indexing to just the HEAD part of an html file
(i.e. to limit indexing to the metadata);

4. an option to recognise and index rdf data, in the same ways.


Steve
Received on Fri Mar 10 11:46:38 2000