Re: [SWISH-E:313] Re: Indexing file context?

From: Roy Tennant <rtennant(at)not-real.library.berkeley.edu>
Date: Fri Jun 05 1998 - 14:16:40 GMT
On Thu, 4 Jun 1998, Dan Brickley wrote:

> On Thu, 4 Jun 1998, Roy Tennant wrote:
> > what I'm talking about. I'd much prefer to put abstracts in my files and
> > fetch those. 
> 
> Me too. Has anybody done any work along these lines? eg. building
> something like a cache of extracted metadata from the indexed pages, so
> that result-sets could include Title/Description/Keywords/Subject etc.
> based on contents of META tags? Extracting these manually each time a
> query occurs would presumably be a little inefficient.

I understand and support the desire to be efficient, as long as it
doesn't prevent useful projects from happening. I've cobbled together
some terribly inefficient projects using SWISH-E and Perl that are
nonetheless effective. In one of them, I used SWISH-E to index around
18,000 files that consist of nothing but META tags and their contents.
For example:

<LINK REL="SCHEMA.dc" HREF="http://purl.org/metadata/dublin_core">
<META NAME="DC.publisher" CONTENT="The Library, University of California,
Berkeley">
<META NAME="DC.creator" CONTENT="National Information Standards
Organization (NISO) ">
<META NAME="DC.title" CONTENT="Serial Item and Contribution Identifier
Standard">
<META NAME="DC.identifier" CONTENT="http://sunsite.Berkeley.EDU/SICI/">
<META NAME="DC.description" CONTENT="The SICI standard (ANSI/NISO
Z39.560-1996, Version 2)provides an extensible mechanism for the unique
identification of either an issue of a serial title or a contribution
(e.g., article) contained within a serial, regardless of the distribution
medium (paper, electronic, microform,etc.). ">
<META NAME="DC.type" CONTENT="text">
<META NAME="DC.language" CONTENT="eng">
<META NAME="DC.date" CONTENT="1997">
<META NAME="DC.relation" CONTENT="Online version of the paper document
published by the National Information Standards Organization (NISO) .">
<META NAME="DC.rights" CONTENT="Copyright (c) 1997 ANSI/NISO.">
<META NAME="DC.format" CONTENT="text/html">
<META NAME="DC.subject" CONTENT="standard">
<META NAME="DC.subject" CONTENT="identifiers">

This means that when a search is performed, I must *parse every hit* in
order to extract the information for display -- and this is in Perl, not
a compiled language. To see the response time for a search that retrieves
364 items, follow the URL below (or run your own search at
http://sunsite.berkeley.edu/ImageCatalog/):

http://sunsite.Berkeley.EDU/cgi-bin/searchimages.pl?keyword=mission&database=catalog-z.swish&display=briefplus&displaynum=10&results=0
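
To give a feel for the per-hit parsing step, here is a simplified sketch
of the extraction routine. It is not the actual code inside
searchimages.pl; the extract_meta subroutine and its regular expression
are an illustration only, and they assume each META tag sits on a single
line (the wrapped CONTENT values in the example above would need a more
forgiving parse).

# Simplified sketch of the per-hit parsing step, not the actual
# searchimages.pl code.  For each file SWISH-E returns, reread it and
# pull out the META fields we want to display.
sub extract_meta {
    my ($file) = @_;
    my %meta;
    open(HIT, $file) or return ();
    while (<HIT>) {
        if (/<META\s+NAME="([^"]+)"\s+CONTENT="([^"]*)"/i) {
            # Repeated fields (e.g. DC.subject) get joined for display.
            $meta{$1} = defined $meta{$1} ? "$meta{$1}; $2" : $2;
        }
    }
    close HIT;
    return %meta;
}

# For each hit in the SWISH-E result list:
#   my %fields = extract_meta($hitfile);
#   print "$fields{'DC.title'}: $fields{'DC.description'}\n";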

You will find that, aside from the time it takes the Web client to
download the images, the initial search takes the longest chunk of time,
partly because that is when I write out a temporary file that speeds up
the response when another page of results is requested. Even so, I think
you will find the response time decent, particularly for a prototype.
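
The temporary file is nothing fancy, by the way. The sketch below shows
the general idea; the /tmp path, the session key, and the
one-hit-per-line format are invented for illustration rather than lifted
from the actual script.

# Sketch of the temporary-file trick (names and format invented).
# On the first search, every formatted hit is written to a temp file
# keyed by a session id; later "next page" requests read from that
# file instead of re-running the search and re-parsing every hit.
sub cache_results {
    my ($session, @formatted_hits) = @_;
    my $tmp = "/tmp/swish-results.$session";
    open(CACHE, "> $tmp") or die "can't write $tmp: $!";
    print CACHE "$_\n" foreach @formatted_hits;
    close CACHE;
}

sub page_of_results {
    my ($session, $start, $count) = @_;
    my $tmp = "/tmp/swish-results.$session";
    open(CACHE, $tmp) or return ();
    my @page;
    while (<CACHE>) {
        next if $. <= $start;       # skip hits shown on earlier pages
        last if @page >= $count;    # stop after one screenful
        chomp;
        push @page, $_;
    }
    close CACHE;
    return @page;
}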

So as much as I may work to improve efficiency, I too often run into
people who would not do something *at all* because it is too inefficient.
In my opinion, good enough is often just that -- good enough. And CPU
cycles don't do you one bit of good until you burn them.

Roy Tennant
Received on Fri Jun 5 07:24:16 1998