Re: Probs with xml-marc format

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Feb 13 2004 - 00:05:32 GMT

On Thu, Feb 12, 2004 at 02:58:12PM -0800, Thoreau Lovell wrote:

> We get a list of Journals for which we have online access to fulltext 
> articles from a vendor in either html or xml. We're talking, say 20 - 
> 40,000 journals. The list is exported as separate docs for each letter of 
> the alphabet, where A--.html has all the journals that start with the 
> letter "A".

Ok, so is there ONE journal entry per *.html file, or does a given html
file contain more than one entry?

> The problem is how the found set is returned. Searching for American 
> Chiropractor, for instance, tells me that the journal is found in A--.html. 
> But I can't get Swish-e to return any of the more useful data elements: 
> Journal title, ISSN, Coverage, Source, which are all present in the indexed 

I hope I'm understanding your problem.

Swish-e indexes single documents.  It sounds like you are trying to feed
it some xml that contains more than one "document".  Swish-e (currently)
does not have the feature to split up a multi-record document into the
individual parts.

What you likely want to do is use swish-e's "-S prog" feature, where an
external program feeds "documents" to swish.  Your external program
would parse the xml using either a SAX or DOM parser, format each
record into a document, and feed it to swish.
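
As a rough illustration of that idea, here is a minimal Python sketch of
such an external program.  It assumes the vendor file wraps un-namespaced
<record> elements in a single root element, and that swish-e's prog input
takes a "Path-Name:" and "Content-Length:" header, a blank line, and then
the document body.  The names split_records and prog_document, the
"file#N" path scheme, and the "Document-Type: XML*" header are choices
made for this example, so check them against the swish-e docs before
relying on them.

```python
import sys
import xml.etree.ElementTree as ET


def split_records(xml_bytes):
    """Yield each <record> element of a multi-record file as its own XML string."""
    root = ET.fromstring(xml_bytes)
    # iter() also matches the root itself, so a lone <record> file works too.
    for record in root.iter("record"):
        yield ET.tostring(record, encoding="unicode")


def prog_document(path_name, body):
    """Frame one document for swish-e: headers, a blank line, then the body."""
    data = body.encode("utf-8")
    header = (
        "Path-Name: %s\n"
        "Content-Length: %d\n"       # length of the body in bytes
        "Document-Type: XML*\n"      # assumed parser hint; optional
        "\n" % (path_name, len(data))
    )
    return header.encode("utf-8") + data


def main(xml_path):
    with open(xml_path, "rb") as fh:
        xml_bytes = fh.read()
    out = sys.stdout.buffer
    for n, body in enumerate(split_records(xml_bytes), start=1):
        # Hypothetical path scheme: source file name plus record number.
        out.write(prog_document("%s#%d" % (xml_path, n), body))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Swish-e would then run the script through its prog input method, so each
journal record shows up as its own result instead of the whole A--.xml
file.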

Swish-e doesn't do that now -- it could, I suppose, but since there are
so many good tools to do the parsing externally it makes more sense to
use those.

An example of this setup is with the swish-e docs:

  http://swish-e.org/current/docs/searchdoc.html

In this case it's breaking up the source HTML docs into sections and
indexing them separately.  Search for something like "installation" and
you can see that you might get more than one result for a given page.

> files. This seems like a situation where the structured nature of XML 
> should be useful, so I've focused on working with XML Docs.

Seems reasonable.  You just need to parse it into chunks.  Do you have a
favorite language?

> One problem may be that the format the vendor uses is xml-marc, which
> seems to give Swish-e some trouble. Here's a snippet of what the data
> looks like:

What does "trouble" mean?

> 
>   <record>
> <leader>-----nas-a22-----z--4500</leader>
> -<datafield tag="022" ind1="" ind2="">
>          <subfield code="a">0194-6536</subfield>
> </datafield>
> -<datafield tag="245" ind1="" ind2="4">
>          <subfield code="a">The American chiropractor</subfield>
> </datafield>
> -<datafield tag="210" ind1="" ind2="">
>          <subfield code="a">AMERICAN CHIROPRACTOR</subfield>
> </datafield>
> -<datafield tag="090" ind1="" ind2="">
>          <subfield code="a">110978978735405</subfield>
> </datafield>
> -<datafield tag="866" ind1="" ind2="">
>          <subfield code="x">Alt-HealthWatch:Full Text</subfield>
>          <subfield code="a"> Availability: from 1998</subfield>
> </datafield>
> </record>

I don't see any problem with that.  If you format as HTML you can affect
the ranking a bit (i.e. words inside <title> would get ranked higher
than words in <body>).


> I've experimented with XMLClassAttributes and UndefinedXMLAttributes, 
> without much luck.

No, those are more for pulling text out of attributes (and what to do
with them).


> What I'd like is to see is a search result like this:
> 
>   AMERICAN CHIROPRACTOR (0194-6536)
>          Alt-HealthWatch:Full Text
>          Availability: from 1998

There's a few ways to do this, but you could format as:

<html><head>
<title>AMERICAN CHIROPRACTOR</title>
<meta name="022" content="0194-6536">
<meta name="866.a" content="Availability: from 1998">
<meta name="866.x" content="Alt-HealthWatch:Full Text">
</head>
<body>
</body>
</html>

Then use MetaNames to define what fields to search for.  Use
UndefinedMetaTags to define what to do with meta content that is not
listed in MetaNames.

And use

   PropertyNames 022 866.a 866.x

to store the text for display on search results.
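
If it helps, the HTML above could be generated from a parsed record with
something like the following Python sketch.  It is only one possible
mapping, under these assumptions: the title comes from field 210, a field
with a single subfield is named by its tag alone (022), and a field with
several subfields is named tag.code (866.a, 866.x).  The function name
record_to_html and these naming rules are invented for the example; they
are not something swish-e requires.

```python
import xml.etree.ElementTree as ET
from html import escape


def record_to_html(record):
    """Turn one parsed <record> element into an HTML page with <meta> tags."""
    fields = {}
    for datafield in record.iter("datafield"):
        tag = datafield.get("tag")
        subfields = list(datafield.iter("subfield"))
        for sub in subfields:
            # Assumed naming rule: "022" for a lone subfield, "866.a" otherwise.
            if len(subfields) == 1:
                name = tag
            else:
                name = "%s.%s" % (tag, sub.get("code"))
            fields[name] = (sub.text or "").strip()

    title = fields.pop("210", "")  # assumed: 210 holds the display title
    metas = "\n".join(
        '<meta name="%s" content="%s">' % (escape(name), escape(value))
        for name, value in sorted(fields.items())
    )
    return (
        "<html><head>\n"
        "<title>%s</title>\n"
        "%s\n"
        "</head>\n"
        "<body>\n"
        "</body>\n"
        "</html>\n" % (escape(title), metas)
    )
```

Every other field in the record (245, 090, and so on) also becomes a meta
tag here, so what happens to those depends on how undefined meta names are
handled in the config.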


Sure hope I'm answering the right question. ;)

-- 
Bill Moseley
moseley@hank.org
Received on Thu Feb 12 16:05:33 2004