Re: Probs with xml-marc format

From: Thoreau Lovell <tlovell(at)not-real.sfsu.edu>
Date: Fri Feb 13 2004 - 17:53:46 GMT
Bill / Fritz

You're right, each html/xml doc includes records for multiple journals. I 
think I'll try to write a script to break these up so that there is one 
journal per doc. What I meant when I said swish-e had trouble with the 
xml-marc format is that it recognizes <datafield> and <subfield> as 
elements, but not the xml-marc identifiers such as "022", "245", and "210", 
which are ISSN, Title, and Alt Title. The problem is that these identifiers 
are attribute values on <datafield> / <subfield>, which swish-e doesn't seem 
to be able to distinguish. Do you think using MetaNames and 
UndefinedMetaNames will solve this problem once each journal is in a 
separate file?
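
One way the tag/code identifiers could be turned into names swish-e can see 
is to rewrite each record as a small HTML document with one meta tag per 
subfield. What follows is only a sketch under assumptions: the function name 
record_to_html, the "tag.code" naming scheme (e.g. "866.a"), and the choice 
of 245.a as the title are illustrative, not something the vendor or swish-e 
prescribes, and the records are assumed to carry no XML namespace, as in the 
snippet quoted below.

```python
import xml.etree.ElementTree as ET
from html import escape


def record_to_html(record):
    """Turn one MARC-XML <record> element into a small HTML document.

    Each subfield becomes <meta name="TAG.CODE" content="...">.
    The TAG.CODE naming is an assumption made for this sketch.
    """
    metas = []
    title = ""
    for field in record.iter("datafield"):
        tag = field.get("tag", "")
        for sub in field.iter("subfield"):
            name = f"{tag}.{sub.get('code', '')}"
            text = (sub.text or "").strip()
            if name == "245.a":  # assumed: 245.a holds the journal title
                title = text
            metas.append(
                f'<meta name="{escape(name)}" content="{escape(text)}">'
            )
    return (
        "<html><head>\n"
        f"<title>{escape(title)}</title>\n"
        + "\n".join(metas)
        + "\n</head><body></body></html>\n"
    )


# Two fields taken from the record quoted later in this message.
sample = """<record>
<datafield tag="022" ind1="" ind2="">
  <subfield code="a">0194-6536</subfield>
</datafield>
<datafield tag="245" ind1="" ind2="4">
  <subfield code="a">The American chiropractor</subfield>
</datafield>
</record>"""

print(record_to_html(ET.fromstring(sample)))
```

Names produced this way would then have to be listed in MetaNames and 
PropertyNames in whatever form the index config expects; note that this 
sketch always emits "022.a", not a bare "022".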

In any case it looks like I'm about to get a quick introduction to xml 
transformations!


Thanks again,

Thoreau
At 04:04 PM 2/12/2004 -0800, Bill Moseley wrote:
>On Thu, Feb 12, 2004 at 02:58:12PM -0800, Thoreau Lovell wrote:
>
> > We get a list of Journals for which we have online access to fulltext
> > articles from a vendor in either html or xml. We're talking, say 20 -
> > 40,000 journals. The list is exported as separate docs for each letter of
> > the alphabet, where A--.html has all the journals that start with the
> > letter "A".
>
>Ok, so is there ONE journal entry per *.html file, or does a given html
>file contain more than one entry?
>
> > The problem is how the found set is returned. Searching for American
> > Chiropractor, for instance, tells me that the journal is found in
> > A--.html. But I can't get Swish-e to return any of the more useful
> > data elements: Journal title, ISSN, Coverage, Source, which are all
> > present in the indexed
>
>I hope I'm understanding your problem.
>
>Swish-e indexes single documents.  It sounds like you are trying to feed
>it some xml that contains more than one "document".  Swish-e (currently)
>does not have the feature to split up a multi-record document into the
>individual parts.
>
>What you likely want to do is use swish-e's "-S prog" feature where an
>external program feeds "documents" to swish.  So your external program
>would parse the xml using either a SAX or DOM parser, format each
>record into a document, and feed it to swish.
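
A minimal sketch of such an external program, assuming the vendor file wraps 
un-namespaced <record> elements as in the snippet quoted further down. The 
function name emit_records and the "file#n" Path-Name scheme are 
hypothetical. The Path-Name and Content-Length headers, the optional 
Document-Type header, and the blank line before the content follow the 
header format swish-e documents for "-S prog" input; check the docs for the 
installed version before relying on this.

```python
import sys
import xml.etree.ElementTree as ET


def emit_records(xml_text, source_name, out):
    """Write each <record> in xml_text to `out` as its own document.

    `out` must be a binary stream; Content-Length is a byte count.
    Returns the number of records written.
    """
    root = ET.fromstring(xml_text)
    count = 0
    for n, record in enumerate(root.iter("record"), start=1):
        record.tail = None  # drop whitespace that followed the element
        body = ET.tostring(record, encoding="utf-8")
        header = (
            f"Path-Name: {source_name}#{n}\n"  # hypothetical naming scheme
            f"Content-Length: {len(body)}\n"
            "Document-Type: XML*\n"
            "\n"
        )
        out.write(header.encode("utf-8"))
        out.write(body)
        count = n
    return count


if __name__ == "__main__":
    # Each argument is assumed to be one vendor export file.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            emit_records(fh.read(), path, sys.stdout.buffer)
```

Each record then shows up as a separate search result, so a hit points at 
one journal and not at the whole A--.html file.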
>
>Swish-e doesn't do that now -- it could, I suppose, but since there are
>so many good tools to do the parsing externally, it makes more sense to
>use those.
>
>An example of this setup is with the swish-e docs:
>
>   http://swish-e.org/current/docs/searchdoc.html
>
>In this case it's breaking up the source HTML docs into sections and
>indexing them separately.  Search for something like "installation" and
>you can see that you might get more than one result for a given page.
>
> > files. This seems like a situation where the structured nature of XML
> > should be useful, so I've focused on working with XML Docs.
>
>Seems reasonable.  You just need to parse it into chunks.  Do you have a
>favorite language?
>
> > One problem may be that the format the vendor uses is xml-marc, which
> > seems to give Swish-e some trouble. Here's a snippet of what the data
> > looks like:
>
>What does "trouble" mean?
>
> >
> > <record>
> >   <leader>-----nas-a22-----z--4500</leader>
> >   <datafield tag="022" ind1="" ind2="">
> >     <subfield code="a">0194-6536</subfield>
> >   </datafield>
> >   <datafield tag="245" ind1="" ind2="4">
> >     <subfield code="a">The American chiropractor</subfield>
> >   </datafield>
> >   <datafield tag="210" ind1="" ind2="">
> >     <subfield code="a">AMERICAN CHIROPRACTOR</subfield>
> >   </datafield>
> >   <datafield tag="090" ind1="" ind2="">
> >     <subfield code="a">110978978735405</subfield>
> >   </datafield>
> >   <datafield tag="866" ind1="" ind2="">
> >     <subfield code="x">Alt-HealthWatch:Full Text</subfield>
> >     <subfield code="a"> Availability: from 1998</subfield>
> >   </datafield>
> > </record>
>
>I don't see any problem with that.  If you format as HTML you can affect
>the ranking a bit (i.e. words inside <title> would get ranked higher
>than words in <body>).
>
>
> > I've experimented with XMLClassAttributes and UndefinedXMLAttributes,
> > without much luck.
>
>No, those are more for pulling text out of attributes (and what to do
>with them).
>
>
> > What I'd like to see is a search result like this:
> >
> >   AMERICAN CHIROPRACTOR (0194-6536)
> >          Alt-HealthWatch:Full Text
> >          Availability: from 1998
>
>There are a few ways to do this, but you could format as:
>
>
>Then use MetaNames to define what fields to search for.  Use
>UndefinedMetaNames to define what to do with meta content that is not
>listed in MetaNames.
>
>And use
>
>    PropertyNames 022 866.a 866.x
>
>to store the text for display on search results.
>
>
>Sure hope I'm answering the right question. ;)
>
>--
>Bill Moseley
>moseley@hank.org

Thoreau Lovell
Digital Systems Design and Development Coordinator
J. Paul Leonard Library, San Francisco State University
415-338-2285 | tlovell@sfsu.edu  
Received on Fri Feb 13 09:53:48 2004