Re: Novice question: unknown MetaNames error

From: Julie Wetherill <julie(at)not-real.gentoo.harvard.edu>
Date: Fri Jan 16 2004 - 20:35:00 GMT
Thanks for the comments. Actually, we normally use an indexing script but 
for testing I was doing all this from the command line.

Thanks also for pointing out the typo (dc.description versus 
dc:description). This is what happens when you do too much testing on too 
much caffeine.

Fixed that. Now we are down to the crux of the problem. I think something 
is happening in the PDF filtering. When I view PDF metadata from within 
Acrobat, the tagging used is definitely "dc:description". But when 
MetaNames is set to "dc:description" I still cannot retrieve on this 
metaname (but I can see that the metatag'd words are indexed as 
swishdefault). I'm now thinking the problem is in the filtering script for 
PDF. Perhaps it is filtering out or in some other way mucking with this 
meta tag (dc:description). I looked at the output of the filtering script 
and I can't find ANY of the descriptive metadata tags that were part of the 
PDF. What I do find are these:

<meta name="author" content="julie">
<meta name="creationdate" content="Fri Jan 16 12:19:33 2004">

and a bunch of similar tags with this syntax. So the filter is spewing out 
metadata tags that I don't even see in the metadata source view provided by 
Acrobat, in a syntax (<meta name= ...>) not used in the Acrobat view. And, 
it appears that the filter is extracting contents of dc:description and 
appending it to the title, 'cause this is what I see in the <head> of the 
xml output:

<title>Testing Dublin Core in PDF // olympus mons</title>

"Olympus mons" are the keywords that started out in the metadata tag 
<dc:description>olympus mons</dc:description> that I see when viewing 
metadata source in Acrobat.
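
One way to confirm what the filter really emits is to list every <meta> name and the <title> text in its output. The following is only a minimal sketch in Python, assuming the filter output has been captured as text. `MetaCollector`, `summarize_filter_output`, and `sample` are illustrative names, and `sample` is pieced together from the fragments quoted above, not real filter output.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.metas = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            # A valueless content attribute is reported as None; store "" instead.
            self.metas.append((attrs["name"], attrs.get("content") or ""))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_filter_output(text):
    """Return (title, [(meta_name, content), ...]) found in the given text."""
    parser = MetaCollector()
    parser.feed(text)
    parser.close()
    return parser.title, parser.metas


# Hypothetical sample, assembled from the fragments quoted in this message.
sample = """<html><head>
<title>Testing Dublin Core in PDF // olympus mons</title>
<meta name="author" content="julie">
<meta name="creationdate" content="Fri Jan 16 12:19:33 2004">
</head><body></body></html>"""

title, metas = summarize_filter_output(sample)
names = [name.lower() for name, _ in metas]
print("title:", title)
print("meta names:", names)
print("dc:description present:", "dc:description" in names)
```

Passing the real filter output in place of `sample` would show whether any dc:description meta tag survives the filter, or whether its content only turns up in the title.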

Argh. Does anyone know a way around this? Is there another/better PDF 
filter we could use with SWISH-e? Thanks.  --julie

At 02:24 PM 1/16/2004, you wrote:
>Hi Julie,
>
>One more comment.  Seems you are using a config file from an old version
>of swish-e.  It's a common thing to do, it seems.
>
>There used to be a config file with most if not all the
>options defined -- and some incorrectly.  For example, that default
>config contains both IndexOnly and NoContents -- but the NoContents
>includes files that will never be touched because of the IndexOnly.
>
>My suggestion (and that's all it is) is to have a config file with just
>a few things that need to be changed from the default.
>
>So your 160 line config file below could be reduced to:
>
>     MetaNames dc.description dc.title dc.creator
>     PropertyNames dc.description dc.title dc.creator
>     ReplaceRules remove /home/hul/htdocs
>     IndexOnly .pdf .html
>
>     # Filter PDF
>     # See http://swish-e.org/current/docs/INSTALL.html#Filtering_Overview
>     # for a possibly faster, and better supported way
>     FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl
>
>     # Skip these (pathname may match a file -- do you mean dirname?)
>     FileRules pathname contains BudgRep
>
>I just think that's easier to manage.  But, again, that's just my
>preference.
>
>Then use an indexing script that does something like:
>
>     #!/bin/sh
>     echo "Indexing Aleph staff documentation"
>     swish-e -c /path/to/config \
>             -i /home/hul/htdocs/ois/systems/aleph/docs/test/ \
>             -f /usr/local/apache/swish-indexes/metadata3.index \
>             -v0
>
>And only generates output when there's a problem.  You can put the
>paths in the config file if you like (and can override with command line
>options).
>
>--
>Bill Moseley
>moseley@hank.org


===============================================================
Julie Wetherill
Office for Information System
Harvard University Library
1280 Massachusetts Ave., Suite 404
Cambridge, MA 02138

ph 617.495.3724
fx 617.495.0491
julie_wetherill@harvard.edu

"Outside of a dog, a book is a man's best friend.
    Inside of a dog, it's too dark to read." --Groucho Marx




Received on Fri Jan 16 20:36:33 2004