Thanks for the comments. Actually, we normally use an indexing script but
for testing I was doing all this from the command line.
Thanks also for pointing out the typo (dc.description versus
dc:description). This is what happens when you do too much testing on too
much caffeine.
Fixed that. Now we are down to the crux of the problem. I think something
is happening in the PDF filtering. When I view PDF metadata from within
Acrobat, the tagging used is definitely "dc:description". But when
MetaNames is set to "dc:description" I still cannot retrieve on this
metaname (but I can see that the metatag'd words are indexed as
swishdefault). I'm now thinking the problem is in the filtering script for
PDF. Perhaps it is filtering out or in some other way mucking with this
meta tag (dc:description). I looked at the output of the filtering script
and I can't find ANY of the descriptive metadata tags that were part of the
PDF. What I do find are these:
<meta name="author" content="julie">
<meta name="creationdate" content="Fri Jan 16 12:19:33 2004">
and a bunch of similar tags with this syntax. So the filter is spewing out
metadata tags that I don't even see in the metadata source view provided by
Acrobat, in a syntax (<meta name= ...>) not used in the Acrobat view. And,
it appears that the filter is extracting contents of dc:description and
appending it to the title, 'cause this is what I see in the <head> of the
xml output:
<title>Testing Dublin Core in PDF // olympus mons</title>
"Olympus mons" are the keywords that started out in the metadata tag
<dc:description>olympus mons</dc:description> that I see when viewing
metadata source in Acrobat.
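One way to check what the filter really emits, independent of swish-e, is to run it by hand and list the meta names and the title it produces. The following is only a minimal sketch and not part of swish-e: the function names list_meta_names and show_title are made up for illustration, and it assumes the filter writes HTML to standard output and that grep supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helpers for inspecting filter output; not part of swish-e.

# Print the unique name attribute of every <meta> tag read from stdin.
list_meta_names() {
    grep -io '<meta[^>]*name="[^"]*"' |
        sed 's/.*[Nn][Aa][Mm][Ee]="\([^"]*\)"/\1/' |
        sort -u
}

# Print the text of the first <title> element read from stdin
# (assumes the whole element sits on one line).
show_title() {
    sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee]>\(.*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p' |
        head -n 1
}

# Example (test.pdf is a placeholder; assumes the filter takes the PDF path
# as its first argument and writes HTML to stdout):
#   /usr/local/apache/swish/filter-bin/_pdf2html.pl test.pdf | list_meta_names
#   /usr/local/apache/swish/filter-bin/_pdf2html.pl test.pdf | show_title
```

If dc:description never shows up in that list, the tag is most likely being lost in the filter rather than in the indexing step, which would be consistent with the words only landing in swishdefault.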
Argh. Does anyone know a way around this? Is there another/better PDF
filter we could use with SWISH-e? Thanks. --julie
At 02:24 PM 1/16/2004, you wrote:
>Hi Julie,
>
>One more comment. Seems you are using a config file from an old version
>of swish-e. It's a common thing to do, it seems.
>
>There used to be a config file with most if not all the
>options defined -- and some incorrectly. For example, that default
>config contains both IndexOnly and NoContents -- but the NoContents
>includes files that will never be touched because of the IndexOnly.
>
>My suggestion (and that's all it is) is to have a config file with just
>a few things that need to be changed from the default.
>
>So your 160 line config file below could be reduced to:
>
> MetaNames dc.description dc.title dc.creator
> PropertyNames dc.description dc.title dc.creator
> ReplaceRules remove /home/hul/htdocs
> IndexOnly .pdf .html
>
> # Filter PDF
> # See http://swish-e.org/current/docs/INSTALL.html#Filtering_Overview
> # for a possibly faster, and better supported way
> FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl
>
> # Skip these (pathname may match a file -- do you mean dirname?)
> FileRules pathname contains BudgRep
>
>I just think that's easier to manage. But, again, that's just my
>preference.
>
>Then use an indexing script that does something like:
>
> #!/bin/sh
> echo "Indexing Aleph staff documentation"
> swish-e -c /path/to/config \
> -i /home/hul/htdocs/ois/systems/aleph/docs/test/ \
> -f /usr/local/apache/swish-indexes/metadata3.index \
> -v0
>
>That way it only generates output when there's a problem. You can put the
>paths in the config file if you like (and can override with command line
>options).
>
>--
>Bill Moseley
>moseley@hank.org
===============================================================
Julie Wetherill
Office for Information Systems
Harvard University Library
1280 Massachusetts Ave., Suite 404
Cambridge, MA 02138
ph 617.495.3724
fx 617.495.0491
julie_wetherill@harvard.edu
"Outside of a dog, a book is a man's best friend.
Inside of a dog, it's too dark to read." --Groucho Marx
Received on Fri Jan 16 20:36:33 2004