>
> > Anyway, I do have a related problem that maybe you can explain. I need to
> > search on metadata embedded in PDFs. Adobe uses Dublin Core tags
> > (dc:description, dc:title, dc:creator). I can't get swish-e to recognize
> > these as metanames (whether they are in PDFs or in HTML).
>
>$ cat c
>MetaNames dc:description
>
>$ cat 1.html
>hello
>$ swish-e -c c -i 1.html -v0 -T indexed_words
> Adding:[1:swishdefault(1)] 'b' Pos:2 Stuct:0x7 ( HEAD TITLE FILE )
> Adding:[1:swishdefault(1)] 'title' Pos:3 Stuct:0x7 ( HEAD TITLE FILE )
> Adding:[1:dc:description(10)] 'foo' Pos:6 Stuct:0x85 ( META HEAD FILE )
> Adding:[1:swishdefault(1)] 'hello' Pos:9 Stuct:0x9 ( BODY FILE )
>
>$ swish-e -w dc:description=foo
># SWISH format: 2.4.1
># Search words: dc:description=foo
># Removed stopwords:
># Number of hits: 1
># Search time: 0.001 seconds
># Run time: 0.043 seconds
>1000 1.html "<b>title" 125
>.
>
>
> > Warning: Substituted possible embedded null character(s) in file
> > '/home/hul/htdocs/ois/systems/aleph/docs/test/serial_claiming_in_Aleph.pdf'
>
>Looks like you are not filtering the pdf files.

Oops. When I truncated the config file for testing, I dropped the filtering
directive. But it is there in the full config file:

FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl

and I still get no results when trying to retrieve by dc:description. I
copied the whole session below. So maybe there is a problem with the PDF
filter? --julie

sylvia{julie}15: cat metadata3.conf
# DIRECTIVES COMMON to HTTP and FILESYSTEM METHODS
###################################################
IndexDir /home/hul/htdocs/ois/systems/aleph/docs/test/
# For the FileSystem Method:
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.
#
# For the HTTP Method:
# Use the URL's from which you want the spidering
# to begin.
# NOTE: use html files rather than directories
# for this method.
IndexFile /usr/local/apache/swish-indexes/metadata3.index
# This is what the generated index file will be.
IndexName "Aleph document index"
IndexDescription "Index of Aleph staff documentation"
#IndexPointer "http://sunsite/~ghill/swish/index.html"
#IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
# Extra information you can include in the index file.
MetaNames dc:description title creator
# List of all the meta names used in the file to index, must be on one line.
# If there are no metanames, DO NOT delete the line.
PropertyNames dc:description title creator
IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.
FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".
ReplaceRules remove "/home/hul/htdocs/"
#ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
#ReplaceRules replace "/home/oisprivate/htdocs/" "/"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed. This directive uses C library
# regex.h regular expressions.
# NOTE: do not use replace <string> "" to remove a string,
# use remove <string> instead - you might get a core dump otherwise.
#MinWordLimit 5
# Set the minimum length of an indexable word. Every shorter word
# will not be indexed.
# Commenting out the line will give the defaults
#MaxWordLimit 5
# Set the maximum length of an indexable word. Every longer word
# will not be indexed.
# Commenting out the line will give the defaults
#WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
# WORDCHARS is a string of characters which SWISH permits to
# be in words. Any strings which do not include these characters
# will not be indexed. You can choose from any character in
# the following string:
#
# abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
#
# Note that if you omit "0123456789&#;" you will not be able to
# index HTML entities. DO NOT use the asterisk (*), lesser than
# and greater than signs (<), (>), or colon (:).
#
# Including any of these four characters may cause funny things to happen.
# NOTE: Do not escape \ nor " and they cannot be the first letter in the string
# Commenting out the line will give the defaults
#BeginCharacters m"
# Of the characters that you decide can go into words, this is
# a list of characters that words can begin with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters
#EndCharacters \"\
# Of the characters that you decide can go into words, this is
# a list of characters that words can end with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters
#IgnoreLastChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the end. It is important to also
# set the given chars in the ENDCHARS array, otherwise the word will not
# be indexed because it is considered invalid.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise
#IgnoreFirstChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the beginning. This was to solve
# the problem of parentheses when there is no space between ( and the
# beginning of the word.
# Remember to add the chars to the BEGINCHARS list also.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise
IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.
#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.
IndexComments 0
# This option lets the user decide whether to index the comments in the files.
# The default is 1. Set to 0 if comment indexing is not required.
##################################
# DIRECTIVES for FILESYSTEMS ONLY
# Comment out if using HTTP
###################################
IndexOnly .html .pdf
# Only files with these suffixes will be indexed.
NoContents .gif .xbm .au .mov .mpg .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.
FileFilter .pdf /usr/local/apache/swish/filter-bin/_pdf2html.pl
FileRules pathname contains BudgRep
#FileRules pathname contains .*dir1
#FileRules filename contains # % ~ .bak .orig .old old.
#FileRules title contains construction example pointers
#FileRules directory contains .htaccess
#FileRules filename is index
# Files matching the above criteria will *not* be indexed.
# The pattern matching uses the C library regex.h
################################
# DIRECTIVES for HTTP METHOD ONLY
# Comment out if using FILESYSTEM
##################################
#MaxDepth 5
#(default 5) This defines how many links the spider should
#follow before stopping. A value of 0 configures the spider to
#traverse all links
#Delay 60
#(default 60) The number of seconds to wait between issuing
#requests to a server.
#TmpDir /tmp
#(default /var/tmp) The location of a writeable temp directory
#on your system. The HTTP access method tells the Perl helper to place
#its files there.
#SpiderDirectory /home/ghill/swishRon/src/
#(default ./) The location of the Perl helper
#script. Remember, if you use a relative directory, it is relative to
#your directory when you run SWISH-E, not to the directory that SWISH-E
#is in.
#EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
#EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
#(default nothing) This allows you to deal with
#servers that respond to multiple DNS names. Each line should have
#a list of all the method/names that should be considered equivalent.
#If you have multiple directives, each one defines its own set of equivalent
#servers.
sylvia{julie}16: swish-e.new -c metadata3.conf -i /home/hul/htdocs/ois/systems/aleph/docs/test
Indexing Data Source: "File-System"
Indexing "/home/hul/htdocs/ois/systems/aleph/docs/test"
Checking dir "/home/hul/htdocs/ois/systems/aleph/docs/test"...
acq-approval_plan_titles.pdf - Using DEFAULT (HTML) parser - (292 words)
bestpractice_vendor_code_not_active.pdf - Using DEFAULT (HTML) parser - (285 words)
cat-rept-xpo-fail.html - Using DEFAULT (HTML) parser - (322 words)
print-setup-circacq.bk.html - Using DEFAULT (HTML) parser - (262 words)
serial_claiming_in_Aleph.pdf - Using DEFAULT (HTML) parser - (1707 words)
cat-authrec-conflicts.pdf - Using DEFAULT (HTML) parser - (373 words)
cres_dataentryguidelines.pdf - Using DEFAULT (HTML) parser - (2736 words)
In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_notes":
In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_baks":
In dir "/home/hul/htdocs/ois/systems/aleph/docs/test/_baks/_notes":
Removing very common words...
Getting IgnoreLimit stopwords: Complete
no words removed.
Writing main index...
Sorting words ...
Sorting 1043 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
1043 unique words indexed.
7 properties sorted.
7 files indexed. 859549 total bytes. 5977 total words.
Elapsed time: 00:00:03 CPU time: 00:00:02
Indexing done!
sylvia{julie}17: swish-e.new -w dc:description=acquisitions -f metadata3.index
# SWISH format: 2.2.3
# Search words: dc:description=acquisitions
err: no results
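
One way to test the filter theory would be to save what the filter prints
for a single PDF and list the <meta> names that appear in it. Below is a
minimal sketch in Python, assuming the filter output has already been saved
as an HTML file. MetaCollector and meta_names are illustrative names only,
not part of swish-e or of the filter script.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Lowercase the meta name so DC:Description and dc:description match.
            self.found.setdefault(name.lower(), []).append(attr.get("content") or "")


def meta_names(html_text):
    """Return a dict mapping each meta name to the list of its content values."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python list_meta.py filtered_output.html
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            print(path, meta_names(fh.read()))
```

If dc:description does not show up for a filtered PDF, the tag is probably
not making it through the filter, and swish-e would then have nothing to
index under that metaname.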
Received on Fri Jan 16 17:19:13 2004