On Tue, Mar 22, 2005 at 01:40:46AM -0800, Scheermann Leonard wrote:
> OK, we should write keywords on one line (no multilines) - that was no
> problem.
Yes, from Google searches it does not seem that multi-line
keywords are well supported. Plus, it was not clear to me from the
PDF specification whether multi-line Info entries are even allowed.
> The problem is that the meta tag "keywords" is parsed as ">eta
> name="keywords" (see below)!
> ">eta name="keywords" content="Förderprogramm LEADER+
That has to be the way your terminal is displaying the text -- maybe
there is an extra \r left in the tag?
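One way to check for a stray carriage return is to make control characters visible. The sketch below feeds a made-up meta tag (not the actual filter output) through cat -v, which prints \r as ^M. A \r just before the closing quote would send the cursor back to the start of the line, so the trailing "> would overwrite the leading <m and a terminal would show ">eta.

```shell
# show_cr: make carriage returns visible; cat -v prints each \r as ^M.
show_cr() {
    cat -v
}

# Hypothetical sample line with a stray \r before the closing quote.
printf '<meta name="keywords" content="LEADER+\r">\n' | show_cr
# prints: <meta name="keywords" content="LEADER+^M">
```

In practice the input would be the filter's real output, piped through the same function instead of the printf line.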
> Swish-e seems to index the key words ('Förderprogramm' and 'LEADER+'):
Parsed, yes, but not necessarily indexed. Use -T indexed_words to see
what actually gets indexed.
> swishe@local:~/swish-e/bin> swish-e -T index_words -S fs
> -c /home/swishe/swish-e/conf/swish.fs.kr.conf
> Indexing Data Source: "File-System"
                         ^^^^^^^^^^^
Notice:
> Indexing "/srv/www/htdocs/krunet"
>
> Checking dir "/srv/www/htdocs/krunet"...
> leer.pdf - Using HTML2 parser - White-space found word
> 'http://localhost/krunet/keywords.pdf'
> White-space found word 'Path-Name:'
                         ^^^^^^^^^^^^^
> White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
> White-space found word 'Content-Length:'
                          ^^^^^^^^^^^^^^^
It looks like you are indexing the output from the filter as a plain
text file (-S fs). But the filter outputs a *header* before the
content, so you should be indexing with -S prog instead.
> My config file is below:
> FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl
No, DirTree.pl is not a "FileFilter".
"FileFilter" works by passing the filter program a file name; the
filter program reads the file named by swish, converts it, and
writes the result to standard output.
DirTree.pl (as it comes with the distribution) is a replacement for
the -S fs input method of swish. It scans a directory tree fetching
files and possibly filtering them.
If you ran
$ /home/swishe/swish-e/lib/swish-e/DirTree.pl > out.txt
out.txt would contain all the files fetched (and filtered) by
DirTree.pl. Each file in out.txt is preceded by a header giving its
name, length, and document type.
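To make the header layout concrete, here is a minimal sketch of a shell function that writes one document in that form. Path-Name: and Content-Length: appear in the debug output quoted above; the Document-Type: header name and the TXT* value are from my recollection of the -S prog documentation, so treat them as assumptions. emit_doc is a made-up name for illustration.

```shell
# emit_doc PATH: print one document as header lines, a blank line, then content.
emit_doc() {
    path=$1
    # Content-Length is the size in bytes; tr strips padding some wc versions add.
    size=$(wc -c < "$path" | tr -d ' ')
    printf 'Path-Name: %s\n' "$path"
    printf 'Content-Length: %s\n' "$size"
    # Assumed header name and value; check them against the swish-e docs.
    printf 'Document-Type: %s\n' 'TXT*'
    # The blank line ends the header.
    printf '\n'
    cat "$path"
}
```

The output of something like "emit_doc notes.txt" (notes.txt being a placeholder) could then be piped to swish-e with -S prog -i stdin, the same way as DirTree.pl's output.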
You could then index this by:
$ swish-e -c swish.config -S prog -i stdin < out.txt
A more common way might be as a pipe:
$ ./DirTree.pl | swish-e -c swish.config -S prog -i stdin
There's actually a script, "filter-bin/swish_filter.pl", that can be
used as a "FileFilter". It uses SWISH::Filter to filter any
content (normally with -S fs), but I think DirTree.pl with -S prog is
a better way to go. Even though swish's -S fs mode has lots of config
options for selecting which files are indexed, you can do much more in
Perl. Also, using FileFilter with swish_filter.pl requires loading the
SWISH::Filter Perl modules for every document, whereas DirTree.pl
loads them only once, which is much faster.
I would also recommend starting with a very simple swish.config file
and then adding things as needed.
> UndefinedMetaTags ignore
> # By default, undefined meta names are indexed as plain text
> # This feature can change this behaviour. Here we say
> # don't index text in metatags unless defined in MetaNames
>
> MetaNames automatic
> # MetaNames first author
> # List of all the meta names used in the file to index, must be on
> one line.
> # If no metanames DO NOT deleted the line.
> # New in 2.0 -> automatic option will extract metanames dynamically
MetaNames "automatic" doesn't do anything, IIRC, unless you really
have a metaname called "automatic".
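As a rough sketch of a simple starting point, assuming the goal is only to search the "keywords" meta tag, a config could name that meta name explicitly. The index file name here is a placeholder, not something from the original config.

```
# Minimal swish.config sketch; the index file name is a placeholder.
IndexFile ./index.swish-e
# Declare the meta names to index explicitly, all on one line.
MetaNames keywords
```

Other directives, such as UndefinedMetaTags, can be added back one at a time once this much indexes as expected.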
--
Bill Moseley
moseley@hank.org
Received on Tue Mar 22 06:21:39 2005