Re: Parse Error PDF -> HTML with metatag "keywords"

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Mar 22 2005 - 14:21:32 GMT
On Tue, Mar 22, 2005 at 01:40:46AM -0800, Scheermann Leonard wrote:
> OK, we should let write keywors in one line (no multilines) - it was no
> prblem.

Yes, judging from Google searches, multi-line keywords do not seem to
be well supported.  Plus, it was not clear to me from the PDF
specification whether multi-line Info entries are even allowed.


> The problem is that the meta tag "keywords" is parsed as ">eta
> name="keywords" (see below)!

> ">eta name="keywords" content="F÷rderprogramm LEADER+

That has to be the way your terminal is displaying the text -- maybe
there is an extra \r left in the tag?
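
To illustrate how a stray carriage return could produce exactly that
display, here is a minimal Python sketch.  It is not part of swish-e;
render_with_cr and the sample tag are made up for illustration.  It
simulates a terminal that moves the cursor back to column 0 on \r and
overwrites whatever is already on the line:

```python
def render_with_cr(text):
    """Simulate how a terminal shows one line that contains '\\r'.

    A carriage return moves the cursor to column 0; characters printed
    afterwards overwrite what is already displayed there.
    """
    screen = []  # characters currently visible on the line
    col = 0      # cursor position
    for ch in text:
        if ch == "\r":
            col = 0
            continue
        if col < len(screen):
            screen[col] = ch   # overwrite an existing character
        else:
            screen.append(ch)  # extend the line
        col += 1
    return "".join(screen)


if __name__ == "__main__":
    # Hypothetical tag with a carriage return left before the closing quote.
    tag = '<meta name="keywords" content="LEADER+\r">'
    print(repr(tag))            # repr() makes the hidden \r visible
    print(render_with_cr(tag))  # ">eta name="keywords" content="LEADER+
```

The trailing "> lands on top of the leading <m, which gives the ">eta
prefix.  Viewing the real output with something like cat -v or od -c
would show whether a \r is actually present.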

> Swish-e seems to index the key words ('F÷rderprogramm' and 'LEADER+'):

Not index, but parse, yes.  Use -T indexed_words to see what
actually gets indexed.

> swishe@local:~/swish-e/bin> swish-e -T index_words -S fs
> -c /home/swishe/swish-e/conf/swish.fs.kr.conf
> Indexing Data Source: "File-System"
                         ^^^^^^^^^^^

Notice:

> Indexing "/srv/www/htdocs/krunet"
> 
> Checking dir "/srv/www/htdocs/krunet"...
>   leer.pdf - Using HTML2 parser - White-space found word
> 'http://localhost/krunet/keywords.pdf'
> White-space found word 'Path-Name:'
                        ^^^^^^^^^^^^^
> White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
> White-space found word 'Content-Length:'
                          ^^^^^^^^^^^^^^^

Looks like you are indexing the output from the filter as a plain
text file ( -S fs ).  But the filter outputs a *header* before the
content.  So you should be indexing with -S prog instead.
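
To make the difference concrete, here is a rough Python sketch of how a
reader of that format could separate the header from the content.  It
is not swish-e's code.  The header names Path-Name and Content-Length
are taken from the output above; the blank line after the header and
the use of Content-Length as a byte count are my understanding of the
-S prog format, and parse_prog_stream is a made-up name:

```python
def parse_prog_stream(data):
    """Split a byte stream of header+content documents into a list.

    Each document is assumed to be: 'Name: value' header lines, one
    blank line, then exactly Content-Length bytes of content.
    Returns a list of (headers_dict, content_bytes) tuples.
    """
    docs = []
    pos = 0
    while pos < len(data):
        end = data.find(b"\n\n", pos)  # blank line ends the header
        if end == -1:
            raise ValueError("header not terminated by a blank line")
        headers = {}
        for line in data[pos:end].decode("utf-8", errors="replace").split("\n"):
            # Split on the first colon only, so URL values stay intact.
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        start = end + 2
        docs.append((headers, data[start:start + length]))
        pos = start + length  # the next header begins right after the content
    return docs


if __name__ == "__main__":
    sample = (b"Path-Name: http://localhost/krunet/keywords.pdf\n"
              b"Content-Length: 13\n"
              b"\n"
              b"<html></html>")
    for headers, content in parse_prog_stream(sample):
        print(headers["Path-Name"], len(content))
```

With -S fs nothing does this split, so the header lines are treated as
ordinary text, which is why 'Path-Name:' and 'Content-Length:' appear
as words in your trace.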

> My config file is below:
> FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl

No, DirTree.pl is not a "FileFilter".

"FileFilter" works by swish passing the file name to the filter
program; the filter program reads that file and returns the converted
content to swish.

DirTree.pl (as it comes with the distribution) is a replacement for
the -S fs input method of swish.  It scans a directory tree fetching
files and possibly filtering them.

If you ran 

    $ /home/swishe/swish-e/lib/swish-e/DirTree.pl > out.txt

out.txt would contain all the files fetched (and filtered) by
DirTree.pl.  Each file in out.txt is preceded by a header giving its
name, length, and document type.
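
As a rough illustration of what one entry in out.txt looks like, here
is a small Python sketch that builds a single header plus content.  It
is not taken from DirTree.pl.  Path-Name and Content-Length appear in
the trace output above; the Document-Type header name, the HTML2 value,
and the function name format_prog_document are assumptions on my part:

```python
import sys


def format_prog_document(path_name, content, doc_type=None):
    """Return one document as bytes: header lines, blank line, content.

    content must be bytes, because Content-Length is assumed to be a
    byte count rather than a character count.
    """
    lines = [
        "Path-Name: %s" % path_name,
        "Content-Length: %d" % len(content),
    ]
    if doc_type:
        # Assumed header that tells the indexer which parser to use.
        lines.append("Document-Type: %s" % doc_type)
    header = "\n".join(lines) + "\n\n"  # blank line ends the header
    return header.encode("utf-8") + content


if __name__ == "__main__":
    doc = format_prog_document(
        "/srv/www/htdocs/krunet/keywords.pdf",
        b"<html></html>",
        doc_type="HTML2",
    )
    sys.stdout.buffer.write(doc)
```

Writing bytes rather than text keeps the length in the header in step
with what is actually sent, which matters once the content contains
non-ASCII characters.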

You could then index this by:

    $ swish-e -c swish.config -S prog -i stdin < out.txt

A more common way might be as a pipe:

    $ ./DirTree.pl | swish-e -c swish.config -S prog -i stdin

There's actually a script "filter-bin/swish_filter.pl" that can be
used as a "FileFilter" that uses "SWISH::Filter" to filter any
content (normally with -S fs), but I think DirTree.pl with -S prog is
a better way to go.  (Even though swish -S fs mode has lots of config
options for selecting what files are indexed, you can do much more in
Perl, and using FileFilter with swish_filter.pl requires loading the
SWISH::Filter perl modules for every document, whereas DirTree.pl only
loads them once -- much faster).

I would also recommend starting with a very simple swish.config file
and then adding things as needed.



> UndefinedMetaTags ignore
>    # By default, undefined meta names are indexed as plain text
>    # This feature can change this behaviour.  Here we say
>    # don't index text in metatags unless defined in MetaNames
> 
> MetaNames automatic
>    # MetaNames first author
>    # List of all the meta names used in the file to index, must be on
> one line.
>    # If no metanames DO NOT deleted the line.
>    # New in 2.0 -> automatic option will extract metanames dynamically

MetaNames "automatic" doesn't do anything, IIRC, unless you really
have a metaname called "automatic".


-- 
Bill Moseley
moseley@hank.org

Received on Tue Mar 22 06:21:39 2005