Re: Re: Parse Error PDF -> HTML with metatag "keywords"

From: Scheermann Leonard <Leonard.Scheermann(at)not-real.DLE-M.Bayern.de>
Date: Tue Mar 22 2005 - 09:41:57 GMT

Hi Bill!

OK, we should write the keywords on one line (no multi-line values). That
was no problem.

The problem is that the meta tag "keywords" is parsed as ">eta
name="keywords" (see below)!

swish-filter-test produces the following output:

swishe@local:~/swish-e/bin> ./swish-filter-test -content /daten/intranet/krunet/keywords.pdf

Document /daten/intranet/krunet/keywords.pdf was  filtered.
   Document:     /daten/intranet/krunet/keywords.pdf
(/daten/intranet/krunet/keywords.pdf)
   Content-Type: text/html
   Parser type:  HTML*

   >Filter used: SWISH::Filters::Pdf2HTML=HASH(0x845ec40)
( application/pdf -> text/html )
<html>
<head>
<meta name="author" content="Rieser Nachrichten">
<meta name="creationdate" content="Sun Mar  6 18:27:40 2005">
<meta name="encrypted" content="no">
<meta name="file_size" content="21374 bytes">
">eta name="keywords" content="Förderprogramm LEADER+
<meta name="moddate" content="Tue Mar 22 09:38:43 2005">
<meta name="optimized" content="yes">
<meta name="page_size" content="595 x 842 pts (A4)">
<meta name="pages" content="1">
<meta name="pdf_version" content="1.5">
<meta name="producer" content="Acrobat Distiller 6.0.1 (Windows)">
<meta name="subject" content="LEADER+-Projekte sollen durch Netzwerk
verbunden und für alle nutzbar gemacht werden">
<meta name="tagged" content="yes">
<meta name="title" content="Neuer Schwung für Monheimer Alb">
</head>
<body>
<pre>



</pre>
</body>
</html>
swishe@local:~/swish-e/bin>
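
One possible explanation for the odd-looking line, offered only as an assumption that the output above does not confirm: if the keywords value in the PDF ends in a carriage return, the filter would emit a well-formed tag, and the terminal would merely draw it strangely. The sketch below simulates how a terminal renders such a line. The helper name render_on_terminal and the sample raw_line are hypothetical.

```python
def render_on_terminal(raw):
    """Simulate how a terminal draws a single line that contains carriage returns."""
    cells = []  # characters currently visible on the line
    col = 0     # current cursor column
    for ch in raw:
        if ch == "\r":
            col = 0          # CR moves the cursor back to column 0
            continue
        if col < len(cells):
            cells[col] = ch  # overwrite a character that was already drawn
        else:
            cells.append(ch)
        col += 1
    return "".join(cells)


# Hypothetical filter output: a normal meta tag whose value ends in "\r"
raw_line = '<meta name="keywords" content="Förderprogramm LEADER+\r">'
print(render_on_terminal(raw_line))
```

If that assumption holds, piping the swish-filter-test output through a tool that shows control characters (for example `cat -v` or `od -c`) should reveal the carriage return.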

Swish-e seems to index the key words ('Förderprogramm' and 'LEADER+'):

swishe@local:~/swish-e/bin> swish-e -T index_words -S fs -c /home/swishe/swish-e/conf/swish.fs.kr.conf
Indexing Data Source: "File-System"
Indexing "/srv/www/htdocs/krunet"

Checking dir "/srv/www/htdocs/krunet"...
  leer.pdf - Using HTML2 parser - White-space found word
'http://localhost/krunet/keywords.pdf'
White-space found word 'Path-Name:'
White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
White-space found word 'Content-Length:'
White-space found word '861'
White-space found word 'Last-Mtime:'
White-space found word '1111480257'
White-space found word 'Document-Type:'
White-space found word 'HTML*'
White-space found word 'Neuer'
White-space found word 'Schwung'
White-space found word 'für'
White-space found word 'Monheimer'
White-space found word 'Alb'
White-space found word 'Förderprogramm'
White-space found word 'LEADER+'
White-space found word 'Path-Name:'
White-space found word '/srv/www/htdocs/krunet/keywords.pdf'
White-space found word 'Content-Length:'
White-space found word '861'
White-space found word 'Last-Mtime:'
White-space found word '1111480257'
White-space found word 'Document-Type:'
White-space found word 'HTML*'
 (40 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 28 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
28 unique words indexed.
6 properties sorted.
1 file indexed.  21,374 total bytes.  52 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
swishe@local:~/swish-e/bin>

My config file is below:

   #####################################################
   #  Swish-e config to index the Intranet BZA files   #
   #                                                   #
   #   Use swish-e for indexing the /krunet folder     #
   #####################################################

IndexDir /srv/www/htdocs/krunet
   # Specify the program /folder to run

FollowSymLinks yes
   # Follow symbolic links in indexing
   
IndexName "Intranet DLE Krumbach"
IndexDescription "Index der Intranet Krumbach Dateien auf dem bza Rechner."
   # Name and description of the index

IndexFile /home/swishe/swish-e/index/intra_kr_fs.index
   # Index file name

FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl
   # Using DirTree.pl for filtering .pdf files

IndexContents HTML2 .htm .html .pdf
IndexContents TXT2 .doc .xls
   # With HTML2 use libxml2 library (recommended)
   # With TXT2 for Word and Excel files

IndexOnly .htm .html .pdf .txt .doc .xls
   # Index only files ending in .htm, ... .

IgnoreWords File: /home/swishe/swish-e/share/doc/swish-e/examples/conf/stopwords/german.txt
   # words to be ignored by indexing
   
ReplaceRules replace "/srv/www/htdocs/" "http://localhost/"
   # Allows you to make changes to file pathnames before they're indexed.
   # These changed file names or URLs will be returned in search results.

PropertyNamesDate created_on
   # Tell Swish that you have a property called created_on, and that it's a timestamp
   
  #PropertyNames title author
   # List of meta tags names that can be retrieved with the -p option.
   # Index size increases according to the formula in the manual.
   # Comment out if no PropertyNames. Case insensitive

PropertyNameAlias swishtitle title
   # alias title to swishtitle

Metanames swishtitle swishdocpath swishlastmodified keywords
   # Allow extra searching by title, path, date

UndefinedMetaTags ignore
   # By default, undefined meta names are indexed as plain text
   # This feature can change this behaviour.  Here we say
   # don't index text in metatags unless defined in MetaNames

MetaNames automatic
   # MetaNames first author
   # List of all the meta names used in the file to index, must be on one line.
   # If there are no metanames, DO NOT delete the line.
   # New in 2.0 -> automatic option will extract metanames dynamically

StoreDescription TXT* 200000
StoreDescription HTML* <body> 200000
   # Set StoreDescription for each parser
   # to display context with search results

FileRules pathname contains '/0_'
   # Don't index the directory with "0_"

FileRules filename contains '/0_' linker_frame
   # And don't index any files with "0_" and "linker_frame.htm" # and 'Kopie von 0_hauptseite'

IndexReport 3
   # This is how detailed you want reporting. You can specify numbers
   # 0 to 3 - 0 is totally silent, 3 is the most verbose.
   # 4 is debugging.  Can be overridden with -v on the command line

ParserWarnLevel 1
   # Sets the error level when using the libxml2 parser for XML and HTML.
   # libxml2 will point out structural errors in your documents.
   # 0 = no report 1 = fatal errors 2 = errors 3 = warnings

But searching for the words 'LEADER' or 'Förderprogramm' returns no results!
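
A minimal sketch of one thing to try, assuming the keywords are indexed only under the "keywords" metaname from the Metanames line above, so that a plain search of the default metaname does not see them. It builds a swish-e command line with a metaname-qualified query using -f and -w. The helper name metaname_search_args is hypothetical, and the sketch only prints the command rather than running it.

```python
def metaname_search_args(index_file, metaname, words):
    """Build a swish-e argument list whose query is restricted to one metaname."""
    # swish-e query syntax: metaname=(word1 or word2)
    query = "{}=({})".format(metaname, " or ".join(words))
    return ["swish-e", "-f", index_file, "-w", query]


args = metaname_search_args(
    "/home/swishe/swish-e/index/intra_kr_fs.index",  # IndexFile from the config
    "keywords",
    ["LEADER", "Förderprogramm"],
)
print(" ".join(args))
```

When typing the resulting command into a shell by hand, the query after -w needs quoting because of the parentheses.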

Is my config file wrong?

Why does swish-filter-test display the keywords tag like this?
...
">eta name="keywords" content="Förderprogramm LEADER+
...

With regards
Leonard Scheermann

>On Mon, Mar 21, 2005 at 06:30:49PM +0100, Scheermann Leonard wrote:
>> pdfinfo parses just the first line of "Keywords":
>> 
>> pdfinfo keywords.pdf
>> "
>> Title:          Neuer Schwung für Monheimer Alb
>> Subject:        LEADER+-Projekte sollen durch Netzwerk verbunden und für
>> alle nutzbar gemacht werden
>
>Is that wrapped from your mail program or did pdfinfo wrap that?
>
>> Keywords:       Förderprogramm LEADER+
>
>So, pdfinfo truncated that?
>
>A quick google turned up this:
>
>  http://www.tug.org/pipermail/pdftex/2003-December/004649.html
>
>You might try asking the author of xpdf about this.
>
Received on Tue Mar 22 01:42:02 2005