Re: Indexing pdf files

From: David Cogley <david(at)not-real.cogley.com>
Date: Wed Jan 29 2003 - 23:42:47 GMT

Bill,

Thank you for the quick response.

My results vary from yours.  Here is what I did.
1) Set up a configuration file identical to yours; 1 line.
2) From the command line in the directory with the configuration file:
     swish-e -c c -i /usr/share/cups/doc/translation.pdf > index.log 2>index_err.log
3) Copied several files to my Windows box and converted them to Windows text
files.

From the attached file dir_log.txt, you can see that an index was created.
From the attached file index_log.txt, you can see that 3 words were indexed.
From the attached file index_err_log.txt, you can see that there seems to be
a problem with _pdf2html.pl at line 101 with a tr///.

Do you have any more thoughts on this?

Thanks!
David Cogley
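
A note on the warnings in index_err_log.txt: the bytes they name, 0xae and
0xb7, both have the bit pattern 10xxxxxx, which UTF-8 reserves for
continuation bytes. That would fit text that is really single-byte
(ISO-8859-1, where these are the registered sign and the middle dot) being
treated as UTF-8 by Perl. This is an assumption about the cause, not something
the logs confirm. A minimal Python sketch of the byte-level check (the helper
names are made up for illustration):

```python
import unicodedata


def is_continuation_byte(b: int) -> bool:
    """True if b has the UTF-8 continuation-byte pattern 10xxxxxx."""
    return b & 0xC0 == 0x80


def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two byte values reported in the warnings.
for b in (0xAE, 0xB7):
    raw = bytes([b])
    # Assumption: the text is ISO-8859-1; name the character it would be there.
    latin1_name = unicodedata.name(raw.decode("latin-1"))
    print(hex(b), is_continuation_byte(b), is_valid_utf8(raw), latin1_name)
```

If that is what is happening, the byte values themselves are harmless; the
question is why the filter's input is being read as UTF-8 at all.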


-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Tuesday, January 28, 2003 5:52 PM
To: David Cogley
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Indexing pdf files


On Tue, 28 Jan 2003, David Cogley wrote:

> I'm having difficulty with indexing pdf files.  I create a large index, but
> it seems to be garbage.  "strings gimppdr" gives me no terms I expected.

Well, seems like you are setting it up correctly.

Let me try:

$ cat c
FileFilter .pdf ./_pdf2html.pl

$ ../src/swish-e -c c -i /usr/share/cups/doc-root/translation.pdf
Indexing Data Source: "File-System"
Indexing "/usr/share/cups/doc-root/translation.pdf"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 638 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
638 unique words indexed.
4 properties sorted.
1 file indexed.  50985 total bytes.  3066 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!


$ ../src/swish-e -w cups
# SWISH format: 2.3.4
# Search words: cups
# Removed stopwords:
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.031 seconds
1000 /usr/share/cups/doc-root/translation.pdf "CUPS Translation Guide" 50985
.


$ ../src/swish-e -T index_words_only | wc -l
    639

$ ../src/swish-e -T index_words_only | tail
will
windows
with
within
world
would
x
you
your

Can you repeat that with your pdf file?


--
Bill Moseley moseley@hank.org
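
One quick way to compare two runs is to pull the numbers out of the summary
line that swish-e prints at the end of indexing. The sketch below parses that
line as it appears in the logs in this thread. The function names and the
words-per-kilobyte threshold are hypothetical, chosen only for illustration.

```python
import re

# Matches e.g. "1 file indexed.  50985 total bytes.  3066 total words."
SUMMARY = re.compile(
    r"(\d+) files? indexed\.\s+(\d+) total bytes\.\s+(\d+) total words\."
)


def parse_summary(log_text: str):
    """Return file, byte and word counts from an indexing log, or None."""
    m = SUMMARY.search(log_text)
    if m is None:
        return None
    files, total_bytes, total_words = map(int, m.groups())
    return {"files": files, "bytes": total_bytes, "words": total_words}


def looks_unfiltered(summary, min_words_per_kb: float = 1.0) -> bool:
    """Flag runs with very few words per kilobyte of input.

    The threshold is an arbitrary illustration, not a swish-e setting.
    """
    kb = summary["bytes"] / 1024
    return kb > 0 and summary["words"] / kb < min_words_per_kb


# Summary lines copied from the two runs in this thread.
working = parse_summary("1 file indexed.  50985 total bytes.  3066 total words.")
failing = parse_summary("1 file indexed.  50985 total bytes.  3 total words.")
print(looks_unfiltered(working), looks_unfiltered(failing))
```

A run that reports the same byte count but only a handful of words suggests
the filter produced almost no usable text for that file.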



[Attachment: dir_log.txt]

total 424
-rw-rw-r--    1 david    david         103 Jan 29 11:02 c
-rw-rw-r--    1 david    david           0 Jan 29 11:04 dir.log
-rw-rw-r--    1 david    david       10498 Jan 29 11:03 index_err.log
-rw-rw-r--    1 david    david        1108 Jan 29 11:03 index.log
-rw-r--r--    1 david    david      402249 Jan 29 11:03 index.swish-e
-rw-r--r--    1 david    david          73 Jan 29 11:03 index.swish-e.prop


[Attachment: index_log.txt]

Indexing Data Source: "File-System"
Indexing "/usr/share/cups/doc/translation.pdf"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
3 unique words indexed.
4 properties sorted.
1 file indexed.  50985 total bytes.  3 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!


------=_NextPart_000_00DC_01C2C789.D55994E0
Content-Type: text/plain;
	name="index_err_log.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="index_err_log.txt"

Malformed UTF-8 character (unexpected continuation byte 0xae, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 77.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 99.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 101.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 103.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 105.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 107.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 109.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 135.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 137.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 139.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 141.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 143.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 145.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 147.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 149.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 151.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 153.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 161.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 163.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 165.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 167.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 169.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 171.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 173.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 175.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 177.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 179.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 181.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 183.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 217.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 219.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 221.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 223.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 225.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 227.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 229.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 231.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 233.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 235.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 237.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 239.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 241.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 243.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 245.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 247.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 249.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 251.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 253.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 255.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 257.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 259.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 261.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 263.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 265.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 267.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 269.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 576.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 578.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 580.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 582.


Received on Wed Jan 29 23:43:02 2003