(no subject)

From: CARUSO Holly <Holly.Caruso(at)not-real.Tenix.com>
Date: Thu Jul 06 2006 - 08:55:21 GMT

Hi,

I'm trying to configure swish-e on my Windows XP machine to index PDF files. I would then like to use the CGI script to provide a web interface.

I have installed ActivePerl from the file ActivePerl-5.8.8.817-MSWin32-x86-257965.msi to C:\Perl

and installed swish-e from the file swish-e-2.4.3-win32.exe to C:\Program Files\SWISH-E

Therefore,

Swish-e version: 2.4.3

Operating System: Windows XP Version 2002 Service Pack 1

My swish.conf looks like:

IndexName "Hardware Datasheets"

IndexDescription "This is an index of hardware datasheets from external sources."

IndexPointer C:\"Program Files"\SWISH-E

IndexAdmin "Swish-e Configuration Admin (holly.caruso@tenix.com)"

IndexDir P:\\datasheets

IndexOnly .pdf

FileFilter .pdf C:\"Program Files"\\SWISH-E\\share\\doc\\swish-e\\filter-bin\\_pdf2html.pl

MetaNames title subject author swishdocpath

UndefinedMetaTags ignore

WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-#,\/=+:

IndexReport 3

IgnoreWords of or and the a to i

TranslateCharacters :ascii7:

BumpPositionCounterCharacters |.

StoreDescription TXT* 10000

StoreDescription HTML* <body> 10000

The _pdf2html.pl looks like:

#! /usr/bin/perl -w

use strict;

# -- Filter PDF to simple HTML for swish

# --

# -- 2000-05  rasc

#

=pod

This filter requires two programs "pdfinfo" and "pdftotext"...

  $ENV{PATH} = C:\\"Program Files"\\SWISH-E\\lib\\swish-e\\

"pdfinfo" extracts...

=cut

my $file = shift || die "Usage: $0 <filename>\n";

#

# -- read pdf meta information

#

..

I have not changed anything else in this file.

As suggested, I ran the indexer on a single file with the following command:

C:\Program Files\SWISH-E>swish-e -i AM29LV128.pdf -T indexed_words

I presume this command doesn't use swish.conf. Some of the output from this command is as follows:

    Adding:[1:swishdefault(1)]   '00000'   Pos:743  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'n'   Pos:744  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   '0000058207'   Pos:745  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'võj'   Pos:800  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'nÞwi'   Pos:801  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'ß'   Pos:802  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'Úõ'   Pos:803  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'mÝi'   Pos:804  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'þjkÚ´'   Pos:805  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'y'   Pos:806  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   ' '   Pos:807  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'Ú'   Pos:808  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   '·¯'   Pos:809  Stuct:0x9 ( BODY FILE )

    Adding:[1:swishdefault(1)]   'Úc'   Pos:810  Stuct:0x9 ( BODY FILE )

Removing very common words...

no words removed.

Writing main index...

Sorting words ...

Sorting 306 words alphabetically

Writing header ...

Writing index entries ...

  Writing word text: Complete

  Writing word hash: Complete

  Writing word data: Complete

306 unique words indexed.

4 properties sorted.

1 file indexed.  652,348 total bytes.  806 total words.

Elapsed time: 00:00:00 CPU time: 00:00:00

Indexing done!
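
For comparison, a run that does read the configuration file would name it with the -c switch. The line below is only a sketch: it assumes swish.conf sits in the current directory, which this message does not state. With -c given and no -i, the IndexDir, IndexOnly and FileFilter settings from the config would apply, so the PDF would pass through the filter first. The run above was given only -i, so it may have read the raw PDF bytes directly, which could explain the binary-looking words.

```
C:\Program Files\SWISH-E>swish-e -c swish.conf -T indexed_words
```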

It looks like it isn't indexing words properly, and I don't know how to fix the problem. Any help would be greatly appreciated, as I'm working to a deadline.

Thank you.

Received on Thu Jul 6 01:55:30 2006