Re: DirTree works in pipe but not config file on PDF

From: Gertjan Hofman <gertjan_hofman(at)not-real.yahoo.com>
Date: Wed Jul 05 2006 - 23:07:44 GMT
Peter,

Took me a day to get back to this. The problem persists
- see below. The path/file is correct, and yet it
claims it's not a PDF.

I wonder if I am just getting an incorrect error and
being misled. I have 5 test files in
/home/ghofman/tmp10: a .doc, .txt, .ppt, .pdf and
.rtf. When I run DirTree directly and pipe it into
swish-e, it parses all files correctly. When I use the
config file, only the .txt and .rtf result in words
going to the index file. See the second run below. It's
unable to parse the .ppt, .doc and .pdf. Am I just
having a path problem or something like that? How do I
know where DirTree is trying to locate the parsing
programs?

Much appreciated

Gertjan
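
A rough sketch of one way to narrow this down, not something from the thread itself. It checks two things from the shell: where a helper program resolves on the current PATH, and whether a file really begins with the "%PDF-" signature. The function names are made up for illustration, and the helper names in the usage comment (pdftotext from xpdf, catdoc) are assumptions about which converters are installed.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, helper names are assumptions.

# Return 0 if the file starts with the PDF signature "%PDF-".
# A missing or unreadable file counts as "not a PDF".
looks_like_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Print where each named helper resolves on the current PATH.
report_helpers() {
    for prog in "$@"; do
        if path=$(command -v "$prog" 2>/dev/null); then
            printf '%s: %s\n' "$prog" "$path"
        else
            printf '%s: not found on PATH\n' "$prog"
        fi
    done
}

# Example use (not run here):
#   report_helpers pdftotext catdoc
#   looks_like_pdf /home/ghofman/tmp10/swish_text.pdf && echo "has PDF signature"
```

If the file on disk has the signature and the helpers resolve, the next thing to suspect is what actually gets handed to the converter when swish-e runs DirTree.pl from the config file, rather than the file or the PATH.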




====RUN ON SINGLE PDF FILE =======

Indexing Data Source: "External-Program"
Indexing "/room/swish_index/DirTree.pl"
External Program found: /room/swish_index/DirTree.pl
Indexing /home/ghofman/tmp10/swish_text.pdf
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.

=== FULL RUN ON DIRECTORY ====


Indexing Data Source: "External-Program"
Indexing "/room/swish_index/DirTree.pl"
External Program found: /room/swish_index/DirTree.pl
Indexing now /home/ghofman/tmp10/swish_text.txt
Indexing now /home/ghofman/tmp10/swish_text.pdf
Indexing now /home/ghofman/tmp10/swish_test.xls
Indexing now /home/ghofman/tmp10/swish_test.doc
Indexing now /home/ghofman/tmp10/swish_test.rtf
Indexing now /home/ghofman/tmp10/swish_test.ppt
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
/swtmpfltr0aS7OK is not OLE file or Error
/swtmpfltrHPmrp9 is not a Word Document.
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 17 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
17 unique words indexed.
4 properties sorted.
5 files indexed.  5,010 total bytes.  22 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00



--- Peter Karman <peter@peknet.com> wrote:

> edit your copy of DirTree.pl like this:
> 
> 
> sub check_path {
>      my $path = shift;
>      print STDERR "Indexing $path\n";
>      return 1;  # return true to process this file
> }
> 
> that will print the name of the path it is about to
> process.
> 
> 
> Gertjan Hofman scribbled on 6/30/06 5:14 PM:
> > Hi Peter,
> > 
> > yes, you are right. Below is the output.  I am finding
> > the order of the output a little confusing - it would
> > be good if SWISH-e would output the file name before
> > it starts processing. Anyway, I am open to
> > suggestions. As far as I can tell, it's just unhappy
> > with the PDF. So to me it seems the PDF parsing is
> > somehow different from the pipe example.
> > 
> > Gertjan
> > 
> > 
> > [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
> > Parsing config file 'swish_file.conf'
> > Indexing Data Source: "External-Program"
> > Indexing "/room/swish_index/DirTree.pl"
> > External Program found: /room/swish_index/DirTree.pl
> > Error: May not be a PDF file (continuing anyway)
> > Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > Error: Couldn't find trailer dictionary
> > Error: Couldn't read xref table
> > /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser -  (no words indexed)
> > 
> > Removing very common words...
> > no words removed.
> > Writing main index...
> > err: No unique words indexed!
> > 
> > --- Peter Karman <peter@peknet.com> wrote:
> > 
> >> I was suggesting that the -v3 option would tell you
> >> if swish-e was in fact parsing swish_test.pdf or if
> >> somehow it was being passed something different. I
> >> just tried your example here and it worked for me, so
> >> I was suggesting a way for you to start to debug
> >> what's going on.
> >>
> >> Gertjan Hofman scribbled on 6/30/06 3:59 PM:
> >>> Peter -
> >>>
> >>> Not sure I understand - I am passing only 1 file -
> >>> swish_test.pdf (as indicated in the config file I
> >>> enclosed).  Of course I started with entire folders,
> >>> but for the sake of demonstrating the problem I only
> >>> parse the one file.
> >>>
> >>> I note there are older messages in the mailing list
> >>> with similar sounding problems - in that case
> >>> spider.pl failed from a config file but worked in a
> >>> pipe...
> >>>
> >>> Thanks
> >>>
> >>> Gertjan
> >>>
> >>>
> >>> --- Peter Karman <peter@peknet.com> wrote:
> >>>
> >>>> Gertjan Hofman scribbled on 6/29/06 11:59 PM:
> >>>>
> >>>>> TRY 1: USING CONFIG FILE
> >>>>>
> >>>>> gertjan-laptop:~/tmp/swish_test> swish-e -S prog -c swish_file.conf
> >>>>> Indexing Data Source: "External-Program"
> >>>>> Indexing "./DirTree.pl"
> >>>>> External Program found: ./DirTree.pl
> >>>>> Error: May not be a PDF file (continuing anyway)
> >>>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> >>>>> Error: Couldn't find trailer dictionary
> >>>>> Error: Couldn't read xref table
> >>>>> Removing very common words...
> >>>>> no words removed.
> >>>>> Writing main index...
> >>>>> err: No unique words indexed!
> >>>>>
> >>>> add the -v3 option to get more verbose. That should
> >>>> tell you the name of the file being parsed with
> >>>> SWISH::Filter (xpdf). I'm betting the file isn't
> >>>> getting passed correctly.
> >>>>
> >>>> -- 
> >>>> Peter Karman  .  http://peknet.com/  . 
> >>>> peter@peknet.com
> >>>>
> >>>
> >>>
> >> -- 
> >> Peter Karman  .  http://peknet.com/  . 
> >> peter@peknet.com
> >>
> > 
> > 
> 
> -- 
> Peter Karman  .  http://peknet.com/  . 
> peter@peknet.com
> 



Received on Wed Jul 5 16:07:50 2006