Re: DirTree works in pipe but not config file on PDF

From: Gertjan Hofman <gertjan_hofman(at)not-real.yahoo.com>
Date: Thu Jul 06 2006 - 18:33:03 GMT
Problem - kind of solved.  It turns out that the
FileFilter directive in the conf file mucks up the
DirTree.pl program, i.e.:
FileFilter .pdf       /usr/bin/pdftotext   "'%p' -"

This line, which works fine when *not* using -S prog, seems to
interfere when using -S prog with DirTree.pl.

Clearly I am not understanding something. The
documentation would suggest that there are TWO
independent methods - FileFilter, or SWISH::Filter,
the latter being invoked by DirTree.pl.  So why does the
FileFilter directive matter when using DirTree.pl, and
why does it muck up the PDF parsing?  Odd.
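
For reference, a minimal sketch of keeping the two methods in
separate config fragments, so that FileFilter is only present when
-S prog is not used. The directives and paths are the ones quoted in
this thread; splitting them into two files is an assumption, not
something the documentation or this thread prescribes. One possible
reading of the errors (not confirmed here) is that with -S prog the
FileFilter command is also applied to what DirTree.pl has already
converted, so pdftotext is handed something that is no longer a PDF.

```
# --- Fragment A: filesystem indexing, run WITHOUT -S prog ---
# swish-e itself calls pdftotext through FileFilter.
IndexDir    /home/ghofman/tmp10
FileFilter  .pdf       /usr/bin/pdftotext   "'%p' -"

# --- Fragment B: a separate file, run with: swish-e -c <file> -S prog ---
# DirTree.pl converts documents via SWISH::Filter,
# so this fragment carries no FileFilter line.
IndexDir             /room/swish_index/DirTree.pl
SwishProgParameters  /home/ghofman/tmp10
```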

Thanks for your help, Peter.

Gertjan



--- Peter Karman <peter@peknet.com> wrote:

> here's my test. see if you can mimic it exactly:
> 
> [karpet@cartermac:~/tmp/s]$ swish-e -c conf -S prog -v3 -W0
> Parsing config file 'conf'
> Indexing Data Source: "External-Program"
> Indexing "/usr/local/lib/swish-e/DirTree.pl"
> External Program found: /usr/local/lib/swish-e/DirTree.pl
> Indexing ./test.pdf
> ./test.pdf - Using HTML2 parser -  (38 words)
> 
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 26 words alphabetically
> Writing header ...
> Writing index entries ...
>    Writing word text: Complete
>    Writing word hash: Complete
>    Writing word data: Complete
> 26 unique words indexed.
> 4 properties sorted.
> 1 file indexed.  583 total bytes.  38 total words.
> Elapsed time: 00:00:03 CPU time: 00:00:00
> Indexing done!
> [karpet@cartermac:~/tmp/s]$ cat conf
> #
> IndexDir /usr/local/lib/swish-e/DirTree.pl
> 
> SwishProgParameters test.pdf
> 
> # end of the config file
> 
> 
> 
> Since you say that it works fine if you run DirTree.pl directly on the
> files, I don't suspect a bad .pdf file etc. I'm not sure what's going on
> with your setup -- maybe try the full path to the DirTree.pl command?
> 
> 
> 
> 
> 
> Gertjan Hofman scribbled on 7/5/06 6:05 PM:
> > Peter,
> > 
> > Took me a day to get back to this. The problem persists
> > - see below. The path/file is correct and yet it
> > claims it's not a PDF.
> > 
> > I wonder if I am just getting an incorrect error and I
> > am being misled. I have 5 test files in
> > /home/ghofman/tmp10: a .doc, .txt, .ppt, .pdf and
> > .rtf. When I run DirTree directly and pipe into swish-e
> > it parses all files correctly. When I use the config
> > file, only the .txt and .rtf result in words going to
> > the index file. See the second run below. It's unable
> > to parse the ppt, doc and pdf. Am I just having a path
> > problem or something like that? How do I know where
> > DirTree is trying to locate the parsing programs?
> > 
> > Much appreciated
> > 
> > Gertjan
> > 
> > 
> > 
> > 
> > ====RUN ON SINGLE PDF FILE =======
> > 
> > Indexing Data Source: "External-Program"
> > Indexing "/room/swish_index/DirTree.pl"
> > External Program found: /room/swish_index/DirTree.pl
> > Indexing /home/ghofman/tmp10/swish_text.pdf
> > Error: May not be a PDF file (continuing anyway)
> > Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > Error: Couldn't find trailer dictionary
> > Error: Couldn't read xref table
> > Removing very common words...
> > no words removed.
> > Writing main index...
> > err: No unique words indexed!
> > .
> > 
> > === FULL RUN ON DIRECTORY ====
> > 
> > 
> > Indexing Data Source: "External-Program"
> > Indexing "/room/swish_index/DirTree.pl"
> > External Program found: /room/swish_index/DirTree.pl
> > Indexing now /home/ghofman/tmp10/swish_text.txt
> > Indexing now /home/ghofman/tmp10/swish_text.pdf
> > Indexing now /home/ghofman/tmp10/swish_test.xls
> > Indexing now /home/ghofman/tmp10/swish_test.doc
> > Indexing now /home/ghofman/tmp10/swish_test.rtf
> > Indexing now /home/ghofman/tmp10/swish_test.ppt
> > Error: May not be a PDF file (continuing anyway)
> > Error (0): PDF file is damaged - attempting to reconstruct xref table...
> > Error: Couldn't find trailer dictionary
> > Error: Couldn't read xref table
> > ./swtmpfltr0aS7OK is not OLE file or Error
> > ./swtmpfltrHPmrp9 is not a Word Document.
> > Removing very common words...
> > no words removed.
> > Writing main index...
> > Sorting words ...
> > Sorting 17 words alphabetically
> > Writing header ...
> > Writing index entries ...
> >   Writing word text: Complete
> >   Writing word hash: Complete
> >   Writing word data: Complete
> > 17 unique words indexed.
> > 4 properties sorted.
> > 5 files indexed.  5,010 total bytes.  22 total words.
> > Elapsed time: 00:00:03 CPU time: 00:00:00
> > 
> > 
> > 
> > --- Peter Karman <peter@peknet.com> wrote:
> > 
> >> edit your copy of DirTree.pl like this:
> >>
> >>
> >> sub check_path {
> >>      my $path = shift;
> >>      print STDERR "Indexing $path\n";
> >>      return 1;  # return true to process this file
> >> }
> >>
> >> that will print the name of the path it is about to
> >> process.
> >>
> >>
> >> Gertjan Hofman scribbled on 6/30/06 5:14 PM:
> >>> Hi Peter,
> >>>
> >>> yes, you are right. Below is the output.  I am finding
> >>> the order of the output a little confusing - it would
> >>> be good if SWISH-e would output the file name before
> >>> it starts processing. Anyway, I am open to
> >>> suggestions. As far as I can tell, it's just unhappy
> >>> with the PDF. So to me it seems the PDF parsing is
> >>> somehow different from the pipe example.
> >>>
> >>> Gertjan
> >>>
> >>>
> >>> [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
> >>> Parsing config file 'swish_file.conf'
> >>> Indexing Data Source: "External-Program"
> >>> Indexing "/room/swish_index/DirTree.pl"
> >>> External Program found: /room/swish_index/DirTree.pl
> >>> Error: May not be a PDF file (continuing anyway)
> >>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> >>> Error: Couldn't find trailer dictionary
> >>> Error: Couldn't read xref table
> >>> /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser -  (no words indexed)
> >>>
> >>> Removing very common words...
> >>> no words removed.
> >>> Writing main index...
> >>> err: No unique words indexed!
> >>>
> >>> --- Peter Karman <peter@peknet.com> wrote:
> >>>
> >>>> I was suggesting that the -v3 option would tell
> >> you
> >>>> if swish-e was in 
> 
=== message truncated ===


Received on Thu Jul 6 11:33:20 2006