Re: DirTree works in pipe but not config file on PDF

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Jul 03 2006 - 16:27:41 GMT

Edit your copy of DirTree.pl like this:


sub check_path {
     my $path = shift;
     print STDERR "Indexing $path\n";
     return 1;  # return true to process this file
}

That prints each path to STDERR just before DirTree.pl processes it.
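
Once check_path has printed a path, it may also be worth checking that the file really starts with the PDF signature. As far as I know, xpdf's "May not be a PDF file" message appears when it cannot find the "%PDF-" header near the start of the file. The following is only a rough sketch: looks_like_pdf is a hypothetical helper, not part of swish-e or DirTree.pl, and it assumes your head supports the -c option.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): succeed if the file begins
# with the "%PDF-" signature that PDF readers such as xpdf look for.
looks_like_pdf() {
    # Read only the first 5 bytes; assumes head supports -c.
    magic=$(head -c 5 "$1" 2>/dev/null)
    [ "$magic" = "%PDF-" ]
}

# When run with a path argument, report the result for that file.
if [ -n "${1:-}" ]; then
    if looks_like_pdf "$1"; then
        echo "$1: starts with %PDF-"
    else
        echo "$1: does not start with %PDF-"
    fi
fi
```

Saved as, say, looks_like_pdf.sh (the name is arbitrary), it could be run against whatever path check_path reported, for example: sh looks_like_pdf.sh /home/ghofman/tmp10/swish_text.pdf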


Gertjan Hofman scribbled on 6/30/06 5:14 PM:
> Hi Peter,
> 
> yes, you are right. Below is the output. I am finding the order of
> the output a little confusing - it would be good if SWISH-e would
> output the file name before it starts processing. Anyway, I am open
> to suggestions. As far as I can tell, it's just unhappy with the PDF.
> So to me it seems the PDF parsing is somehow different from the
> pipe example.
> 
> Gertjan
> 
> 
> [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
> Parsing config file 'swish_file.conf'
> Indexing Data Source: "External-Program"
> Indexing "/room/swish_index/DirTree.pl"
> External Program found: /room/swish_index/DirTree.pl
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser -  (no words indexed)
> 
> Removing very common words...
> no words removed.
> Writing main index...
> err: No unique words indexed!
> 
> --- Peter Karman <peter@peknet.com> wrote:
> 
>> I was suggesting that the -v3 option would tell you if swish-e was
>> in fact parsing swish_test.pdf or if somehow it was being passed
>> something different. I just tried your example here and it worked
>> for me, so I was suggesting a way for you to start to debug what's
>> going on.
>>
>> Gertjan Hofman scribbled on 6/30/06 3:59 PM:
>>> Peter -
>>>
>>> Not sure I understand - I am passing only 1 file -
>>> swish_test.pdf (as indicated in the config file I enclosed). Of
>>> course I started with entire folders, but for the sake of
>>> demonstrating the problem I only parse the one file.
>>>
>>> I note there are older messages in the mailing list with
>>> similar-sounding problems - in that case spider.pl failed from a
>>> config file but worked in a pipe...
>>>
>>> Thanks
>>>
>>> Gertjan
>>>
>>>
>>> --- Peter Karman <peter@peknet.com> wrote:
>>>
>>>> Gertjan Hofman scribbled on 6/29/06 11:59 PM:
>>>>
>>>>> TRY 1: USING CONFIG FILE
>>>>>
>>>>> gertjan-laptop:~/tmp/swish_test> swish-e -S prog -c swish_file.conf
>>>>> Indexing Data Source: "External-Program"
>>>>> Indexing "./DirTree.pl"
>>>>> External Program found: ./DirTree.pl
>>>>> Error: May not be a PDF file (continuing anyway)
>>>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>>>>> Error: Couldn't find trailer dictionary
>>>>> Error: Couldn't read xref table
>>>>> Removing very common words...
>>>>> no words removed.
>>>>> Writing main index...
>>>>> err: No unique words indexed!
>>>>>
>>>> add the -v3 option to get more verbose output. That should tell
>>>> you the name of the file being parsed with SWISH::Filter (xpdf).
>>>> I'm betting the file isn't getting passed correctly.
>>>>
>>>> -- 
>>>> Peter Karman  .  http://peknet.com/  . 
>>>> peter@peknet.com
>>>>
>>>
>> -- 
>> Peter Karman  .  http://peknet.com/  . 
>> peter@peknet.com
>>
> 
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Mon Jul 3 09:27:52 2006