Edit your copy of DirTree.pl like this:

    sub check_path {
        my $path = shift;
        print STDERR "Indexing $path\n";
        return 1;    # return true to process this file
    }

That will print each path just before it is processed.
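
If you want to go a step further, here is a rough sketch of how check_path could also warn when a .pdf path does not begin with the %PDF- header, which is my guess at what the "May not be a PDF file" error is complaining about. looks_like_pdf is a hypothetical helper name of my own, not something that ships with DirTree.pl or swish-e, and the check assumes the header sits at the very start of the file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DirTree.pl): returns 1 if the file
# begins with the five bytes "%PDF-", 0 otherwise or if it can't be read.
sub looks_like_pdf {
    my $path = shift;
    open( my $fh, '<:raw', $path ) or return 0;
    my $n = read( $fh, my $head, 5 );
    close $fh;
    return ( defined $n && $n == 5 && $head eq '%PDF-' ) ? 1 : 0;
}

sub check_path {
    my $path = shift;
    print STDERR "Indexing $path\n";

    # Only a diagnostic: warn about suspicious .pdf files on STDERR.
    if ( $path =~ /\.pdf$/i && !looks_like_pdf($path) ) {
        print STDERR "Warning: $path does not start with %PDF-\n";
    }

    return 1;    # still return true so the file is processed
}
```

The warning goes to STDERR so it does not get mixed into the document stream that swish-e reads from the program's STDOUT.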
Gertjan Hofman scribbled on 6/30/06 5:14 PM:
> Hi Peter,
>
> yes, you are right. Below is the output. I am finding
> the order of the output a little confusing - it would
> be good if SWISH-e would output the file name before
> it starts processing. Anyway, I am open to
> suggestions. As far as I can tell, it's just unhappy
> with the PDF. So to me it seems the PDF parsing is
> somehow different from the pipe example.
>
> Gertjan
>
>
> [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
> Parsing config file 'swish_file.conf'
> Indexing Data Source: "External-Program"
> Indexing "/room/swish_index/DirTree.pl"
> External Program found: /room/swish_index/DirTree.pl
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser - (no words indexed)
>
> Removing very common words...
> no words removed.
> Writing main index...
> err: No unique words indexed!
>
> --- Peter Karman <peter@peknet.com> wrote:
>
>> I was suggesting that the -v3 option would tell you
>> if swish-e was in
>> fact parsing swish_test.pdf or if somehow it was
>> being passed something
>> different. I just tried your example here and it
>> worked for me, so I was
>> suggesting a way for you to start to debug what's
>> going on.
>>
>> Gertjan Hofman scribbled on 6/30/06 3:59 PM:
>>> Peter -
>>>
>>> Not sure I understand - I am passing only 1 file -
>>> swish_test.pdf (as indicated in the config file I
>>> enclosed). Of course I started with entire folders,
>>> but for the sake of demonstrating the problem I only
>>> parse the one file.
>>>
>>> I note there are older messages in the mailing list
>>> with similar sounding problems - in that case
>>> spider.pl failed from a config file but worked in a
>>> pipe...
>>>
>>> Thanks
>>>
>>> Gertjan
>>>
>>>
>>> --- Peter Karman <peter@peknet.com> wrote:
>>>
>>>> Gertjan Hofman scribbled on 6/29/06 11:59 PM:
>>>>
>>>>> TRY 1: USING CONFIG FILE
>>>>>
>>>>> gertjan-laptop:~/tmp/swish_test> swish-e -S prog -c swish_file.conf
>>>>> Indexing Data Source: "External-Program"
>>>>> Indexing "./DirTree.pl"
>>>>> External Program found: ./DirTree.pl
>>>>> Error: May not be a PDF file (continuing anyway)
>>>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>>>>> Error: Couldn't find trailer dictionary
>>>>> Error: Couldn't read xref table
>>>>> Removing very common words...
>>>>> no words removed.
>>>>> Writing main index...
>>>>> err: No unique words indexed!
>>>>>
>>>> add the -v3 option for more verbose output. That should
>>>> tell you the name of the file being parsed with
>>>> SWISH::Filter (xpdf). I'm betting the file isn't getting
>>>> passed correctly.
>>>>
>>>> --
>>>> Peter Karman . http://peknet.com/ . peter@peknet.com
>>>>
>>>
>>> __________________________________________________
>>> Do You Yahoo!?
>>> Tired of spam? Yahoo! Mail has the best spam protection around
>>> http://mail.yahoo.com
>>>
>> --
>> Peter Karman . http://peknet.com/ . peter@peknet.com
>>
>
>
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Mon Jul 3 09:27:52 2006