Re: Indexing .doc .ppt .xls with filters and prog method

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Aug 19 2005 - 11:16:44 GMT

The .pm files:

  doc2txt.pm
  pdf2html.pm
  pdf2xml.pm

are example modules that predate (iirc) the SWISH::Filters class. The reason 
pdf2html works in your script is this line in the pdf2html.pm file:

   @EXPORT = qw(pdf2html);

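which tells Perl to import that function into your script's namespace when you
load the module with 'use'. Without an equivalent export, a function defined in
a module is not visible as a bare name in your script, and calling it produces
an "Undefined subroutine &main::..." error like the ones you are seeing.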
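
To illustrate the export mechanism described above, here is a minimal,
self-contained sketch. The package and function names (My::WithExport,
My::NoExport, with_export, no_export) are hypothetical and are not part of the
swish-e distribution. It also assumes, without having checked the other .pm
files, that the failing modules simply lack an @EXPORT line of their own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical module that exports its function, the way pdf2html.pm does.
{
    package My::WithExport;
    use Exporter ();
    our @ISA    = ('Exporter');
    our @EXPORT = qw(with_export);
    sub with_export { return 'called with_export' }
}

# Hypothetical module that defines a function but exports nothing.
{
    package My::NoExport;
    sub no_export { return 'called no_export' }
}

package main;

# 'use Module;' loads the file and then calls Module->import at compile time.
# Both packages live in this file, so only the import step is needed here.
My::WithExport->import;

# The exported name is now visible in main.
my $ok = with_export();

# The unexported name is not, so the bare call dies at run time.
my $err = eval { no_export(); 1 } ? '' : $@;

# A fully qualified call works without any export.
my $qualified = My::NoExport::no_export();

print "$ok\n";
print "bare call failed: $err" if $err;
print "$qualified\n";
```

If a module really does not export its function, calling it by its fully
qualified name (Package::function) is one workaround, though it only works if
that is the name the module actually defines.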

I'd suggest using the DirTree.pl example script instead; it calls SWISH::Filter 
for you correctly.

Benoit Guguin scribbled on 8/19/05 4:45 AM:

> Hello,
> 
> I am trying to index a directory that contains only pdf, doc, xls and ppt
> files.
> 
> In version 2.5.4 I've seen some Perl scripts to filter .ppt, .xls and .doc.
> 
> I tried to use them with the prog method, but when I run swish-e
> ("swish-e -c /etc/swish-e/swish.conf -S prog") I get these errors:
> 
> Undefined subroutine &main::Doc2html called at /etc/swish-e/swish.pl 
> line 55.
> Or
> Undefined subroutine &main::pp2hml called at /etc/swish-e/swish.pl
> 
> The error depends on the order of the functions.
> 
> So I don't understand why it works fine for pdf but not for the other
> formats...
> 
> I've looked through the mailing list archive but haven't found my Holy Grail ;)
> 
> Any ideas, please?
> 
> Regards,
> 
> 
> My configuration files:
> 
> /etc/swish-e/swish.conf
> ----------------------------------------------------------------------------------------------------------------------------------
> WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-
> 
> IgnoreFirstChar .-
> IgnoreLastChar  .-
> 
> BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789
> EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789
> 
> #FollowSymLinks yes
> 
> IndexReport 3
> 
> IndexDir /etc/swish-e/swish.pl
> 
> IndexFile  /var/lib/swish/index.swish-e
> SwishProgParameters /format_ms/
> 
> IndexContents TXT .config
> IndexContents HTML .doc .xls .ppt .pdf
> UndefinedMetaTags auto
> -------------------------------------------------------------------------------------------------------------------------------------------
> 
> /etc/swish-e/swish.pl
> ------------------------------------------------------------------------------------------------------------------------------------------
> #!/usr/bin/perl -w
> use strict;
> use lib '../prog-bin';
> use lib '/usr/local/lib/swish-e/perl/';
> use lib '/usr/local/lib/swish-e/';
> 
> use File::Find;
> #use SWISH::Filter;
> #use SWISH::Filters::Pdf2HTML;
> #use SWISH::Filters::pp2html;
> #use SWISH::Filters::Doc2html;
> #use SWISH::Filters::XLtoHTML;
> use pdf2html;    
> use pp2html;       
> use XLtoHTML;
> use Doc2html; 
> 
> use constant DEBUG => 1;
> my $dir = shift || '.';
> 
> find(
>     {
>         wanted => \&wanted,
>         no_chdir => 1,
>     },
>     $dir,
> );
> 
> sub wanted {
>     return if -d;
>     if ( /\.pdf$/ ) {
>         print STDERR "Indexing pdf $File::Find::name\n" if DEBUG;
>         print ${ pdf2html ( $File::Find::name ) };
> 
>     } elsif ( /\.doc$/ ) {
>         print STDERR "Indexing doc $File::Find::name\n" if DEBUG;
>         print ${ Doc2html ($File::Find::name ) };
> 
>     } elsif ( /\.ppt$/ ) {
>         print STDERR "Indexing ppt $File::Find::name\n" if DEBUG;
>         print ${ pp2html ($File::Find::name ) };
> 
>     } elsif ( /\.xls$/ ) {
>         print STDERR "Indexing xls $File::Find::name\n" if DEBUG;
>         print ${ XLtoHTML ($File::Find::name ) };
> 
> 
>     } elsif ( /\.config$/ ) {
>         print STDERR "Indexing $File::Find::name\n" if DEBUG;
>         print ${ get_content( $File::Find::name ) };
> 
>     } else {
>         print STDERR "Skipping $File::Find::name\n" if DEBUG;
>     }
> }
> 
> sub get_content {
>     my $path = shift;
> 
>     my ( $size, $mtime )  = (stat $path )[7,9];
>     # Three-argument open with a lexical handle; closed when it goes out of scope.
>     open my $fh, '<', $path or die "$path: $!";
> 
>     my $content =  <<EOF;
> Content-Length: $size
> Last-Mtime: $mtime
> Path-Name: $path
> 
> EOF
>     local $/ = undef;
>     $content .= <$fh>;
>     return \$content;
> }
> 
> ---------------------------------------------------------------------------------------------------------------------------------------------
> 
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Aug 19 04:16:47 2005