Re: PowerPoint module for spider.pl

From: Alan Ivey <ai4891(at)not-real.yahoo.com>
Date: Wed Jul 07 2004 - 18:33:40 GMT
Even if I can get or make a PowerPoint filter or
module, I would eventually need to ask this question...

I currently have a fully-functional, and awesome,
SWISH-E configuration using the built-in spider (-S
http) and FileFilters. But I'd like to try and use
spider.pl with SWISH::Filter to minimize the CPU-load
on my feeble 500MHz P3 (512MB RAM). For testing
purposes, I put one of each file type on my localhost
root (.pdf, .doc, .ppt, .html, etc.) so I can see if
they can all be filtered. Also, I just copied
SwishSpiderConfig.pl and changed the server to my own,
so it looks like...

@servers = (
    # Localhost
    {
        skip        => 0,
        
        base_url    => 'http://localhost',
        same_hosts  => [ qw/127.0.0.1/ ],
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'alan@localhost',

        delay_sec   => 2,
        max_time    => 10,
        max_files   => 100,
        max_indexed => 20, 
        keep_alive  => 1,  
        filter_content  => \&filter_content,
    },
);    

To save everyone from scrolling, I haven't pasted the
rest of the file. I left it unchanged, assuming it's
relevant to SWISH::Filter. If you open the default
SwishSpiderConfig.pl in
$prefix/share/doc/swish-e/examples/prog-bin, see lines
167 to the end.

I've read the docs several times and searched the
mailing list, and I'm just not getting it. But like
I've said before on this list, I'm new to Linux, and
I don't know much Perl. So, my question is... do I
just have to put modules in the
/usr/local/lib/swish-e/perl/SWISH/Filters directory,
and then they'll automatically be used? Don't I have
to set the content type somewhere? Where that would go
in the sample file doesn't jump out at me as a newbie.

I wish I knew more Perl :( Tis frustrating.

I ran swish-filter-test and it seems there needs to be
more than just an existing module. The first time I
ran it, it said I needed MIME::Type and MIME::Types, so
I added those to a suitable Perl folder. Here are the
results of my .doc test, even with Doc2txt.pm being in
the SWISH Filters folder...

$ swish-filter-test -verbose AwardsCallandCriteria.doc
SWISH::Filter found at [/usr/local/lib/swish-e/perl/SWISH/Filter.pm]
>> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
 * Found program at: [/usr/bin/pdftotext]

Find path of [pdfinfo] in /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
 * Found program at: [/usr/bin/pdfinfo]

>> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
 * Found program at: [/usr/local/bin/catdoc]

>> Starting to process new document: application/x-msword
 ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x8d1bdc0)] for application/x-msword
 ++ application/x-msword was not filtered by SWISH::Filters::Pdf2HTML=HASH(0x8d1bdc0)

 ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)] for application/x-msword
 ++ application/x-msword was not filtered by SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)

Final Content type for AwardsCallandCriteria.doc is application/x-msword
  *No filters were used

Document AwardsCallandCriteria.doc was not filtered.
   Document:     AwardsCallandCriteria.doc  (AwardsCallandCriteria.doc)
   Content-Type: application/x-msword
   Parser type:

** /usr/local/bin/swish-filter-test:
  Skipping binary [AwardsCallandCriteria.doc]
$

For what it's worth, here's one of a PDF...

>> Starting to process new document: application/pdf
 ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x8742db8)] for application/pdf
 ++ application/pdf *WAS* filtered by SWISH::Filters::Pdf2HTML=HASH(0x8742db8)

Final Content type for businessunit.pdf is text/html
  >Filter SWISH::Filters::Pdf2HTML=HASH(0x8742db8) converted from [application/pdf] to [text/html]

Document businessunit.pdf was  filtered.
   Document:     businessunit.pdf  (businessunit.pdf)
   Content-Type: text/html
   Parser type:  HTML*

   >Filter used: SWISH::Filters::Pdf2HTML=HASH(0x8742db8) ( application/pdf -> text/html )
-- Output Content Sample --
<html>
<head>
<meta name="author" content="SAIC">
<meta name="creationdate" content="Tue Jun 29 14:16:31 2004">
<meta name="creator" content="Impress">
<meta name="encrypted" content="no">
<meta name="file_size" content="396842 bytes">
<meta name="optimized" content="no">
<meta name="page_size" content="720 x 540 pts">
<meta name="pages" content="16">
<meta name="pdf_version" content="1.4">

-- end --
$

What am I missing? I've read this stuff over and over
again but my lack of experience is keeping me from
grasping this. Sorry... as always, thanks for the
help!

--- Alan Ivey <ai4891@yahoo.com> wrote:
> This has probably been asked a dozen times, but where
> can I find a SWISH module for PowerPoints to use with
> spider.pl? I searched the mailing list
> (http://www.swish-e.org/Discussion/) and the closest
> thing I found was an Excel to HTML module. I would
> prefer PPT to TXT, but to HTML would work just fine
> for my purposes. I'm not proficient enough with Perl
> to write my own module, but I was able to find a
> FileFilter using a bash script...
> ppthtml $1 | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'
>
> Anyway, I don't know how to make that happen in Perl,
> much less make it into a module. Any help would be
> greatly appreciated! I think once I get this I'll be
> able to go live with SWISH-E!! :)

Received on Wed Jul 7 11:33:53 2004