Re: Made a filter for powerpoint (ppt), included. Have

From: Peter Karman <peter(at)>
Date: Thu Feb 03 2005 - 01:05:16 GMT

Good doc is always appreciated. Especially examples.

I think this is the best place to post contributions that might end up in the 
distribution: doc, code, etc. The wiki is a project for conversation, project 
ideas, etc., but not a very good place to send code. Besides, some of us aren't 
in the daily habit of checking it as often as our email. :)


Randy wrote on 2/2/05 6:24 PM:

> Thank you very much for the fast and helpful reply.  You gave me
> plenty to work with, and clarified many issues for me with very few
> words.  I appreciate the time you've saved me in experimentation. 
> While you're free to do as you will with the filter I attached, I will
> post an improved version (at least with the title fix) to this list by
> the weekend if all goes well.
> I assume that's the preferred way to submit stuff, but I notice you
> have a wiki now.  Would you prefer I post it there when done?  Also, I
> think your documentation is excellent, especially after studying it a
> while, but for some reason it's hard for me to grasp in a short time. 
> I think an overview/intro summarizing how the program works (which I
> think I get now) would be very helpful and I would like to volunteer a
> contribution or two toward this end.  Would you prefer doc submissions
> via the mailing list, the wiki, or otherwise?
> Best regards,
> Randy
> On Wed, 2 Feb 2005 09:23:09 -0800, Bill Moseley <> wrote:
>>On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
>>>One thing I was missing was a ppt filter; I saw a lot of requests for
>>>such a filter in the archive, but no working code.  It wasn't hard to
>>>make a basic working one, here it is (just put it in your Filters
>>>directory, and make sure the ppthtml executable is in your path)
>>Cool.  I'll add that to the distribution.
>>>I have not yet figured out how to pass a more useful title back to
>>>  The code above generates doc titles like "/tmp/foo1234"
>>>where I'd like to have the actual name of the .ppt file instead.  I'm
>>>still reading all the docs, so I'm sure I'll get to the answer
>>>eventually, but if anyone wants to give me a hint I won't mind :)
>>What about something like (not tried or tested):
>>  $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
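
A minimal sketch of that title substitution as a standalone helper, for anyone who wants to try it outside a filter. The name set_title and the sample strings are made up for illustration; in a real filter the new title would presumably come from $doc->name, as in the line above. Using s{}{} delimiters means the slash in </title> needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swish-e): replace the first <title>
# element in the filtered HTML with the given name.
sub set_title {
    my ($content_ref, $name) = @_;

    # Escape HTML metacharacters so a file name cannot break the markup.
    for ($name) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }

    # s/// returns the number of substitutions made (false if none).
    my $count = ($$content_ref =~ s{<title>[^<]*</title>}{<title>$name</title>}i);
    return $count ? 1 : 0;
}

# Sample data only: a temp-file title like the one described above.
my $html = "<html><head><title>/tmp/foo1234</title></head><body>slides</body></html>";
set_title(\$html, "quarterly.ppt");
print "$html\n";
```

The helper returns 0 when no <title> element is found, so a caller can decide whether to add one instead.
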
>>>Another small item I miss from my htdig setup is automatic indexing
>>>inside .zip, .Z, .gz, .tar archives.  I'm not really sure how to chain
>>>the filters so that, after unzipping an archive, the ppt, doc, xls,
>>>html, txt, etc. files inside will be passed to the appropriate filter.
>>> Does this recursion happen automatically, or do I have to specify it
>>>in my config?
>>There are two different things happening there.  One is encoding and one
>>is the file format (mime type).
>>For encoding, for example, sets:
>> $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;
>>telling the server it will accept that encoding and the spider will then
>>automatically uncompress the document.  Its content type will still
>>be the uncompressed content type sent by the server.
>>Now, if the file is a compressed mime-type then you could
>>use a filter.  You can set a filter to run early in the sort order of
>>filters and then after filtering you set a flag saying that filtering
>>should continue -- as in a chain of filters.
>>.zip or .tar is another matter, as it can also be a collection of files.
>>Again, I think that would be better dealt with inside (or
>>whatever calls the filter code).  You would need to unpack all the
>>files and then one-by-one set the content-type and then process.
>>>Would it be possible to use FileFilter directives (even though I'm
>>>using prog / )?  Something like:
>>>FileFilter .gz gzip "-c '%p'"
>>>FileFilter .zip unzip "-p '%p'"
>>>etc. for all compression/archive types?
>>No, not really.  Maybe for .gz if just a single file is compressed.
>>>Will the files inside each archive be passed along to the next
>>>appropriate filter?  How about (unfortunate cases) where there's a .gz
>>>or .tar file inside a .zip file?  I'd like to dig as deep as possible.
>>in there's a filter content callback.  What I'd do is a
>>recursive uncompression (decompression?) into temporary directories and for each one
>>set the content-type and then call spider's output_content function.
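
A rough sketch of the recursive unpacking described above, using only core Perl modules. Everything here is an assumption for illustration: unpack_recursive and the handler callback are hypothetical names, not swish-e code, and only .gz and .tar are handled (.zip and .Z are left out). In practice the handler is where you would set the content-type and hand the file to the spider's output function.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);
use File::Spec;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Archive::Tar;

# Hypothetical helper: unpack $path into temporary directories, recursing
# into nested archives, and call $handler->($file) for each plain file.
sub unpack_recursive {
    my ($path, $handler, $depth) = @_;
    $depth //= 0;
    die "archives nested too deeply: $path\n" if $depth > 10;

    if ($path =~ /\.gz$/i) {
        # Single compressed file: strip the suffix and look again.
        my $dir = tempdir(CLEANUP => 1);
        (my $name = basename($path)) =~ s/\.gz$//i;
        my $out = File::Spec->catfile($dir, $name);
        gunzip($path => $out) or die "gunzip failed: $GunzipError\n";
        unpack_recursive($out, $handler, $depth + 1);
    }
    elsif ($path =~ /\.tar$/i) {
        # Collection of files: write each member out and recurse on it.
        my $dir = tempdir(CLEANUP => 1);
        my $tar = Archive::Tar->new($path) or die "cannot read tar: $path\n";
        for my $member ($tar->get_files) {
            next unless $member->is_file;
            my $out = File::Spec->catfile($dir, basename($member->name));
            my $data = $member->get_content;
            $data = '' unless defined $data;
            open my $fh, '>:raw', $out or die "cannot write $out: $!\n";
            print {$fh} $data;
            close $fh or die "cannot close $out: $!\n";
            unpack_recursive($out, $handler, $depth + 1);
        }
    }
    else {
        # Not an archive we know about: pass it to the caller.
        $handler->($path);
    }
}

# Usage: perl unpack.pl some-archive.tar.gz
unpack_recursive($ARGV[0], sub { print "would index: $_[0]\n" }) if @ARGV;
```

The depth limit guards against pathological nesting. The temporary directories are removed when the program exits, so the handler has to finish with each file before then.
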
>>But what if one of the compressed files is .html.  Would you want to
>>search it for links to follow?  ;)
>>BTW -- I've been planning on rewriting for quite a while.  I
>>want to make the spider a class so that instead of having call-back
>>functions you would sub-class the spider to override its methods.
>>Bill Moseley

Peter Karman  .  .  peter(at)
"One of the best things to come out of the home computer revolution
could be the general and widespread understanding of how severely limited logic 
really is."
- Frank Herbert (1920-1986, American Writer)
Received on Wed Feb 2 17:05:17 2005