Re: Made a filter for powerpoint (ppt), included. Have

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Thu Feb 03 2005 - 01:05:16 GMT
Randy-

Good doc is always appreciated. Especially examples.

I think this is the best place to post contributions that might end up in the 
distribution: doc, code, etc. The wiki is a project for conversation, project 
ideas, etc., but not a very good place to send code. Besides, some of us aren't 
in the daily habit of checking it as often as our email. :)

pek

Randy wrote on 2/2/05 6:24 PM:

> Thank you very much for the fast and helpful reply.  You gave me
> plenty to work with, and clarified many issues for me with very few
> words.  I appreciate the time you've saved me in experimentation. 
> While you're free to do as you will with the filter I attached, I will
> post an improved version (at least with the title fix) to this list by
> the weekend if all goes well.
> 
> I assume that's the preferred way to submit stuff, but I notice you
> have a wiki now.  Would you prefer I post it there when done?  Also, I
> think your documentation is excellent, especially after studying it a
> while, but for some reason it's hard for me to grasp in a short time. 
> I think an overview/intro summarizing how the program works (which I
> think I get now) would be very helpful and I would like to volunteer a
> contribution or two toward this end.  Would you prefer doc submissions
> via the mailing list, the wiki, or otherwise?
> 
> Best regards,
> Randy
> 
> 
> On Wed, 2 Feb 2005 09:23:09 -0800, Bill Moseley <moseley@hank.org> wrote:
> 
>>On Wed, Feb 02, 2005 at 08:35:58AM -0800, Randy wrote:
>>
>>>One thing I was missing was a ppt filter; I saw a lot of requests for
>>>such a filter in the archive, but no working code.  It wasn't hard to
>>>make a basic working one, here it is (just put it in your Filters
>>>directory, and make sure the ppthtml executable is in your path)
>>
>>Cool.  I'll add that to the distribution.
>>
>>
>>>I have not yet figured out how to pass a more useful title back to
>>>Filter.pm.  The code above generates doc titles like "/tmp/foo1234"
>>>where I'd like to have the actual name of the .ppt file instead.  I'm
>>>still reading all the docs, so I'm sure I'll get to the answer
>>>eventually, but if anyone wants to give me a hint I won't mind :)
>>
>>What about something like (not tried or tested):
>>
>>  $$content =~ s{<title>[^<]+</title>}{'<title>' . $doc->name . '</title>'}e;
>>
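A minimal, self-contained sketch of the title substitution suggested above, written with alternate delimiters so the slash in </title> does not end the pattern. My::StubDoc and retitle() are hypothetical names used only for illustration; the stub models nothing but the name() accessor mentioned in the suggestion and is not the real filter document class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the document object a filter receives;
# only the name() accessor used in the suggestion is modelled.
{
    package My::StubDoc;
    sub new  { my ($class, %args) = @_; return bless {%args}, $class }
    sub name { return $_[0]{name} }
}

# Replace the contents of the first <title> element with the document name.
sub retitle {
    my ($content_ref, $doc) = @_;
    my $name = $doc->name;

    # Escape the characters that are special in HTML text.
    $name =~ s/&/&amp;/g;
    $name =~ s/</&lt;/g;
    $name =~ s/>/&gt;/g;

    # s{...}{...} avoids clashing with the "/" in "</title>".
    $$content_ref =~ s{<title>[^<]*</title>}{<title>$name</title>}i;
    return $content_ref;
}

my $html = '<html><head><title>/tmp/foo1234</title></head>'
         . '<body>Slide 1</body></html>';
my $doc  = My::StubDoc->new( name => 'talks/intro.ppt' );

retitle( \$html, $doc );
print "$html\n";
```

Interpolating an already-escaped name keeps the replacement free of the /e modifier, which is one way to avoid evaluating code inside the substitution.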
>>
>>>Another small item I miss from my htdig setup is automatic indexing
>>>inside .zip, .Z, .gz, .tar archives.  I'm not really sure how to chain
>>>the filters so that, after unzipping an archive, the ppt, doc, xls,
>>>html, txt, etc. files inside will be passed to the appropriate filter.
>>> Does this recursion happen automatically, or do I have to specify it
>>>in my config?
>>
>>There are two different things happening there.  One is encoding and one
>>is the file format (mime type).
>>
>>For encoding, spider.pl sets, for example:
>>
>> $request->header('Accept-encoding', 'gzip; deflate') if $can_uncompress;
>>
>>telling the server it will accept that encoding and the spider will then
>>automatically uncompress the document.  Its content type will still
>>be the uncompressed content type sent by the server.
>>
>>Now, if the file is a compressed mime-type then you could
>>use a filter.  You can set a filter to run early in the sort order of
>>filters and then after filtering you set a flag saying that filtering
>>should continue -- as in a chain of filters.
>>
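The idea of a filter that runs early and then flags that filtering should continue can be sketched in plain Perl as follows. This is an illustration only: the filter table, the "GZ:" prefix standing in for compressed data, and run_chain() are all hypothetical and do not model the actual SWISH::Filter API.

```perl
use strict;
use warnings;

# Hypothetical filter table. Each filter claims one content type and
# returns the new content, the new content type, and a "continue" flag.
my @filters = (
    {
        name   => 'gunzip',
        type   => 'application/x-gzip',
        filter => sub {
            my ($content) = @_;
            # Stand-in for real decompression: strip a fake "GZ:" prefix.
            ( my $out = $content ) =~ s/^GZ://;
            return ( $out, 'text/html', 1 );     # 1 = keep filtering
        },
    },
    {
        name   => 'html2text',
        type   => 'text/html',
        filter => sub {
            my ($content) = @_;
            ( my $out = $content ) =~ s/<[^>]+>//g;
            return ( $out, 'text/plain', 0 );    # 0 = stop here
        },
    },
);

# Walk the filters in list order (the "sort order") and apply each one
# whose type matches the current content type.
sub run_chain {
    my ($content, $type) = @_;
    my @applied;
    for my $f (@filters) {
        next unless $f->{type} eq $type;
        my $continue;
        ( $content, $type, $continue ) = $f->{filter}->($content);
        push @applied, $f->{name};
        last unless $continue;
    }
    return ( $content, $type, \@applied );
}

my ( $text, $final_type, $applied ) =
    run_chain( 'GZ:<p>hello</p>', 'application/x-gzip' );
print "$final_type: $text (via @$applied)\n";
```

Because the decompressing filter sits first and resets the content type, the later filter sees ordinary HTML and needs no knowledge of the compression step.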
>>.zip or .tar is another matter, as it can also be a collection of files.
>>Again, I think that would be better dealt with inside spider.pl (or
>>whatever calls the filter code).  You would need to unpack all the
>>files and then one-by-one set the content-type and then process them.
>>
>>
>>>Would it be possible to use FileFilter directives (even though I'm
>>>using prog / spider.pl )?  Something like:
>>>
>>>FileFilter .gz gzip "-c '%p'"
>>>FileFilter .zip unzip "-p '%p'"
>>>etc. for all compression/archive types?
>>
>>No, not really.  Maybe for .gz if just a single file is compressed.
>>
>>
>>>Will the files inside each archive be passed along to the next
>>>appropriate filter?  How about (unfortunate cases) where there's a .gz
>>>or .tar file inside a .zip file?  I'd like to dig as deep as possible.
>>
>>In spider.pl there's a filter content callback.  What I'd do is a
>>recursive uncompression (decompression?) into temporary directories
>>and for each file set the content-type and then call spider's
>>output_content function.
>>
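A rough sketch of the recursive unpacking described above, using only modules that ship with Perl. It handles .gz, .tgz and .tar; .zip and .Z are left out because they would need a non-core module or an external program. unpack_recursive(), the %type_for map, and the $emit callback are hypothetical names; $emit merely stands in for whatever hands the file to the indexer and does not call the spider's real code.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical extension-to-type map; a real setup would use a MIME library.
my %type_for = ( html => 'text/html', txt => 'text/plain' );

# Unpack $path as deep as possible, calling $emit->($file, $content_type)
# for every plain file that is found along the way.
sub unpack_recursive {
    my ($path, $emit) = @_;

    if ( $path =~ /\.(?:gz|tgz)$/i ) {
        my $dir = tempdir( CLEANUP => 1 );
        ( my $inner = basename($path) ) =~ s/\.gz$//i;
        $inner =~ s/\.tgz$/.tar/i;
        my $out = File::Spec->catfile( $dir, $inner );
        gunzip( $path => $out ) or die "gunzip failed: $GunzipError";
        return unpack_recursive( $out, $emit );
    }

    if ( $path =~ /\.tar$/i ) {
        my $dir = tempdir( CLEANUP => 1 );
        my $tar = Archive::Tar->new($path) or die Archive::Tar->error;
        my $n   = 0;
        for my $member ( $tar->get_files ) {
            next unless $member->is_file;
            # Flatten member paths so nothing escapes the temp directory.
            my $flat = $n++ . '-' . basename( $member->full_path );
            my $out  = File::Spec->catfile( $dir, $flat );
            open my $fh, '>:raw', $out or die "open $out: $!";
            print {$fh} $member->get_content // '';
            close $fh or die "close $out: $!";
            unpack_recursive( $out, $emit );
        }
        return;
    }

    # Not an archive: guess a content type from the extension and emit it.
    my ($ext) = $path =~ /\.([^.\/]+)$/;
    $emit->( $path, $type_for{ lc( $ext // '' ) } // 'application/octet-stream' );
    return;
}

# Usage: build a small .tar.gz on the fly and list what comes out.
my $work = tempdir( CLEANUP => 1 );
my $tar  = Archive::Tar->new;
$tar->add_data( 'a.html',      '<p>hi</p>' );
$tar->add_data( 'notes/b.txt', 'hello' );
my $tar_path = File::Spec->catfile( $work, 'bundle.tar' );
$tar->write($tar_path) or die $tar->error;
gzip( $tar_path => "$tar_path.gz" ) or die "gzip failed: $GzipError";

my @seen;
unpack_recursive( "$tar_path.gz",
    sub { push @seen, $_[1]; print "$_[1]  $_[0]\n" } );
```

The temporary directories are removed at program exit, so the callback must finish with each file before then. Whether unpacked HTML files should also be scanned for links is left open here, as it is in the discussion.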
>>But what if one of the compressed files is .html?  Would you want to
>>search it for links to follow?  ;)
>>
>>BTW -- I've been planning on rewriting spider.pl for quite a while.  I
>>want to make the spider a class so that instead of having call-back
>>functions you would sub-class the spider to override its methods.
>>
>>--
>>Bill Moseley
>>moseley@hank.org

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
"One of the best things to come out of the home computer revolution
could be the general and widespread understanding of how severely limited logic 
really is."
- Frank Herbert (1920-1986, American Writer)
Received on Wed Feb 2 17:05:17 2005