Re: Excel Parser with spider.pl

From: <moseley(at)not-real.hank.org>
Date: Wed Aug 27 2003 - 14:03:32 GMT
On Wed, Aug 27, 2003 at 05:09:13AM -0700, Bucharow Leonard wrote:

> I might be stupid or lazy, or just don't have much Perl experience yet,
> but I cannot get SWISH::Filters::XLtoHTML to run in SwishSpiderConfig.pl.

You didn't show how you were trying, but SWISH::Filters::* filters can't 
be used separately without the SWISH::Filter framework.

So if you are mentioning SWISH::Filters::XLtoHTML anywhere in your
spider config it's not going to work.

The idea is that you use only SWISH::Filter in the spider config.  You
create a SWISH::Filter object (by calling new) and then, for each
document, call filter() on that object, passing it a reference to your
document and its content type.  SWISH::Filter then passes that data to
each filter in order until some filter (perhaps XLtoHTML) says "Hey! I
can filter that" and returns the filtered content to SWISH::Filter,
which hands it back to your program (spider.pl in this case).
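
As a rough illustration of that hand-off, here is a minimal,
self-contained Perl sketch.  It does not use SWISH::Filter itself: the
@filters table, the run_filters() helper, and the placeholder
"conversions" are all made up for the example.  It only shows the idea
of offering a document to each filter in order until one accepts it.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for real filters.  Each entry has a test on the
# content type and a code ref that returns the converted text.
my @filters = (
    {
        name    => 'pdf-stand-in',
        accepts => sub { $_[0] eq 'application/pdf' },
        convert => sub { return "<html><body>pdf text</body></html>" },
    },
    {
        name    => 'excel-stand-in',
        accepts => sub { $_[0] eq 'application/vnd.ms-excel' },
        convert => sub {
            my ($doc_ref) = @_;
            # Placeholder conversion: wrap the raw content in HTML.
            return "<html><body><pre>${$doc_ref}</pre></body></html>";
        },
    },
);

# Offer the document to each filter in order; the first one that accepts
# the content type does the work.  Returns (filter name, new content),
# or an empty list if no filter claimed the document.
sub run_filters {
    my ( $doc_ref, $content_type ) = @_;
    for my $f (@filters) {
        next unless $f->{accepts}->($content_type);
        return ( $f->{name}, $f->{convert}->($doc_ref) );
    }
    return;
}

my $doc = "cell A1\tcell B1";
my ( $used, $html ) = run_filters( \$doc, 'application/vnd.ms-excel' );
print "filtered by $used\n" if defined $used;
```

The calling code never names a specific filter; it only supplies the
document and its content type, which is why mentioning an individual
filter module in the spider config is not needed.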


> Can you please send me your SpiderConfig.pl file to compare, how you use
> XLtoHTML filter?

There's an example in the included SwishSpiderConfig.pl file of using 
SWISH::Filter.


> What do you mean, that I would want to spider PHP instead of using the
> -S fs method? How can I spider server-generated sites? I mean that
> several files on the web server are reachable dynamically, but not
> statically.

In general, if you are indexing a web site that is anything more than 
just static pages you will probably want to spider.

Dynamically created sites are created, well, dynamically by the web 
server (and related tools such as PHP) when the page is requested.  
So to get the document like you would expect the web server to display 
it you need to fetch it from the web server, not from the file system.

/var/www/index.php might not look anything like
http://localhost/index.php.  /var/www/index.php might not have all the
content included that you want to index, as the served page might
include text from many sources -- but still text you want indexed as
coming from "index.php", since that is where the text shows up when
viewing the page through a web server.
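
A toy illustration of that gap, with no real PHP or web server
involved: the %files table, the [[include ...]] marker, and render()
are invented for the example.  The point is only that the text on disk
and the text the server would send can differ, and the indexer should
see the latter.

```perl
use strict;
use warnings;

# Pretend file system: what reading the files directly would give you.
my %files = (
    '/var/www/index.php' => "Welcome. [[include /var/www/news.inc]]",
    '/var/www/news.inc'  => "Swish-e indexes spreadsheets.",
);

# Pretend web server: expands the made-up [[include PATH]] marker, the
# way a server-side language pulls in text from other sources.
sub render {
    my ($path) = @_;
    my $text = $files{$path};
    $text =~ s/\[\[include ([^\]]+)\]\]/$files{$1}/g;
    return $text;
}

print "on disk: ", $files{'/var/www/index.php'}, "\n";
print "served:  ", render('/var/www/index.php'), "\n";
```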

If what you mean by "dynamic" is "user specific" then you probably still 
need to spider, but need to somehow filter out user specific data.

Another example is with rewrites -- you might fetch 
http://localhost/foo.html but that really reads bar.html due to rewrites 
in the web server's config.

Or, http://localhost/foo/user/list may be running a CGI/PHP script called 
"foo" that receives /user/list as extra path information (PATH_INFO).
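
One way a server could arrive at that split, sketched in plain Perl.
The %scripts table and split_request() are hypothetical; real servers
decide this from their own configuration.  The leftover part of the
path is what CGI calls extra path information.

```perl
use strict;
use warnings;

# Hypothetical table of URL paths that map to executable scripts.
my %scripts = ( '/foo' => 1 );

# Walk the URL path from left to right; the first prefix that names a
# script is the script, and whatever follows is handed to it as extra
# path information.
sub split_request {
    my ($path) = @_;
    my @parts  = grep { length } split m{/}, $path;
    my $prefix = '';
    while (@parts) {
        $prefix .= '/' . shift(@parts);
        if ( $scripts{$prefix} ) {
            my $rest = @parts ? '/' . join( '/', @parts ) : '';
            return ( $prefix, $rest );
        }
    }
    return;    # no script matched; treat it as an ordinary request
}

my ( $script, $path_info ) = split_request('/foo/user/list');
print "script=$script path_info=$path_info\n";
```

Nothing on the file system is named "list" here, so only a request
through the web server reaches that content.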


-- 
Bill Moseley
moseley@hank.org
Received on Wed Aug 27 14:05:09 2003