Re: I am trying to index only <div id="content">

From: Matthew Slocum <Mslocum(at)not-real.bju.edu>
Date: Mon Mar 08 2004 - 21:44:24 GMT
>I'd use -S prog and use either HTML::Parser or HTML::TreeBuilder to
>extract out that content.
I already use -S when I run the spider.  Where do I put the regular expression or the HTML::Parser code?

Matt Slocum

>>> Bill Moseley <moseley@hank.org> 03/08/04 04:35PM >>>
On Mon, Mar 08, 2004 at 01:15:44PM -0800, Matthew Slocum wrote:
> I am trying to index only <div id="content">
> I think it is giving me all the div tags.
> 
> in swish.conf:
> StoreDescription HTML "<div id=\"content\">"

No, that won't work, sorry.

I'd use -S prog and use either HTML::Parser or HTML::TreeBuilder to
extract out that content.
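
As a rough illustration of the parser-based approach, here is a minimal
sketch. It is written in Python with the standard html.parser module as a
stand-in for the Perl modules named above, so it is not the HTML::Parser or
HTML::TreeBuilder API. The names ContentDivExtractor and extract_content are
made up for this example, and it assumes the page has a single
<div id="content"> whose text is what should be indexed.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Collect the text inside one <div id="...">, nested divs included."""

    def __init__(self, target_id="content"):
        super().__init__()
        self.target_id = target_id  # hypothetical default, taken from the thread
        self.depth = 0              # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1         # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1          # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1         # reaches 0 when the target div closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html, target_id="content"):
    """Return the whitespace-normalized text of the target div ('' if absent)."""
    parser = ContentDivExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Counting div depth is what lets the sketch survive nested divs: text is
collected until the div that opened the content block is closed, not merely
until the first closing div tag.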

You might be able to use a regular expression to extract the content,
although using regular expressions to parse HTML can be hard.  But that
would be much faster than HTML::Parser or HTML::TreeBuilder.
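
To show both the appeal and the difficulty of the regular-expression route,
here is a small sketch in Python (used only for illustration, it is not Perl
and not part of swish-e). The pattern CONTENT_DIV and the function
extract_content_regex are hypothetical names. The sketch assumes the content
div contains no nested div elements, which is exactly where this kind of
pattern goes wrong.

```python
import re

# Matches <div ... id="content" ...> and lazily captures up to the FIRST </div>.
# Assumption: the id attribute is quoted and the div has no nested divs.
CONTENT_DIV = re.compile(
    r'<div\b[^>]*\bid\s*=\s*["\']content["\'][^>]*>(.*?)</div\s*>',
    re.IGNORECASE | re.DOTALL,
)


def extract_content_regex(html):
    """Return the raw HTML inside the content div, or None if it is not found."""
    match = CONTENT_DIV.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    flat = '<div id="nav">x</div><div id="content"><p>Hi</p></div>'
    nested = '<div id="content">a<div>b</div>c</div>'
    print(extract_content_regex(flat))    # the whole content div body
    print(extract_content_regex(nested))  # cut short at the inner </div>
```

On the flat input the pattern returns the full body of the div. On the nested
input it stops at the inner closing tag and silently drops the rest, which is
the usual reason to fall back to a real parser when the markup is not under
your control.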

-- 
Bill Moseley
moseley@hank.org 
Received on Mon Mar 8 13:44:24 2004