Re: Searching only a specific div class

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Mar 12 2004 - 21:20:04 GMT
On Fri, Mar 12, 2004 at 10:58:50AM -0800, Thomas Sewell wrote:
> UndefinedMetaTags ignore
> XMLClassAttributes class # Not supported by the HTML parser?

Yes, that's a feature of the XML parser.  The HTML and XML parsers are
really the same code, but there's a check that skips the part dealing
with XML attributes when parsing HTML.  It might be possible to modify
parser.c to make it work with HTML, too, although normal HTML has a lot
of attributes.

I think libxml2 is more forgiving when parsing HTML, for one thing.  But
I'm not really clear on the differences in the parsers internal to
libxml2.

Now the other problem is that UndefinedMetaTags ignore is a bit too
aggressive.  It ignores everything until the closing tag, even if you
have a defined tag in between.  That behavior is questionable.

My suggestion is to use a program to extract the data you want
indexed.
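
As a rough sketch of that approach (this is not swish-e code, and the
names ClassTextExtractor and extract_class_text are made up for
illustration), a small Python script built on the standard library's
html.parser could pull out the text of every div with a given class.
How the extracted text then gets handed to swish-e is left out here.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given class.

    Hypothetical helper for illustration; not part of swish-e.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # div nesting depth inside a matching div; 0 = outside
        self.chunks = []    # text pieces of the div currently being collected
        self.results = []   # one whitespace-normalized string per matching div

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <DIV> and <div> look the same.
        if tag != "div":
            return
        if self.depth:
            # Nested div inside a matching div: keep track of the depth.
            self.depth += 1
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Matching div closed: store its text with whitespace collapsed.
                self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_class_text(html, wanted_class):
    """Return a list with the text of each <div class="wanted_class">."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_class_text(page, "product-authors") on a page gives one
string per matching div, which could then be written out in whatever
form you want to index.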

Anyway, here's your example:


moseley@bumby:~$ cat c
ParserWarnlevel 9
DefaultContents XML2

#UndefinedMetaTags ignore
XMLClassAttributes class # Not supported by the HTML parser?
MetaNames swishtitle swishdocpath swishdescription div.product-authors



moseley@bumby:~$ cat t.html
<html>
<DIV class="content">
<div class="product-details">
<div class="product-authors">
John Doe
</div>
</div>
<div class="product-details">
<div class="product-authors">
Jane Doe
</div>
</div>
</div>
</html>

moseley@bumby:~$ swish-e -c c -i t.html -T parsed_tags indexed_words -v0
    Adding:[1:swishdocpath(11)]   't'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'html'   Pos:2  Stuct:0x1 ( FILE )
<html> (undefined meta name - no action)
<div> (undefined meta name - no action)
<div.content> (undefined meta name - no action)
<div> (undefined meta name - no action)
<div.product-details> (undefined meta name - no action)
<div> (undefined meta name - no action)
<div> (meta [div.product-authors])
    Adding:[1:div.product-authors(13)]   'john'   Pos:8  Stuct:0x1 ( FILE )
    Adding:[1:div.product-authors(13)]   'doe'   Pos:9  Stuct:0x1 ( FILE )
</div> (meta)
<div> (undefined meta name - no action)
<div.product-details> (undefined meta name - no action)
<div> (undefined meta name - no action)
<div> (meta [div.product-authors])
    Adding:[1:div.product-authors(13)]   'jane'   Pos:16  Stuct:0x1 ( FILE )
    Adding:[1:div.product-authors(13)]   'doe'   Pos:17  Stuct:0x1 ( FILE )
</div> (meta)
t.html:13: error: Opening and ending tag mismatch: DIV line 0 and div
</div>
      ^

You could then search like:

   moseley@bumby:~$ swish-e -w 'div.product-authors=jane' -H0
   1000 t.html "t.html" 210

-- 
Bill Moseley
moseley@hank.org
Received on Fri Mar 12 13:20:05 2004