Re: Searching only a specific div class

From: Peter Karman <karman(at)>
Date: Fri Mar 12 2004 - 20:54:14 GMT
Thomas Sewell supposedly wrote on 3/12/04 1:39 PM:

> I added the following to my config:
> PropertyNamesIgnoreCase swishtitle swishdocpath swishdescription div.product-authors

I think swishtitle already ignores case by default. swishdocpath should 
probably RESPECT case, since you might have file foo.html and FOO.html 
and they are different files. So I would not include those two if it 
were me.

> I also tried adding:
> DefaultContents XML*

It wasn't clear to me from your post whether you are using the XML or XML2 
parser, since I don't know whether you compiled swish-e with libxml2.

XML* doesn't tell me which parser is at work.

> Perhaps I just need to figure out the correct way to convince swish-e
> to select the right data from the page to fill in the
> div.product-authors property?

That is more likely. I use the -S prog option and a filter to add meta 
data to my HTML, and then use the Properties feature that way.
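
A minimal sketch of that kind of filter, in Python, under some loud 
assumptions: the meta name "authors" is made up for illustration, the 
script name and the way file paths reach it (command-line arguments, 
e.g. via SwishProgParameters) are hypothetical, and the output follows 
the Path-Name / Content-Length header convention that -S prog programs 
use. It is not the filter I actually run. It copies the text of each 
<div class="product-authors"> into a <meta> tag in the <head>, where 
the HTML parser can pick it up.

```python
import html
import os
import re
import sys
from html.parser import HTMLParser

META_NAME = "authors"             # hypothetical meta name; must match the config
TARGET_CLASS = "product-authors"  # class used in the example page


class _AuthorDivs(HTMLParser):
    """Collect the text inside every <div class="product-authors">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching div
        self.authors = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":    # HTMLParser lowercases tag names
            return
        if self.depth:
            self.depth += 1  # nested div inside a matching one
        elif TARGET_CLASS in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self._buf).split())
                if text:
                    self.authors.append(text)

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract_authors(page):
    parser = _AuthorDivs()
    parser.feed(page)
    parser.close()
    return parser.authors


def add_author_meta(page):
    """Return the page with one <meta> tag per author added to <head>."""
    metas = "".join(
        '<meta name="%s" content="%s">\n' % (META_NAME, html.escape(a, quote=True))
        for a in extract_authors(page)
    )
    if not metas:
        return page
    match = re.search(r"</head\s*>", page, re.IGNORECASE)
    if match:
        return page[:match.start()] + metas + page[match.start():]
    return "<head>\n" + metas + "</head>\n" + page  # page had no <head>


def swish_record(path, page, mtime):
    """Build one document record: headers, a blank line, then the content."""
    body = page.encode("utf-8")
    header = "Path-Name: %s\nContent-Length: %d\nLast-Mtime: %d\n\n" % (
        path, len(body), int(mtime))
    return header.encode("utf-8") + body


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            page = add_author_meta(fh.read())
        sys.stdout.buffer.write(swish_record(path, page, os.path.getmtime(path)))
```

With something like this in place, the config would declare the same 
name under both MetaNames and PropertyNames ("authors" here, again only 
an illustrative name), and the index would be built with -S prog 
pointing at the script.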

My experience is that any time I want to garner more information than the 
basic text in an HTML doc (like what the text might MEAN, for example), 
I use the -S option, since I can then control exactly what swish-e 


> Thanks,
> Thomas
> -----Original Message-----
> From: Peter Karman
> Sent: Friday, March 12, 2004 2:22 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Searching only a specific div class
>
> this seems related to an earlier post this week. make sure you
> declare div.product-authors as a PropertyName as well as a MetaName.
>
> However, I still don't think that's going to help you. I'm not sure
> that the HTML parser (even when using libxml2, so the HTML2 parser)
> is smart enough to recognize tags in the <body>. I think it only
> works with <meta> tags in the <head>.
>
> You might have better luck using the XML2 parser in your config,
> which should treat the tags as XML instead of HTML, and thus
> recognize your special tagset.
>
> But Bill will probably give you a better answer than this.
>
> pek
>
> Thomas Sewell supposedly wrote on 3/12/04 1:00 PM:
>> I have a site that is structured in html with multiple items per
>> page, with sets of information about each item delimited by div
>> tags with a descriptive class attribute.
>>
>> Shortened Example:
>>
>> <DIV class="content">
>>   <div class="product-details">
>>     <div class="product-authors"> John Doe </div>
>>   </div>
>>   <div class="product-details">
>>     <div class="product-authors"> Jane Doe </div>
>>   </div>
>> </div>
>>
>> Currently I am just indexing the full text of the page and the
>> default swish properties for each page. The source is html, so I
>> assume it's defaulting to use the HTML parser.
>> I would like to make a search available to search just the contents
>> of the "Author" div's, for example.
>> I've been trying to define and use a property for the Author class,
>> but without success.
>> I think I need to use some combination of metanames in the index
>> config file and in the search cgi, but I've been unable to figure
>> out the exact format to use.
>> I assume it's going to be something along the lines of:
>>
>> UndefinedMetaTags ignore
>> XMLClassAttributes class # Not supported by the HTML parser?
>> MetaNames swishtitle swishdocpath swishdescription div.product-authors
>>
>> in the index config file.
>> Is this possible? Would I have to convert to strict xhtml in order
>> to use the XML parser to use the class attribute as a
>> property/metatag? Or am I missing something else?
>> What occurs when I try the above is that the index appears to work
>> (it reports "4 properties sorted." without any errors), but the
>> search script returns "Unknown property name to sort by: Property
>> 'div.authors' is not defined in index '<my index file>'" when I try
>> to search by div.authors.
>> Anyone have an example of something like this working?
>> Thanks for any help,
>> Thomas Sewell

Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 -
Received on Fri Mar 12 12:54:14 2004