Re: categorizing information

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 13 2004 - 14:49:22 GMT
On Thu, May 13, 2004 at 02:53:26AM -0700, Jonas Wolf wrote:
> <xml>
>   <top>
>     <abstract>An Abstract</abstract>
>     <descript>A description</descript>
>     ... lots of other meta tags like this
>   </top>
> </xml>
> 
> Using this, I can trigger a search using 'top=whatever', and it gives me 
> all the results I want. And I can still use 'abstract=whatever' to 
> restrict the search to the <abstract> tags.

Seems like you could do the same with xml=whatever and use <xml> as your
top-level tag.

> Is this the best solution for the problem, or is there another way to do 
> this?

Is the "problem" how to use MetaNamesRank?

Unfortunately, it won't work with MetaNamesRank.  The problem is that
when you have nested meta names, swish indexes the text for each meta
name separately.  That is, when you search for top=whatever, swish isn't
smart enough to then look in all the nested metanames; it just looks in
the top metaname -- and uses the "top" MetaNamesRank.

So, for example, if "top" and "abstract" have different MetaNamesRank
settings, a search for top=foo and a search for abstract=foo would rank
the same document differently.  Sucks, but that's the way things worked
out.

One way around that might be to change "top=foo" into 
"abstract=(foo) OR descript=(foo) OR ....".
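
The rewrite above could be done in whatever front end builds the query.
A minimal sketch in Python, assuming the list of nested metanames is
known to the caller; the function name expand_meta_query is made up for
illustration and is not part of swish:

```python
def expand_meta_query(term, metanames):
    """Build 'name=(term) OR name=(term) ...' over the given metanames."""
    if not metanames:
        raise ValueError("need at least one metaname")
    # One clause per nested metaname, joined with OR.
    return " OR ".join(f"{name}=({term})" for name in metanames)


query = expand_meta_query("foo", ["abstract", "descript"])
print(query)  # abstract=(foo) OR descript=(foo)
```

The resulting string would then be passed as the search words (for
example with swish-e's -w option), so each metaname is searched with its
own MetaNamesRank.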

Here's what I have done in the past.  When indexing HTML, swish maintains
a "structure" for each word which tells swish where that word appeared
-- like in <title> or <h1> or <b> -- and that is used for adjusting the
rank of each word.  So you can take advantage of that when you create
your document:

<html>
<head>
<title>TitleText</title>
</head>
<body>
    <top>
       <abstract><h1>An Abstract</h1></abstract>
       <descript>A description</descript>
    </top>
</body>
</html>

Notice the use of fake HTML tags as meta names.  That <h1> will flag
the words inside it as HEADING in the structure, which adjusts the rank
of those words.  rank.c does the ranking.

The bias for different structure values is in config.h:

config.h:#define RANK_TITLE             7  // <title>
config.h:#define RANK_HEADER            5  // <h*>
config.h:#define RANK_META              3  // <meta> or any <tag>
config.h:#define RANK_COMMENTS          1  // <!-- comment -->
config.h:#define RANK_EMPHASIZED        0  // <em> <b> <strong> <i>
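
To get a feel for what those numbers mean for a given word, here is a
rough sketch in Python.  It ASSUMES the biases of all flags set on a
word simply add up; that is an illustration only, and rank.c may combine
them differently.  The flag names are the labels swish prints in its
structure output, and structure_bias is a made-up helper:

```python
# Bias values copied from the config.h lines above.
RANK_BIAS = {
    "TITLE": 7,       # RANK_TITLE
    "HEADING": 5,     # RANK_HEADER
    "META": 3,        # RANK_META
    "COMMENTS": 1,    # RANK_COMMENTS
    "EMPHASIZED": 0,  # RANK_EMPHASIZED
}


def structure_bias(flags):
    """Sum the bias of every flag set on a word (assumed additive)."""
    # Flags without a bias in config.h (e.g. BODY, FILE) count as 0 here.
    return sum(RANK_BIAS.get(flag, 0) for flag in flags)


print(structure_bias(["HEADING", "META", "BODY", "FILE"]))  # 8
print(structure_bias(["META", "BODY", "FILE"]))             # 3
```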

Using -T indexed_words shows the structure on each word:

moseley@bumby:~$ swish-e -c c -i 1.html -T indexed_words -v0
    Adding:[1:swishdefault(1)]   'titletext'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:top(10)]   'an'   Pos:14  Stuct:0xA9 ( HEADING META BODY FILE )
    Adding:[1:abstract(11)]   'an'   Pos:14  Stuct:0xA9 ( HEADING META BODY FILE )
    Adding:[1:top(10)]   'abstract'   Pos:15  Stuct:0xA9 ( HEADING META BODY FILE )
    Adding:[1:abstract(11)]   'abstract'   Pos:15  Stuct:0xA9 ( HEADING META BODY FILE )
    Adding:[1:top(10)]   'a'   Pos:19  Stuct:0x89 ( META BODY FILE )
    Adding:[1:descript(12)]   'a'   Pos:19  Stuct:0x89 ( META BODY FILE )
    Adding:[1:top(10)]   'description'   Pos:20  Stuct:0x89 ( META BODY FILE )
    Adding:[1:descript(12)]   'description'   Pos:20  Stuct:0x89 ( META BODY FILE )

So "titletext" is flagged as being in both <head> (HEAD) and <title>
(TITLE), and "abstract" is flagged as <h*> (HEADING), and in a metaname
(<top>) and also in the <BODY>.  Everything has the FILE bit set.
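
The hex values can be decoded by hand.  A small Python sketch, with the
bit values INFERRED from the three values in the output above (0x7, 0xA9
and 0x89) rather than taken from the swish source: the output alone
cannot tell HEAD from TITLE (0x02/0x04) or BODY from META (0x08/0x80),
so those pairings are assumptions.  Bits 0x10 and 0x40 do not appear in
the output and are reported as unknown:

```python
# (bit, label) pairs inferred from the -T indexed_words output above.
FLAGS = [
    (0x80, "META"),     # assumption: could be swapped with BODY
    (0x20, "HEADING"),
    (0x08, "BODY"),     # assumption: could be swapped with META
    (0x04, "TITLE"),    # assumption: could be swapped with HEAD
    (0x02, "HEAD"),     # assumption: could be swapped with TITLE
    (0x01, "FILE"),
]


def decode_structure(value):
    """Return the labels of all known bits set in an 8-bit structure value."""
    names = [label for bit, label in FLAGS if value & bit]
    known = sum(bit for bit, _ in FLAGS)
    leftover = value & ~known & 0xFF
    if leftover:
        names.append(f"UNKNOWN_0x{leftover:02X}")
    return names


for value in (0x07, 0xA9, 0x89):
    print(hex(value), decode_structure(value))
```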

Not a great system, but it does give you a way to adjust ranking on
specific words.  There are eight bits in the "structure", so it's
possible, I suppose, to get 256 different ranking levels for your use.
Well, everything has the FILE bit set for some reason -- not sure if
that's a requirement in the code or not -- but that might mean you would
only have 7 bits and 128 rank levels.

That would take a little source code hacking, but not much.
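
The arithmetic, plus one way the seven free bits might be used, as a
Python sketch.  The FILE bit value 0x01 is inferred from the output above
(it is the only bit common to 0x7, 0xA9 and 0x89), and pack_level /
unpack_level are a purely hypothetical encoding, not anything swish does
today:

```python
FILE_BIT = 0x01  # inferred from the debug output, not read from the source

all_values = range(256)                              # 8 bits
with_file = [v for v in all_values if v & FILE_BIT]  # FILE always set
print(len(all_values), len(with_file))  # 256 128


def pack_level(level):
    """Hypothetical: store a custom rank level 0..127 in the upper 7 bits."""
    if not 0 <= level < 128:
        raise ValueError("level must fit in 7 bits")
    return (level << 1) | FILE_BIT


def unpack_level(structure):
    """Recover the custom level, dropping the FILE bit."""
    return structure >> 1
```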

-- 
Bill Moseley
moseley@hank.org
Received on Thu May 13 07:49:24 2004