Re: (Re)Definition of swishdefault

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 28 2002 - 15:21:39 GMT
Sorry, I'm rotten at giving short answers...

At 05:47 AM 08/28/02 -0700, Guido Adam wrote:
>All tags are defined as meta-tags in the swish.conf:
>
>	MetaNames document url size type date crawldate keywords \
>		description link title content
>
>Problem:
>If I search, I have to do something like
>
>	swish-e -f index_test -w "content=harry"
>
>I'd like to do
>
>	swish-e -f index_test -w "harry"

Note that that is the same thing as

       swish-e -f index_test -w swishdefault=harry

That means your front-end code can be more generic.  Since *everything* is
a metaname you can always specify a metaname and that will make it easier
to program:

	$swish_query = "$metaname=($query_words)";
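
A minimal sketch of that generic front end, written in Python rather than Perl. The helper names build_swish_query and build_swish_argv are hypothetical and not part of swish-e; only the -f and -w switches and the metaname=(words) form come from this thread.

```python
def build_swish_query(query_words, metaname="swishdefault"):
    """Build a -w query string that limits the words to one metaname."""
    words = query_words.strip()
    if not words:
        raise ValueError("query_words must not be empty")
    # Parentheses make the metaname apply to every word, not just the first.
    return f"{metaname}=({words})"


def build_swish_argv(index_file, query_words, metaname="swishdefault"):
    """Return an argument list suitable for subprocess.run (no shell quoting)."""
    return ["swish-e", "-f", index_file, "-w",
            build_swish_query(query_words, metaname)]
```

Because swishdefault is itself a metaname, the same code path covers both the "no metaname given" case and a search limited to, say, content.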

>Is it possible to "define" <content> as swishdefault, <title> as 
>swishtitle, <url> as swishdocpath and <description> as swishdescription? If 
>so, how to do that?

I think you are mixing some concepts here.  Or at least you are asking two
questions.

Swish has properties and metanames.  Metanames are used for searching,
whereas properties store data associated with each file.  The naming is
kind of backwards, since properties are really the metadata.

So, you can alias the meta names while indexing:

  http://swish-e.org/2.2/docs/SWISH-CONFIG.html#item_MetaNameAlias

Remove "content" from MetaNames and instead add it as:

  MetaNameAlias swishdefault content

Then searching with swish-e -w foo will find "foo" even if it was in the
<content> tag.  Index a single document with the -T indexed_words option
and you can see how it works.

Now, the other tags you list above sound more like properties.  So then you
would use PropertyNameAlias instead.

So I think those are your answers.

If you are not *mixing* indexing of HTML and XML docs, then there's no need
to map (alias) your tag names onto the default propertynames that swish
uses.  Just use your names and use -x to get out the data you want.

That's how swish works internally.  It just uses a default -x setting of:

    "%r %p \"%t\" %l"

which is in long form:

    -x '<swishrank> <swishdocpath> "<swishtitle>" <swishdocsize>\n'
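
To make the short-to-long correspondence concrete, here is a small Python sketch. It is an illustration only: the function name expand_format is hypothetical, and the table covers just the four codes that appear in this message, not the full set swish-e accepts.

```python
import re

# Short codes and the property names they stand for, as given in this message.
SHORTCUTS = {
    "r": "swishrank",
    "p": "swishdocpath",
    "t": "swishtitle",
    "l": "swishdocsize",
}


def expand_format(fmt):
    """Rewrite %x codes into <propertyname> form; unknown codes are left alone."""
    def replace(match):
        name = SHORTCUTS.get(match.group(1))
        return f"<{name}>" if name else match.group(0)

    return re.sub(r"%([A-Za-z])", replace, fmt)
```

Any property you define yourself can be used in the long form the same way, which is why aliasing onto the default names is not required for output.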


Now, the "title" is a special case, and I'm not really sure what you want
to do.  I'll try to explain below.

>The index contains xml data only.

Just to be clear, HTML and XML parsing are basically the same.  There are
three differences.

1) HTML tags are not added when using "UndefinedMetaTags auto".
"UndefinedMetaTags auto" might be useful when you are indexing XML and want
every tag to be automatically created as a Metaname.  (My guess is this is
not that useful of a configuration setting.)

2) HTML tags set flags on the word indicating *where* in the HTML doc a
word is found, such as in the <head>, <title>, <body>, <strong|b|em|i>,
<h*>.  These flags do two things.  First, they can be used with the -t
switch to limit searches to words in those sections of a document (anyone
use that feature?)  Second, the flags are used in ranking to rank some
words higher than others, most commonly title words are ranked higher than
body words.

(BTW -- that flag is called the word's "structure")

3) Text in HTML <title> tags is indexed as swishdefault, so you
automatically search the title in addition to the body of the document.

The MetaNameAlias thing happens after processing HTML tags.  So, although
you can do:

  MetaNameAlias swishdefault title

and get your <title> indexed as swishdefault, it will *not* have the flags
to indicate that it is a title word and rank higher in search results.

One plan is to be able to set a ranking bias by metaname so that you could
say, rank words in <keywords>...</keywords> higher.  But that doesn't solve
the problem of indexing <title> as swishdefault, and also making those
words rank higher.  Plus, that won't work for aliases since alias mapping
happens at indexing and rank calculation is done at search time.

You can't make the parser assign the flags by simply indexing your .xml
files as type HTML2 because the tag mapping doesn't happen in the parser --
that is, the parser doesn't rename the tags before swish sees them (the
mapping happens when swish looks up the tag's ID number).  You wouldn't
want that because then you couldn't have separate alias mappings for
metanames and property names.

It might be possible to have yet another config option that allows you
to set the flags on tags.  Something like:

  MetaNameAlias swishdefault title
  StructureFlags title in_title in_head

So that would emulate what happens when processing HTML.  Words in a
<title> tag get indexed as swishdefault metaname, plus those words are
flagged as being title words (and in the <head> section, too).

If I wanted that behavior today I'd use -S prog and write a perl program to
parse my XML and output HTML.
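
A rough sketch of such a converter, in Python rather than Perl. The element names title and content are taken from the question; the function names are hypothetical. The header lines (Path-Name, Content-Length, Document-Type) reflect my understanding of what -S prog expects, so check them against the docs for your version before relying on them.

```python
import html
import xml.etree.ElementTree as ET


def xml_to_html(xml_text):
    """Turn one XML record into minimal HTML so the HTML parser sets title flags."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    content = root.findtext("content", default="")
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>"
        "<body>" + html.escape(content) + "</body></html>\n"
    )


def prog_record(path, html_text):
    """Prefix the document with headers; Content-Length counts bytes, not chars."""
    body = html_text.encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        "Document-Type: HTML*\n"  # assumed header for selecting the HTML parser
        "\n"
    )
    return header.encode("utf-8") + body
```

A driver would loop over the .xml files and write each prog_record(...) result to standard output in binary mode. Words from <title> then pass through the HTML parser and pick up the title structure flags, with no aliasing needed.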

Ok, time for another cup of coffee.....


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Aug 28 15:25:08 2002