Re: searching XML documents

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Apr 02 2004 - 15:34:20 GMT
On Fri, Apr 02, 2004 at 05:25:35AM -0800, Peter Karman wrote:
> The swish docs say (under MetaNames):
> ---
> When indexing HTML Swish-e indexes the HTML title as default text, so
> when searching Swish-e will find matches in both the HTML body and the
> HTML title. Swish also, by default, indexes content of meta tags. So:
> 
>      swish-e -w foo
> 
> will find ``foo'' in the body, the title, or any meta tags.

or any meta tags that are not defined as MetaNames.

That's not very clear, so here I go....

Everything is stored as a metaname in the index.  Metanames are
basically a way to have multiple indexes in the same index file.

By default, swish indexes text under the metaname "swishdefault".  Both
of these search "swishdefault":

   -w foo
   -w swishdefault=(foo)

Now, by default (with no special config options) swish indexes HTML
<title>, <meta> and text extracted from <body> as swishdefault.  But if
a meta tag is listed in MetaNames, it is indexed under its own metaname
ID and cannot be searched with just -w foo:

(Notice the slightly broken HTML here; it's still indexed.)

    moseley@laptop:~$ cat test.html
    <html>
    <head>
    <title>Title</title>
    <meta name="meta1" content="meta1text">
    <meta name="meta2" content="meta2text">
    brokentext
    <body>
    bodytext
    </body>
    </html>

    moseley@laptop:~$ cat c
    ParserWarnLevel 9
    MetaNames meta1

    moseley@laptop:~$ swish-e -v0 -i test.html -c c -T indexed_words
        Adding:[1:swishdefault(1)]   'title'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
        Adding:[1:meta1(10)]   'meta1text'   Pos:5  Stuct:0x85 ( META HEAD FILE )
        Adding:[1:swishdefault(1)]   'meta2text'   Pos:9  Stuct:0x5 ( HEAD FILE )
    test.html:7: error: htmlParseStartTag: misplaced <body> tag
    <body>
         ^
        Adding:[1:swishdefault(1)]   'brokentext'   Pos:11  Stuct:0x9 ( BODY FILE )
        Adding:[1:swishdefault(1)]   'bodytext'   Pos:12  Stuct:0x9 ( BODY FILE )
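
As a rough mental model of the trace above, here is a toy Python sketch. It
is only an illustration, not how swish-e actually stores its index; the names
index, add and search are made up. It keeps one word list per metaname and
lets an unqualified search look only at swishdefault:

```python
# Toy model only: one posting set per metaname. Not swish-e's real index format.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(set))  # metaname -> word -> set of files

def add(metaname, word, filename):
    """Record that `word` occurs in `filename` under `metaname`."""
    index[metaname][word.lower()].add(filename)

def search(word, metaname="swishdefault"):
    """Look up `word` in a single metaname (swishdefault unless told otherwise)."""
    return sorted(index[metaname].get(word.lower(), set()))

# Same words and metanames as the -T indexed_words trace.
add("swishdefault", "title", "test.html")
add("meta1", "meta1text", "test.html")
add("swishdefault", "meta2text", "test.html")
add("swishdefault", "brokentext", "test.html")
add("swishdefault", "bodytext", "test.html")

print(search("meta2text"))           # ['test.html']  (meta2 is not a MetaName)
print(search("meta1text"))           # []             (lives under meta1 only)
print(search("meta1text", "meta1"))  # ['test.html']
```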

Swish can't (or doesn't) search all metanames at once because phrase
searches would not work right.  So, to search multiple metanames you
can do:

    -w swishdefault=(foo) OR meta=(foo)
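
If you have more than a couple of metanames, you may want to build that
query string in your front end.  A minimal Python sketch, assuming you keep
your own list of metanames (any_metaname_query is a hypothetical helper, and
it only builds the text passed to -w; it does not run swish-e):

```python
def any_metaname_query(terms, metanames):
    """Build one "name=(terms)" clause per metaname and join them with OR."""
    return " OR ".join(f"{name}=({terms})" for name in metanames)

# "meta1" is the metaname from the example config; use your own list here.
print(any_metaname_query("foo", ["swishdefault", "meta1"]))
# swishdefault=(foo) OR meta1=(foo)
```

Because each clause is scoped to a single metaname, a quoted phrase in
terms is still matched within one metaname at a time.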

The other option is to nest the metanames:

   <name>
        <first>Bill</first>
        <last>Moseley</last>
   </name>

Then, with "MetaNames name first last" in the config, you get:

    moseley@laptop:~$ swish-e -v0 -i x -c c -T indexed_words
        Adding:[1:name(10)]   'bill'   Pos:4  Stuct:0x89 ( META BODY FILE )
        Adding:[1:first(11)]   'bill'   Pos:4  Stuct:0x89 ( META BODY FILE )
        Adding:[1:name(10)]   'moseley'   Pos:7  Stuct:0x89 ( META BODY FILE )
        Adding:[1:last(12)]   'moseley'   Pos:7  Stuct:0x89 ( META BODY FILE )

and then you can search like:

    -w first=(bill)
    -w name=("bill moseley")

But that basically doubles the size of the index.
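
To see where the doubling comes from, here is a toy Python sketch of nested
indexing.  It is an illustration only, not swish-e's parser: index_nested and
METANAMES are made-up names, and the fallback to swishdefault for text
outside any metaname is an assumption of the toy.  Each word is recorded once
for every enclosing element that is a metaname:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METANAMES = {"name", "first", "last"}  # mirrors "MetaNames name first last"

def index_nested(xml_text):
    """Return {metaname: [words]} with each word listed under every enclosing metaname."""
    postings = defaultdict(list)

    def record(text, active):
        # Assumption of this toy: text outside any metaname goes to swishdefault.
        for word in (text or "").split():
            for meta in active or ["swishdefault"]:
                postings[meta].append(word.lower())

    def walk(elem, enclosing):
        active = enclosing + ([elem.tag] if elem.tag in METANAMES else [])
        record(elem.text, active)
        for child in elem:
            walk(child, active)
            record(child.tail, active)  # text after a child belongs to the parent

    walk(ET.fromstring(xml_text), [])
    return dict(postings)

result = index_nested("<name><first>Bill</first><last>Moseley</last></name>")
print(result)
# {'name': ['bill', 'moseley'], 'first': ['bill'], 'last': ['moseley']}
print(sum(len(words) for words in result.values()))  # 4 postings for 2 words
```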

-- 
Bill Moseley
moseley@hank.org
Received on Fri Apr 2 07:34:21 2004