
Re: Using ExtractPath to Assign a Property

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Dec 15 2001 - 21:21:01 GMT
At 09:17 AM 12/15/01 -0800, Bob Stewart wrote:
>One thing that might be clearer in the documentation is that it seems like after a successful match, subsequent ExtractPath statements for the same metaname are ignored.

No, that shouldn't happen.


>Say I have two paths:
>/forums/load/foo/bar.html
>/forums/foo/bar.html
>..and I want the metaname "forum" set to foo for both.
>
>This seems to be the way that works:
>
>MetaNames forum
>ExtractPath forum regex !^.*/forums/load/([^/]+)/.*$!$1!
>ExtractPath forum regex !^.*/forums/([^/]+)/.*$!$1!
>
>Even though /forums/load/foo/bar.html would match the second one, it's ignored.

First, let me explain something that's a bit odd about the regular expressions in swish.  This works the same with all of the directives that use regular expressions (ReplaceRules, ExtractPath, (eh, something else?) ).

A list of regular expressions is maintained for each metaname used in ExtractPath.  When processing, they are basically chained together -- with the *resulting* string from the preceding match used as the *input* string for the next.  If there was no match, well, there's no substitution, so the original string is used again.

I think my (weak) idea was that you could break more complicated matches down into steps.  At this moment, though, I can't think of why or where this might be useful, and it has the potential to screw things up if something matches that you didn't think about.
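If it helps, here's the chaining behavior simulated outside swish -- a small Python sketch of the semantics described above (just an illustration of the regex behavior, not swish's actual code):

```python
import re

# The two ExtractPath patterns from the example, in config order.
patterns = [
    (re.compile(r'^.*/forums/load/([^/]+)/.*$'), r'\1'),
    (re.compile(r'^.*/forums/([^/]+)/.*$'), r'\1'),
]

def extract(path):
    s = path
    for regex, repl in patterns:
        # On a match, the substitution result becomes the input
        # for the next pattern; otherwise s is left unchanged.
        if regex.search(s):
            s = regex.sub(repl, s)
    return s

print(extract('/forums/load/foo/bar.html'))  # foo
print(extract('/forums/foo/bar.html'))       # foo
```

For the first path the "load" pattern matches and produces "foo"; the second pattern is then tried against "foo" and fails, so "foo" survives -- exactly what the -T regex trace below shows.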

So, here's a simple script for generating the paths in your example above, so you can try it yourself.

> cat doc.pl
#!/usr/local/bin/perl -w
use strict;

my $content = 'hello';

my @paths = qw(
    /forums/load/foo/bar.html
    /forums/foo/bar.html
);
output_doc( $_ ) for @paths;

sub output_doc {
    my $path = shift;

    my $size = length $content;
    my $mtime = time;

    print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $path

EOF

    print $content;
}

Here's the config:

> cat c
ExtractPath forum regex !^.*/forums/load/([^/]+)/.*$!$1!
ExtractPath forum regex !^.*/forums/([^/]+)/.*$!$1!

(ExtractPath is smart enough to add "forum" to the list of MetaNames in the current dev version).

Now, run indexing with regex tracing.  (I've turned off word wrap in my mail client).

> ./swish-e -c c -S prog -i ./doc.pl -T regex -v 0
Indexing Data Source: "External-Program"

Original String: '/forums/load/foo/bar.html'
replace /forums/load/foo/bar.html =~ m[^.*/forums/load/([^/]+)/.*$][$1]: Matched
  Result String: 'foo'
replace foo =~ m[^.*/forums/([^/]+)/.*$][$1]: No Match
  Result String: 'foo'

So note here that the first pattern matched and the result string is "foo"; then you see that it's trying the next pattern on the result of the first match -- that's where it says "replace foo".

Original String: '/forums/foo/bar.html'
replace /forums/foo/bar.html =~ m[^.*/forums/load/([^/]+)/.*$][$1]: No Match
  Result String: '/forums/foo/bar.html'
replace /forums/foo/bar.html =~ m[^.*/forums/([^/]+)/.*$][$1]: Matched
  Result String: 'foo'

Now, this one again tries both patterns.  The first pattern doesn't match, so the input string for the next pattern is still the original string.
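Incidentally, this chaining is why the order of the ExtractPath lines in the config matters.  If the two patterns were swapped, the general pattern would match '/forums/load/foo/bar.html' first and capture "load", and the more specific pattern would then fail against "load".  Again a Python illustration of the regex behavior, not swish code:

```python
import re

# The same two patterns, but in *reversed* order: general one first.
swapped = [
    (re.compile(r'^.*/forums/([^/]+)/.*$'), r'\1'),
    (re.compile(r'^.*/forums/load/([^/]+)/.*$'), r'\1'),
]

def extract(path, rules):
    s = path
    for regex, repl in rules:
        if regex.search(s):
            s = regex.sub(repl, s)
    return s

print(extract('/forums/load/foo/bar.html', swapped))  # load, not foo
```

So put the more specific patterns first, as in the config above.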

Now run with a different trace option and you can see that it's indeed extracting out "foo":

> ./swish-e -c c -S prog -i ./doc.pl -T indexed_words -v 0      
Indexing Data Source: "External-Program"
    Adding:[1:forum(10)]   'foo'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'hello'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[2:forum(10)]   'foo'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[2:swishdefault(1)]   'hello'   Pos:1  Stuct:0x1 ( FILE )
            ^        
          file number



>The one feature I would find most valuable is incremental indexing. Parts of my site are updated constantly and so it needs to be reindexed daily. 

Stay tuned.

Bob, a bit off topic, but are you using the libxml2 parser for HTML by chance?  It's a tiny bit slower, but much more accurate (and more feature rich).  One thing that I have found interesting is to index both with the HTML parser and with the HTML2 parser, then use -T INDEX_WORDS_ONLY to dump both indexes to text files, then run diff.  You can see how the HTML parser makes mistakes.

Everyone should really use -T index_words_only once.  It's really interesting to see what words are stored in the index.  I think that the config.h settings could be tuned much better to not index a lot of crap, specifically:

#define IGNOREROWV 60
#define IGNOREROWC 60
#define IGNOREROWN 60

Maybe this is too low by default, too.

#define MINWORDLIMIT 1



Bill Moseley
mailto:moseley@hank.org
Received on Sat Dec 15 21:21:12 2001