Re: Can't search metaname derived from ExtractPath

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Apr 14 2004 - 15:36:10 GMT
On Wed, Apr 14, 2004 at 06:38:35AM -0700, Thomas Dowling wrote:
> I'm trying to index XML files (Dublin Core pulled out of an OAI 
> harvester, to be precise), and would like to be able to search for a 
> unique identifier based on the Path header, which in turn is based on 
> the OAI identifier.  My config file includes:
> 
>   ExtractPath docnum regex ![-:/\]!_!g

Where Path is the path name to the document.

You can try -T regex index_words to see what it's doing.  But you are
replacing that set of chars with an underscore, which isn't in
swishwords, so the regex isn't really doing anything other than
indexing the path.

Note the result string:

Original String: '/home/moseley/apache/test.txt'
replace /home/moseley/apache/test.txt =~ m[[-:/\]][_]: Matched
replace home/moseley/apache/test.txt =~ m[[-:/\]][_]: Matched
replace moseley/apache/test.txt =~ m[[-:/\]][_]: Matched
replace apache/test.txt =~ m[[-:/\]][_]: Matched
replace test.txt =~ m[[-:/\]][_]: No Match
  Result String: '_home_moseley_apache_test.txt'
    Adding:[1:docnum(10)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:docnum(10)]   'moseley'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:docnum(10)]   'apache'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:docnum(10)]   'test'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:docnum(10)]   'txt'   Pos:5  Stuct:0x1 ( FILE )
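
The same effect can be reproduced outside swish-e.  The following is
only a rough Python sketch, not swish-e's actual code: the helper names
are made up for illustration, and split_words() assumes that only ASCII
letters and digits count as word characters, which is a simplification
of whatever your WordCharacters setting really is.

```python
import re

PATH = "/home/moseley/apache/test.txt"

def replace_separators(path):
    # Python spelling of the pattern in: ExtractPath docnum regex ![-:/\]!_!g
    # The class matches '-', ':', '/' and a literal backslash.
    return re.sub(r"[-:/\\]", "_", path)

def split_words(text):
    # Assumption: only ASCII letters and digits are word characters,
    # so '_', '/' and '.' all act as word separators.
    return re.findall(r"[A-Za-z0-9]+", text)

replaced = replace_separators(PATH)
print(replaced)               # the string after the replacement
print(split_words(replaced))  # words from the replaced string
print(split_words(PATH))      # words from the untouched path
```

Under that assumption both word lists come out identical, which is why
the replacement buys you nothing for searching.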

ExtractPath is really meant more for extracting part of the path, for
example to pull out the directory right after /home/moseley/:

ExtractPath docnum regex !^/home/moseley/([^/]+).*$!$1!

Original String: '/home/moseley/apache/test.txt'
replace /home/moseley/apache/test.txt =~ m[^/home/moseley/([^/]+).*$][$1]: Matched
  Result String: 'apache'
      Adding:[1:docnum(10)]   'apache'   Pos:1  Stuct:0x1 ( FILE )
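
If you want to try a pattern like this before re-indexing, an
equivalent substitution is easy to test in Python.  This is just a
sketch: extract_first_dir() is a hypothetical helper, and Python's re
module is standing in for the regex library swish-e uses, so edge cases
may differ.

```python
import re

def extract_first_dir(path):
    # Keep only the first path component after /home/moseley/.
    # r"\1" plays the role of $1 in the ExtractPath replacement.
    return re.sub(r"^/home/moseley/([^/]+).*$", r"\1", path)

print(extract_first_dir("/home/moseley/apache/test.txt"))
```

In Python a path that does not start with /home/moseley/ comes back
unchanged, so it is worth testing a few non-matching paths as well.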




-- 
Bill Moseley
moseley@hank.org
Received on Wed Apr 14 08:36:11 2004