When I search by the 'name' metaname, I get this:
~/local/bin/swish-e -f ~/local/oldarchives.index -w "name=ostrowsky"
# SWISH format: 2.4.5
# Search words: name=ostrowsky
# Removed stopwords:
err: Unknown metaname: 'name'
.
I indexed the source with this command:
~/local/bin/swish-e -S prog -c ~/local/oldarchives.conf
~/local/oldarchives.conf contains these non-comment lines:
IndexFile /home/users/c/codi/local/oldarchivesmeta.index
SwishProgParameters /home/users/c/codi/local/spideroldarchives.conf
IndexDir /home/users/c/codi/local/lib/swish-e/spider.pl
MetaNames name email date subject
PropertyNames name email date subject
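One thing worth ruling out at this point: the search command above reads ~/local/oldarchives.index, while the IndexFile directive names oldarchivesmeta.index. The following is only a minimal sketch of how the two could be compared. The helper names index_file_from_conf and same_index_name are hypothetical, and the throwaway config simply mirrors the directives listed above.

```shell
#!/bin/sh
# Print the value of the IndexFile directive from a swish-e config file.
# Commented-out lines are skipped because their first field is not exactly
# "IndexFile". If the directive appears more than once, this sketch reports
# the last one (an assumption of the sketch, not documented swish-e behaviour).
index_file_from_conf() {
    awk '$1 == "IndexFile" { value = $2 } END { if (value != "") print value }' "$1"
}

# Succeed only if the index named in the config and the index being searched
# share the same file name. Directories are ignored, so a path written with ~
# compares equal to the same file written with an absolute path.
same_index_name() {
    conf_index=$(index_file_from_conf "$1")
    [ -n "$conf_index" ] && [ "$(basename "$conf_index")" = "$(basename "$2")" ]
}

# Demo with a throwaway config that mirrors the directives listed above.
demo_conf=$(mktemp)
printf '%s\n' \
    '# sample config' \
    'IndexFile /home/users/c/codi/local/oldarchivesmeta.index' \
    'MetaNames name email date subject' > "$demo_conf"

# The ~ is deliberately left unexpanded; only the file name is compared.
if same_index_name "$demo_conf" '~/local/oldarchives.index'; then
    echo "search and config use the same index file name"
else
    echo "search uses a different file than IndexFile names: $(index_file_from_conf "$demo_conf")"
fi
rm -f "$demo_conf"
```

If the two names differ, the index being searched may come from an earlier run that did not define the metanames.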
/home/users/c/codi/local/spideroldarchives.conf contains these
non-comment lines:
my ($filter_sub, $response_sub ) = swish_filter();
@servers = ({
base_url => 'http://username:password@host/path/index.html',
agent => 'swish-e spider http://swish-e.org/',
email => 'ben@benostrowsky.com',
filter_content => $filter_sub,
test_url => sub {
if (($_[0]->path =~ /narchive/) && !($_[0]->path =~ /\.txt$/)) {
return 1; }
return 0;
},
test_response => sub {
my $server = $_[1];
$server->{no_index}++ if
$_[0]->path =~ /(author|thread|subject|date|maillist|threads)\.html/;
return 1;
},
ignore_robots_file => 1,
delay_sec => 2, # Delay in seconds between requests
keep_alive => 1, # enable keep alives requests
use_cookies => 1,
debug => "url, info, headers"
}
);
1;
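The test_response callback is meant to keep the MHonArc index pages (maillist.html, threads.html and so on) out of the index. In a pattern like this, parentheses and square brackets behave very differently: parentheses give an alternation of whole words, while square brackets give a character class that matches any single listed character. The following is a minimal sketch that uses grep -E as a stand-in for the Perl match. The helper names and the about.html path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the path names one of the archive's index
# pages. Parentheses make an alternation of whole words, and \. is a literal dot.
is_index_page() {
    printf '%s\n' "$1" | grep -Eq '(author|thread|subject|date|maillist|threads)\.html'
}

# The same words inside square brackets form a character class, so this
# matches any one listed character, then any character, then "html".
loose_match() {
    printf '%s\n' "$1" | grep -Eq '[author|thread|subject|date|maillist|threads].html'
}

for path in /narchive/2003/maillist.html \
            /narchive/2003/msg05184.html \
            /narchive/2003/about.html; do
    if is_index_page "$path"; then strict=skip; else strict=index; fi
    if loose_match "$path"; then loose=skip; else loose=index; fi
    echo "$path alternation=$strict brackets=$loose"
done
```

Both forms skip maillist.html and both index msg05184.html, so the difference only shows up for other pages whose names happen to end in one of the listed characters, such as the hypothetical about.html.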
When I run swish-filter-test -verbose -content
http://username:password@host/path/document.html, I get the correct
document in return, with the metadata inserted in the <head> element
just as I intend:
<!-- MHonArc v2.6.8 -->
<!--X-Subject: RE: [HORIZON-L] RPA ... -->
<!--X-From-R13: "Prgu Yebruyre" <oxebruyreNzhacy.bet> -->
<!--X-Date: Mon, 7 Mar 2005 12:16:35 -0700 (MST) -->
<!--X-Message-Id: F4F5AB8DD4D5EE498C36D38DF809BEF6454078@exchange.munpl.org -->
<!--X-Content-Type: multipart/mixed -->
<!--X-Head-End-->
<!doctype html public "-//W3C//DTD HTML//EN">
<html>
<head>
<meta name="date" content="Mon, 7 Mar 2005 14:16:15 -0500" />
<meta name="email" content="bkroehler@munpl.org" />
<meta name="name" content="Beth Kroehler" />
<meta name="subject" content="RE: [HORIZON-L] RPA ..." />
<title>RE: [HORIZON-L] RPA ...</title>
<link rev="made" href="mailto:bkroehler@munpl.org">
</head>
<body>
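To compare the filtered document against the MetaNames line in the config, the meta tag names can be pulled out of the HTML. The following is a minimal sketch, assuming one meta tag per line as in the output above. The helper name meta_names is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on standard input and print the name
# attribute of each <meta name="..."> tag, one per line, sorted.
# Assumes at most one meta tag per line, as in the filter output above.
meta_names() {
    sed -n 's/.*<meta name="\([^"]*\)".*/\1/p' | sort
}

# Demo using the head section shown above.
printf '%s\n' \
    '<head>' \
    '<meta name="date" content="Mon, 7 Mar 2005 14:16:15 -0500" />' \
    '<meta name="email" content="bkroehler@munpl.org" />' \
    '<meta name="name" content="Beth Kroehler" />' \
    '<meta name="subject" content="RE: [HORIZON-L] RPA ..." />' \
    '<title>RE: [HORIZON-L] RPA ...</title>' \
    '</head>' | meta_names
```

For the sample above this lists date, email, name and subject, the same four names that appear on the MetaNames and PropertyNames lines.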
I'm not sure how to verify whether the spider is actually invoking the
filter during indexing. The end of the indexing output, including the
spider's debug messages, looks like this:
?Testing 'filter_content' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05185.html'
+Passed all 1 tests for 'filter_content' user supplied function
?Testing 'test_response' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05184.html'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 2 Cnt: 8540 GET
http://www.codi.org/archives/narchive/2003/msg05184.html 200 OK
text/html 4475 parent:http://www.codi.org/archives/narchive/2003/maillist.html
depth:2
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05183.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05185.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05182.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/msg05186.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/maillist.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
'http://www.codi.org/archives/narchive/2003/threads.html'
+Passed all 1 tests for 'test_url' user supplied function
External Program found: /home/users/c/codi/local/lib/swish-e/spider.pl
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 36,606 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
36,606 unique words indexed.
8 properties sorted.
8,527 files indexed. 66,651,072 total bytes. 5,456,505 total words.
Elapsed time: 00:16:03 CPU time: 00:01:25
Indexing done!
So what am I forgetting? Is the spider actually invoking the filter?
If so, what else do I need to do in order for it to index the
metadata?
Thanks!
Ben
--
"Don't get suckered in by the comments;
they can be terribly misleading.
Debug only code." -- Dave Storer
Received on Tue Sep 18 14:51:37 2007