Re: metaname limit?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 08 2004 - 12:49:23 GMT

On Thu, Oct 07, 2004 at 10:04:14PM -0700, Mark Greenaway wrote:

Sorry for the delay.  It got really dark on this side of the planet
for a few hours.


> Further to previous post

Thanks for the simple setup to test.  Please also show the commands
you are running and their output, so I can follow exactly what you
are doing.
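
One way to capture both at once is a small shell helper that prints the
command line and then runs it with stderr folded into stdout. This is
only a sketch: the name run_logged and the file session.log are made up
for illustration, not anything swish-e provides.

```shell
# Sketch: print a command line, then run it with stderr merged into
# stdout, so the command and its complete output can be pasted together.
# run_logged is a hypothetical helper name, not part of swish-e.
run_logged() {
    printf '$ %s\n' "$*"   # show the command roughly as typed
    "$@" 2>&1              # run it; fold stderr into stdout
}

# Harmless demonstration:
run_logged echo hello
```

For the spidering run that would be something like
"run_logged swish-e -c c -S prog -v0 -T properties | tee session.log",
and session.log is then what to paste into the reply.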

I tried it two ways, and both work fine.

First without spidering:

moseley@laptop:~$ cat c
MetaNames outputs organisation strategy domain mission hqcountry countries web email
PropertyNames outputs organisation strategy domain mission hqcountry countries web email
SwishProgParameters nacl.pl
IndexDir spider.pl

moseley@laptop:~$ cat t.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<meta name="organisation" content="Site4">
<meta name="strategy" content="research education">
<meta name="domain" content="government politics law">
<meta name="outputs" content="Papers Journals Newsletters Policy Research">
<meta name="countries" content="Australia">
<meta name="hqcountry" content="Australia">
<meta name="mission" content="To influence decision makers">
<meta name="web" content="http://www.site4.org.au">
<meta name="email" content="jim@site4.org.au">
<TITLE>Site4 - confusion reigns</TITLE>
</HEAD>
<BODY>
<H1>Site4 - NACL Matrix test site</H1>
<hr>
<a href="http://incres.anu.edu.au/nacl/index.html">link</a>
<hr>
</BODY>
</HTML>

moseley@laptop:~$ swish-e -c c -i t.html -v0 -T properties                  
          swishdocpath: 6 (  6) S: "t.html"
            swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
          swishdocsize: 8 (  4) N: "725"
     swishlastmodified: 9 (  4) D: "2004-10-08 05:33:21 PDT"
               outputs:19 ( 43) S: "Papers Journals Newsletters Policy Research"
          organisation:20 (  5) S: "Site4"
              strategy:21 ( 18) S: "research education"
                domain:22 ( 23) S: "government politics law"
               mission:23 ( 28) S: "To influence decision makers"
             hqcountry:24 (  9) S: "Australia"
             countries:25 (  9) S: "Australia"
                   web:26 ( 23) S: "http://www.site4.org.au"
                 email:27 ( 16) S: "jim@site4.org.au"

Ok, now spidering:

moseley@laptop:~$ swish-e -c c -S prog -v0 -T properties | grep swishtitle
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'nacl.pl'

Summary for: http://incres.anu.edu.au/nacl/matrixorgs.html
Connection: Close:     2  (0.2/sec)
   Off-site links:     2  (0.2/sec)
      Total Bytes: 1,734  (133.4/sec)
       Total Docs:     3  (0.2/sec)
      Unique URLs:     3  (0.2/sec)
            swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
            swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
            swishtitle: 7 ( 24) S: "Site4 - confusion reigns"

Then to see what's in the index:

(swish prints things twice when dumping the index -- it's accessing
the properties using two different methods, IIRC):


moseley@laptop:~$ swish-e -T index_files | grep swishtitle             
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
            swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
            swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
            swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
            swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
            swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
            swishtitle: 7 ( 24) S: "Site4 - confusion reigns"

(Hey, Peter -- what's that IDF warning about?)


> Even this tiny example when run has no swishtitle
> If you remove one of the metatags from site4.html then swishtitle shows up

So, what am I doing differently?

Are you, by chance, forgetting to specify the index file when dumping
the index?
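
To make that concrete, here is a minimal sketch of dumping with the
index file named explicitly via -f. The path in INDEX is an assumption
(index.swish-e in the current directory); point it at whatever file
your indexing run actually wrote. dump_titles is just an illustrative
helper name.

```shell
# Sketch: dump the index while naming the index file explicitly.
# INDEX is an assumed path; change it to the file your indexing run wrote.
INDEX="${INDEX:-index.swish-e}"

dump_titles() {
    # -f selects the index file; -T index_files dumps its contents.
    swish-e -f "$1" -T index_files | grep swishtitle
}

if [ -f "$INDEX" ]; then
    dump_titles "$INDEX"
else
    # A missing file here suggests the index was written under another name.
    echo "index file not found: $INDEX" >&2
fi
```

If the index was created with a different name (for example with -f at
indexing time), the same name has to be given here, otherwise the dump
reads some other index or none at all.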



-- 
Bill Moseley
moseley@hank.org

Received on Fri Oct 8 05:49:37 2004