Re: switched to server, still no luck (almost)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Oct 25 2005 - 17:42:36 GMT

On Tue, Oct 25, 2005 at 10:11:54AM -0700, J. David Boyd wrote:
> > Perhaps it's indexed under a different metaname?  We can only guess
> > since you are not providing any examples that we can reproduce.
> > 
> 
> Hmm, how would I tell?

Well, this is what I'd do:

"Hum, I can't find "routable" but I'm sure it's in file "Foo.html"
that I'm indexing.  Ok, let me step back a bit.  First, I'll index
just that one file and look at what words are indexed:

    swish-e -i Foo.html -T indexed_words | grep routable 
        Adding:[1:swishdefault(1)]   'routable'   Pos:40  Stuct:0x9 ( BODY FILE )

Ok, so I see it's being indexed as "swishdefault".  (But if it wasn't
then I'd go in and start hacking away at Foo.html to see why -- and
also enable ParserWarnLevel to see if the parser will find anything
wrong.)

Now, index the same file using my config and look for routable:

    swish-e -i Foo.html -T indexed_words -c config | grep routable

And if it does show up you will know which metaname it's indexed
under.  If it doesn't show up then you know there's something in the
config that's making it not show up.  Start commenting out lines in
the config until the word shows up again."
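
To make the "which metaname?" step less of an eyeball job, here is a minimal sketch of a helper that reads -T indexed_words output on stdin and prints the metaname(s) a word was indexed under.  The function name metanames_for is made up for illustration; it is not part of swish-e, and it assumes the "Adding:[file:metaname(id)]" line format shown above.

```shell
# Hypothetical helper (not part of swish-e): list the metanames a word
# was indexed under, given `swish-e ... -T indexed_words` output on stdin.
# Assumes lines of the form:
#   Adding:[1:swishdefault(1)]   'routable'   Pos:40  Stuct:0x9 ( BODY FILE )
metanames_for() {
    word=$1
    # Keep only lines for the quoted word, then pull out the name that
    # sits between "Adding:[N:" and the following "(".
    grep -F "'$word'" \
        | sed -n 's/.*Adding:\[[0-9]*:\([^(]*\)(.*/\1/p' \
        | sort -u
}
```

Something like "swish-e -i Foo.html -T indexed_words -c config | metanames_for routable" would then print one metaname per line, or nothing at all if the word never got indexed.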


> I'm in the ~/share/doc/swish-e/examples/conf directory,
> and I'm running
> 
> swish-e -S prog -c example9.config

Those examples are really there to walk you through various ways of
doing things with swish.

That will work, but I'd probably use a different method for parsing
the pdf files.  spider.pl will automatically filter for you by
default.  Or there's a program called DirTree.pl that walks a
directory tree (instead of using a web server).

You can also use swish-filter-test to index one file for testing:

$ swish-filter-test -content -headers -quiet  test.pdf | swish-e -S prog -i stdin
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,547 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1,547 unique words indexed.
4 properties sorted.                                              
1 file indexed.  43,751 total bytes.  6,870 total words.
Elapsed time: 00:00:02 CPU time: 00:00:00
Indexing done!

Or maybe:

$ swish-filter-test -content -headers -quiet  test.pdf | swish-e -S prog -i stdin -v0 -T indexed_words | grep recommended 
    Adding:[1:swishdefault(1)]   'recommended'   Pos:1809  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'recommended'   Pos:2887  Stuct:0x9 ( BODY FILE )


> /usr/home/tsc0/public_html/add/MOD_0/AAA-MOD0.TBL.pdf - Using XML parser
> - !!!Adding automatic MetaName 'all' found in file

I find little use for "auto" metanames.  But, that's likely your
problem.
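
If you want to see which automatic metanames got created during a run, a rough sketch like the one below can pull them out of the verbose indexing output.  The function name auto_metanames is hypothetical, and it assumes the message format quoted above ("Adding automatic MetaName 'all' found in file").

```shell
# Hypothetical helper (not part of swish-e): list automatic metanames
# reported in verbose indexing output read from stdin.
# Assumes messages of the form:
#   - !!!Adding automatic MetaName 'all' found in file
auto_metanames() {
    # Capture the name between the single quotes after "MetaName".
    sed -n "s/.*Adding automatic MetaName '\([^']*\)'.*/\1/p" | sort -u
}
```

If a word did land under one of those names, a search that names the metaname, e.g. swish-e -w 'all=routable', is the kind of query that should find it, assuming 'all' really is where the word ended up.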

> 
> Warning: XML parse error in file
> '/usr/home/tsc0/public_html/add/MOD_0/AAA-MOD0.TBL.pdf' line 18.  Error:
> not well-formed

That's a bit odd.  Maybe something isn't being escaped correctly or
there's an odd encoding error.  Something to look at later.

> Then, like I said,
> 
> swish-e -T index_all_words shows me all the words I am looking for, but
> I can't get one with the "-w".
> 
> I thought that a 'swish-e -w WORD' would be the least restrictive kind
> of search...

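That would be an incorrect assumption.  Swish doesn't work that way:
a bare -w WORD only searches the default metaname (swishdefault), not
every metaname in the index.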

-- 
Bill Moseley
moseley@hank.org

Received on Tue Oct 25 10:42:36 2005