Re: [swish-e] "-S prog" mashing up words in HTML files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 21 2007 - 00:54:41 GMT
On Tue, Mar 20, 2007 at 07:19:42PM -0500, Matthew Stanislawski wrote:
> Here's the output of my script, for this particular document:
> http://mattstan.net/spew.out

Do you see problems when you run it like this:

    cat spew.out | swish-e -i stdin -S prog -T indexed_words -c c

See my output below.

> Getting a lot of errors like this:
> 
> https://opcenter-test.cso.uiuc.edu/doc/DOORCODES:33: error: Entity 
> 'nbsp' not defined

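Well, it's being parsed as an XML doc, and XML only predefines five
entities (amp, lt, gt, apos, quot), so you have to define anything else,
such as nbsp, yourself, I suspect.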
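
One possible workaround, sketched here as an assumption rather than a tested fix for this setup, is to rewrite the entity as a numeric character reference before the document reaches the parser, since XML accepts numeric references without any DTD. The helper name fix_nbsp is hypothetical. Conveniently, "&nbsp;" and "&#160;" are both six bytes, so any byte counts the -S prog script puts in its headers would stay valid.

```shell
# Hypothetical helper (not part of swish-e): rewrite the HTML-only entity
# &nbsp; as the numeric character reference &#160;, which an XML parser
# accepts without an entity declaration.
# Both strings are six bytes long, so the document length is unchanged.
fix_nbsp() {
    # In a sed replacement, a bare & means "the matched text", so it is escaped.
    sed 's/&nbsp;/\&#160;/g'
}

# Usage, with the same flags as in this thread:
#   fix_nbsp < spew.out | swish-e -i stdin -S prog -T indexed_words -c c
```

This only covers nbsp; other HTML entities in the input would need the same treatment or a proper entity declaration.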

You probably should be using the XML2 parser, too.

    DefaultContents XML2


> Given the first error above, that's what it looks like.  I didn't 
> install libxml2 myself, however, so I don't exactly know my way around 
> it.  How can I troubleshoot libxml2 directly?

Can you check what version you have installed?
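
If you are not on a Debian-style system, a rough sketch like the following may help. It assumes the xml2-config or xmllint tools that normally ship with libxml2 (or its development package) are on your PATH; the function name libxml2_version is made up for illustration.

```shell
# Print the installed libxml2 version using whichever libxml2 tool is
# available; fall back to a message if neither is on PATH.
libxml2_version() {
    if command -v xml2-config >/dev/null 2>&1; then
        xml2-config --version
    elif command -v xmllint >/dev/null 2>&1; then
        # xmllint has historically written its version banner to stderr,
        # so merge stderr into stdout before taking the first line.
        xmllint --version 2>&1 | head -n 1
    else
        echo "libxml2 tools not found on PATH"
    fi
}

libxml2_version
```

To exercise the parser directly on a single document saved to a file, `xmllint --noout yourfile.xml` should report the same kind of "Entity 'nbsp' not defined" error without swish-e in the picture.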

I'm on Debian:

moseley@bumby:~/WS2/lib/WS2/C$ apt-cache policy aspell
aspell:
  Installed: 0.60.5-1
  Candidate: 0.60.5-1
  Version table:
 *** 0.60.5-1 0
        500 http://128.101.240.212 unstable/main Packages
        100 /var/lib/dpkg/status

You can compare your output with what I get:

moseley@bumby:~$ cat spew.out | swish-e -i stdin -S prog -T indexed_words -c c
Indexing Data Source: "External-Program"
Indexing "stdin"
    Adding:[1:swishdocpath(14)]   'https'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'opcent'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'test'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'cso'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'uiuc'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'edu'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'doc'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(14)]   'doorcod'   Pos:8  Stuct:0x1 ( FILE )
    Adding:[1:source(20)]   'doc'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:source(20)]   '5'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:owner(21)]   'monnin'   Pos:9  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '1170455836'   Pos:13  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'opcent'   Pos:16  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'door'   Pos:17  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'system'   Pos:18  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'door'   Pos:19  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'inform'   Pos:20  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'door'   Pos:22  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'name'   Pos:23  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'connect'   Pos:24  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'room'   Pos:25  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'access'   Pos:26  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'via'   Pos:27  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'sensor'   Pos:28  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'descript'   Pos:29  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'm1'   Pos:30  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   '1452'   Pos:31  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'and'   Pos:32  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   '1440'   Pos:33  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'prox'   Pos:34  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'card'   Pos:35  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'yes'   Pos:36  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'glass'   Pos:37  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'doubl'   Pos:38  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'door'   Pos:39  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'between'   Pos:40  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'opcent'   Pos:41  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'and'   Pos:42  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'datacent'   Pos:43  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'main'   Pos:44  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'entranc'   Pos:45  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'into'   Pos:46  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'datacent'   Pos:47  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'm2'   Pos:48  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   '1440'   Pos:49  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'and'   Pos:50  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   '1420'   Pos:51  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'prox'   Pos:52  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'card'   Pos:53  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'yes'   Pos:54  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'metal'   Pos:55  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'singl'   Pos:56  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'door'   Pos:57  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'between'   Pos:58  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'datacent'   Pos:59  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'and'   Pos:60  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'help'   Pos:61  Stuct:0x1 ( FILE )
    Adding:[1:excerpt(23)]   'des'   Pos:62  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'request'   Pos:69  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'review'   Pos:70  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'main'   Pos:75  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'tool'   Pos:78  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'doormon'   Pos:81  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'opcent'   Pos:85  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'door'   Pos:86  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'system'   Pos:87  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'door'   Pos:88  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'inform'   Pos:89  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'door'   Pos:94  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'door'   Pos:101  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'name'   Pos:102  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'connect'   Pos:107  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'room'   Pos:108  Stuct:0x1 ( FILE )
https://opcenter-test.cso.uiuc.edu/doc/DOORCODES:33: error: Entity 'nbsp' not defined
body><tr><td>Door Name<br/></td><td>Connects Rooms<br/></td><td>Access Via&nbsp;
                                                                               ^
    Adding:[1:details(15)]   'access'   Pos:112  Stuct:0x1 ( FILE )
    Adding:[1:details(15)]   'via'   Pos:113  Stuct:0x1 ( FILE )

-- 
Bill Moseley
moseley@hank.org

Received on Tue Mar 20 20:54:41 2007