Re: Merged words from XML tables

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Oct 27 2004 - 15:11:03 GMT

I just tried this with 2.5.2, and it appears to split on the tags:

karpet@cartermac 271% swish-e -i test.xml -v 3 -c c
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "test.xml"

Checking file "test.xml"...
   test.xml - Using XML2 parser -  (3 words)

...

karpet@cartermac 272% swish-e -T index_all


-----> WORD INFO in index index.swish-e <-----

559999
  Meta:1 test.xml Freq:1 Pos/Struct:5/1

some
  Meta:1 test.xml Freq:1 Pos/Struct:10/1

text
  Meta:1 test.xml Freq:1 Pos/Struct:11/1


karpet@cartermac 273% cat test.xml
<row><entry><para>559999</para></entry><entry><para>Some 
text</para></entry></row>

karpet@cartermac 274% cat c
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/$
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz_-/
MinWordLimit 1
IndexContents HTML* .html
IndexContents XML* .xml
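
To illustrate the difference between the two behaviours, here is a rough Python sketch. It is not how swish-e's XML2 parser is implemented; it only assumes that a word is a run of the characters listed in WordCharacters above, and it ignores BeginCharacters, EndCharacters and MinWordLimit. The function names are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a "word" is a run of the WordCharacters from the config above.
WORD_RE = re.compile(r"[0-9a-z._\-/$]+")


def words_split_on_tags(xml_text):
    """Tokenize each text node separately, so a tag always ends a word."""
    root = ET.fromstring(xml_text)
    words = []
    for chunk in root.itertext():
        words.extend(WORD_RE.findall(chunk.lower()))
    return words


def words_merged_across_tags(xml_text):
    """Glue all text nodes together first, so words can run across tags."""
    root = ET.fromstring(xml_text)
    merged = "".join(root.itertext())
    return WORD_RE.findall(merged.lower())


sample = (
    "<row><entry><para>559999</para></entry>"
    "<entry><para>Some text</para></entry></row>"
)

print(words_split_on_tags(sample))       # ['559999', 'some', 'text']
print(words_merged_across_tags(sample))  # ['559999some', 'text']
```

The first function matches the 2.5.2 output shown above (three words); the second reproduces the merged word reported with 2.4.0pr1.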


Peter Karman wrote on 10/27/04 9:16 AM:

> looks like your word position isn't getting incremented?
> 
> have you tried a newer release? 2.4.0pr1 is old and that may be fixed in 
> a newer version (it was "pre release" after all).
> 
> Stein-Egil Museus wrote on 10/27/04 1:43 AM:
> 
> 
>>Hi
>>
>>I am trying to index some XML files that contain tables with swish-e 2.4.0.pr1, and I get the following erroneous output.
>>
>>Here is a fragment of an XML file:
>>
>><row><entry><para>559999</para></entry><entry><para>Some text</para></entry><row>
>>
>>This produces the words '559999Some' and 'text' in the index.
>>
>>My config file looks like this:
>>
>>IndexContents HTML* .htm .html .shtml
>>
>>IndexContents XML* .xml
>>
>>IndexDir ./
>>
>>IndexOnly .html .htm .xml
>>
>>IndexFile ./text.index
>>
>>What is wrong?
>>
>>/Stein-Egil

-- 
Peter Karman  .  http://www.cray.com/craydoc/ .  karman(at)not-real.cray.com
Received on Wed Oct 27 08:11:04 2004