Filtering Excel files

From: Bucharow Leonard <Leonard.Bucharow(at)not-real.DLE-M.Bayern.de>
Date: Thu Sep 04 2003 - 11:32:14 GMT
Hi all,

I haven't given up on indexing Excel files, and I have now tried indexing
with -S fs and swish_filter.pl.
It seems that only PDF is parsed by swish_filter.pl. See the debugging
output below.

There seems to be a similar problem with spider.pl.

What am I doing wrong? What am I missing? Does anybody have a solution?

Thanks for your help
Leo

swish@bza141:~> swish-e -S prog -c /home/swish/swish-e/conf/swish.conf.test -T indexed_words
Indexing Data Source: "External-Program"
Indexing "spider.pl"
/home/swish/swish-e/lib/swish-e/spider.pl: Reading parameters from '/home/swish/swish-e/conf/SpiderConfigTest.pl'

 -- Starting to spider: http://bza141/test/excel.xls --
>> +Fetched 0 Cnt: 1 http://bza141/test/excel.xls 200 OK application/vnd.ms-excel 13824 parent:
http://bza141/test/excel.xls - Using HTML2 parser -
Adding:[1:swishdocpath(11)]   'http'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'bza141'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'test'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'excel'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'xls'   Pos:5  Stuct:0x1 ( FILE )

Summary for: http://bza141/test/excel.xls
Total Bytes: 13,824  (13824.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)
    Adding:[1:swishdefault(1)]   ''   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:3  Stuct:0x9 ( BODY FILE )
 (2 words)


swish@bza141:~> swish-e -S fs -c /home/swish/swish-e/conf/swish.conf.local -T indexed_words
Indexing Data Source: "File-System"
Indexing "/home/swish/swish-e/test"

Checking dir "/home/swish/swish-e/test"...
  excel.xls - Using TXT2 parser -     Adding:[1:swishdocpath(11)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'swish'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'swish'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'e'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'test'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'excel'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdocpath(11)]   'xls'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   ''   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'excel'   Pos:4  Stuct:0x1 ( FILE )
 (4 words)
  text.txt - Using DEFAULT (HTML2) parser -     Adding:[2:swishdocpath(11)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'swish'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'swish'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'e'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'test'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'text'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[2:swishdocpath(11)]   'txt'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[2:swishdefault(1)]   'this'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'is'   Pos:3  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'the'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'text'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'for'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'test'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'indexing'   Pos:8  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'with'   Pos:9  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'swish'   Pos:10  Stuct:0x9 ( BODY FILE )
    Adding:[2:swishdefault(1)]   'e'   Pos:11  Stuct:0x9 ( BODY FILE )
 (10 words)
  hyper.htm - Using DEFAULT (HTML2) parser -     Adding:[3:swishdocpath(11)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'swish'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'swish'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'e'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'test'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'hyper'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[3:swishdocpath(11)]   'htm'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[3:author(13)]   'leonard'   Pos:2  Stuct:0x85 ( META HEAD FILE )
    Adding:[3:author(13)]   'bucharow'   Pos:3  Stuct:0x85 ( META HEAD FILE )
    Adding:[3:swishdefault(1)]   'mytext'   Pos:6  Stuct:0x9 ( BODY FILE )
 (3 words)
  word.doc - Using TXT2 parser -     Adding:[4:swishdocpath(11)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'swish'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'swish'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'e'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'test'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'word'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[4:swishdocpath(11)]   'doc'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   ''   Pos:1  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   ''   Pos:2  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   ''   Pos:3  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   ''   Pos:4  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   '9'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[4:swishdefault(1)]   '0'   Pos:6  Stuct:0x1 ( FILE )
 (6 words)
  acrobat.pdf - Using HTML2 parser -     Adding:[5:swishdocpath(11)]   'home'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'swish'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'swish'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'e'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'test'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'acrobat'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[5:swishdocpath(11)]   'pdf'   Pos:7  Stuct:0x1 ( FILE )
 - Filtered: /home/swish/swish-e/test/acrobat.pdf
    Adding:[5:author(13)]   'leonard'   Pos:2  Stuct:0x85 ( META HEAD FILE )
    Adding:[5:author(13)]   'bucharow'   Pos:3  Stuct:0x85 ( META HEAD FILE )
    Adding:[5:swishdefault(1)]   'thu'   Pos:7  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'aug'   Pos:8  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '21'   Pos:9  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '14'   Pos:10  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '27'   Pos:11  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '49'   Pos:12  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '2003'   Pos:13  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'acrobat'   Pos:16  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'pdfmaker'   Pos:17  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '5'   Pos:18  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '0'   Pos:19  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'fr'   Pos:20  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'word'   Pos:21  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'no'   Pos:24  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '25707'   Pos:27  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'bytes'   Pos:28  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'thu'   Pos:31  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'aug'   Pos:32  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '21'   Pos:33  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '14'   Pos:34  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '27'   Pos:35  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '53'   Pos:36  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '2003'   Pos:37  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'yes'   Pos:40  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '595'   Pos:43  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'x'   Pos:44  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '842'   Pos:45  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'pts'   Pos:46  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'a4'   Pos:47  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '1'   Pos:50  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '1'   Pos:53  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '3'   Pos:54  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'acrobat'   Pos:57  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'distiller'   Pos:58  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '5'   Pos:59  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   '0'   Pos:60  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'windows'   Pos:61  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'yes'   Pos:64  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'eins'   Pos:67  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'zwei'   Pos:68  Stuct:0x5 ( HEAD FILE )
    Adding:[5:swishdefault(1)]   'drei'   Pos:69  Stuct:0x5 ( HEAD FILE )
 (43 words)
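
In the -S fs output above, the only document followed by a "- Filtered:"
line is acrobat.pdf; excel.xls and word.doc go straight to the TXT2 parser.
To check this on a longer run, the debug output can be scanned
mechanically. The following is only a rough Python sketch: it assumes that
a "- Filtered:" line appears whenever a filter handled the file, and the
names filter_report and sample are made up for illustration and are not
part of swish-e.

```python
import re

# Start of a per-document section in `-T indexed_words` output,
# e.g. "  excel.xls - Using TXT2 parser -".
DOC_RE = re.compile(r"^\s*(\S+) - Using (.+?) parser -")
# Line assumed to be printed when a filter produced the indexed content.
FILTERED_RE = re.compile(r"^\s*- Filtered: (\S+)")


def filter_report(log_text):
    """Map each document name to its parser and whether it was filtered."""
    report = {}
    current = None
    for line in log_text.splitlines():
        doc = DOC_RE.match(line)
        if doc:
            # A new document section begins; assume unfiltered until told otherwise.
            current = doc.group(1)
            report[current] = {"parser": doc.group(2), "filtered": False}
            continue
        if current and FILTERED_RE.match(line):
            report[current]["filtered"] = True
    return report


# Hypothetical excerpt, shortened from the debug output in this message.
sample = """\
  excel.xls - Using TXT2 parser -     Adding:[1:swishdocpath(11)]   'home'
    Adding:[1:swishdefault(1)]   'excel'   Pos:4  Stuct:0x1 ( FILE )
  acrobat.pdf - Using HTML2 parser -     Adding:[5:swishdocpath(11)]
 - Filtered: /home/swish/swish-e/test/acrobat.pdf
"""

for name, info in filter_report(sample).items():
    status = "filtered" if info["filtered"] else "NOT filtered"
    print(name, info["parser"], status)
```

Saving the output of a real run to a file and passing its contents to
filter_report would list every document that never reached a filter.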
Received on Thu Sep 4 11:33:13 2003