Hi
The error report seems to be related to the *directory name* itself. I
worked this out as follows:
1. In web_1.conf, I replaced the search directive
SwishProgParameters default http://localhost:104
with two URLs, one for a specific file and one for the directory that
contains that file:
SwishProgParameters default http://localhost:104/_docs/test3
http://localhost:104/_docs/test3/Reception-duties.doc
2. There is a second file in /test3: a copy of Reception-duties.doc
renamed to Reception-duties-2.doc
3. This is the output
(and is it normal for swish-e to report that it is indexing 'Data Source'
and "spider.pl"?):
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from 'default'
Summary for: http://localhost:104/_docs/test3/Reception-duties.doc
Connection: Close: 1 (1.0/sec)
Total Bytes: 1,217 (1217.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
application/msword->text/plain: 1 (1.0/sec)
Warning: document 'http://localhost:104/_docs/test3/' could not be encoded
to charset 'ISO-8859-1'
Summary for: http://localhost:104/_docs/test3
Connection: Close: 1 (1.0/sec)
Connection: Keep-Alive: 2 (2.0/sec)
Duplicates: 1 (1.0/sec)
Location Redirects: 1 (1.0/sec)
Off-site links: 5 (5.0/sec)
Total Bytes: 2,307 (2307.0/sec)
Total Docs: 3 (3.0/sec)
Unique URLs: 4 (4.0/sec)
application/msword->text/plain: 1 (1.0/sec)
text/html: 2 (2.0/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 145 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
145 unique words indexed.
5 properties sorted.
4 files indexed. 3,524 total bytes. 450 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!
4. Search works:
swish-e -w opening
# SWISH format: 2.4.7
# Search words: opening
# Removed stopwords:
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.064 seconds
1000 http://localhost:104/_docs/test3/Reception-duties-2.doc
"Reception-duties-2.doc" 1217
1000 http://localhost:104/_docs/test3/Reception-duties.doc
"Reception-duties.doc" 1217
Thanks
Dr Michael Daly wrote on 3/14/12 7:40 AM:
> Maybe this is related to my previous problem, maybe not:
The .xls file errors are probably related.
> Whereby the content of web_1.conf is:
> IndexDir spider.pl
> SwishProgParameters default http://localhost:104
> StoreDescription TXT 200
> StoreDescription HTML <body> 200
>
> invoking this via:
> # swish-e -S prog -c
> /share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf
>
> outputs:
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /opt/lib/swish-e/spider.pl
> Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
> Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
> /opt/lib/swish-e/spider.pl: Reading parameters from 'default'
> Warning: document 'http://localhost:104' could not be encoded to charset
> 'ISO-8859-1'
Break it down to one file and see if you can isolate the problem. E.g. if you
can fetch http://localhost:104 and write its contents to a file and then index
that file directly with swish-e, then you know the problem is in the spider
config. If you can't index the file with swish-e, then you know the problem is
in your swish-e config and/or your document.
Encoding problems are common. Make sure your content is ISO-8859-1 or some other
single-byte encoding, or is UTF-8 and be prepared that swish-e will convert it
to 8859 internally when indexing.
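As a rough illustration of the isolation step, here is a minimal Python sketch
(not part of swish-e or spider.pl) that saves a fetched page to a file and lists
any characters that ISO-8859-1 cannot represent. The function names and the
output file name are hypothetical, and it assumes the fetched bytes are UTF-8
text, which may not be true for your server.

```python
import urllib.request


def fetch_to_file(url, path):
    """Download url, write the raw bytes to path, and return the bytes."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)
    return data


def chars_outside_latin1(data, source_encoding="utf-8"):
    """Return the sorted unique characters that ISO-8859-1 cannot encode.

    Assumption: data is text in source_encoding. Bytes that do not decode
    are replaced with U+FFFD, which is itself reported as a problem.
    """
    text = data.decode(source_encoding, errors="replace")
    # ISO-8859-1 covers exactly the code points U+0000 through U+00FF.
    return sorted({ch for ch in text if ord(ch) > 0xFF})


# Example use (hypothetical output file name):
#   data = fetch_to_file("http://localhost:104/_docs/test3/", "test3-listing.html")
#   print([hex(ord(ch)) for ch in chars_outside_latin1(data)])
```

An empty list suggests the page should survive conversion to ISO-8859-1; a
non-empty list points at characters (curly quotes, for instance) that could
explain a "could not be encoded" warning. The saved file can then be indexed
directly with swish-e to see whether the warning follows the document or the
spider.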
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Mar 15 2012 - 02:01:25 GMT