On Fri, Dec 02, 2005 at 01:30:25PM -0800, David Larkin wrote:
> I've got swish-e to index a directory with mixed content (HTML DOC PDF XLS PPT files) and swish.cgi produces half sensible output.
>
> At first it gave "(null)" where I'd expect to see the context of the string I was searching for.
>
> So, I added
>
> StoreDescription HTML* <body> 20000
> StoreDescription TXT* 20000
> StoreDescription XML* <desc> 20000
>
> and the "(null)" disappeared, but still no context
>
> so I added
>
> IndexContents HTML* .htm .html .shtml
> IndexContents TXT* .txt .log .text
> IndexContents XML* .xml
>
> and now I get the context I expect for HTM files.
>
> Can I get it to work for other file types?
>
> The documentation suggests HTML, TXT, and XML are the only legal arguments to StoreDescription.

That allows assigning the IndexContents based on each parser. I'm not
sure it makes much sense.

I assume you are using -S prog for indexing. You should look at what
each file reports in its header.

For example:

~$ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Path-Name: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
Content-Length: 90697
Last-Mtime: 1126391567
Document-Type: HTML* <<<<<<<---- notice this
<html>
<head>
<title>Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc</title>
<meta name="author" content="martin">
<meta name="creationdate" content="Fri Aug 19 13:07:33 2005">
There's a PDF file that was filtered into HTML, so the Document-Type
header is telling swish to use the HTML* parser. That will *override*
anything you set in your swish config file.
So to store the description for that you would need:
StoreDescription HTML* <body>
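
If you want to check which parser a batch of documents will get, a small
script can pull that header out of the -S prog stream. Here is a rough
Python sketch, not anything that ships with swish-e: the function names
are made up for illustration, and it assumes the header block ends at the
first blank line, which is how I understand the -S prog input format.

```python
import io


def read_prog_headers(stream):
    """Read the header lines of one -S prog document from a binary stream.

    Assumption: headers are "Name: value" lines and the block ends at
    the first blank line, after which the document content follows.
    """
    headers = {}
    for raw in stream:
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        if not line:  # blank line: end of headers
            break
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def parser_for(headers, default="(set by config)"):
    """Return the parser named in Document-Type, or a placeholder if absent."""
    return headers.get("Document-Type", default)


# Sample input built from the header lines shown in the spider output.
sample = (
    b"Path-Name: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf\n"
    b"Content-Length: 90697\n"
    b"Last-Mtime: 1126391567\n"
    b"Document-Type: HTML*\n"
    b"\n"
    b"<html>\n"
)
headers = read_prog_headers(io.BytesIO(sample))
print(headers["Path-Name"], "->", parser_for(headers))
```

Any document that reports HTML* there needs the StoreDescription HTML*
line, whatever its original file type was.
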
Again, here you can see that the description is indeed saved:
$ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf | swish-e -c c -S prog -i stdin -v0 -T properties
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
(sorry the spider output and swish-e output are mixed a bit)
Summary for: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
Connection: Close: 1 (1.0/sec)
Total Bytes: 90,697 (90697.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
application/pdf->text/html: 1 (1.0/sec)
swishdocpath: 6 ( 55) S: "file:///home/moseley/050819-securing-mac-os-x-tiger.pdf"
swishtitle: 7 ( 58) S: "Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc"
swishdocsize: 8 ( 4) N: "90697"
swishlastmodified: 9 ( 4) D: "2005-09-10 15:32:47 PDT"
swishdescription:10 (88766) S: "The natural choice for information security solutions A Corsaire White Paper: Securing Mac OS X Author Document Reference Document Revision Date Stephen de Vries Securing Mac OS X 10.4 Tiger v1.0.doc 1.0 Released 19 August 2005 © Copyright 2000 2005 Corsaire Limited All Rights Reserved A Corsaire W ..."
--
Bill Moseley
moseley@hank.org
Received on Fri Dec 2 13:50:09 2005