On Thu, Oct 28, 2004 at 10:13:31AM -0400, Antonio Barrera wrote:
> Would this apply similarly to using xpdf to parse PDF docs?
>
> IndexContents HTML* .htm .html .shtml .php
> IndexContents TXT* .txt .log .text .pdf
> IndexContents XML* .xml
>
> StoreDescription TXT* 10000
> StoreDescription HTML* <body>
Maybe. Depends on how the PDF files are indexed. If you are using
spider.pl (with SWISH::Filter) then the document type is passed
directly to swish:
    $ spider.pl default http://localhost/apache/test.pdf 2>/dev/null | head -5
    Path-Name: http://localhost/apache/test.pdf
    Content-Length: 12589
    Last-Mtime: 1064946675
    Document-Type: HTML*
So that tells swish what type of file is being indexed:
    $ spider.pl default http://localhost/apache/test.pdf 2>/dev/null | swish-e -v9 -i stdin -S prog
    Indexing Data Source: "External-Program"
    Indexing "stdin"
    http://localhost/apache/test.pdf - Using HTML2 parser - (2301 words)
    [...]
See how it says "Using HTML2 parser". Now if you just index a file
without telling swish the parser type, it says:
    $ swish-e -i 1.html -v9
    Indexing Data Source: "File-System"
    Indexing "1.html"
    Checking file "1.html"...
    1.html - Using DEFAULT (HTML2) parser - (12 words)
So it's saying "DEFAULT" there.
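If you want to check this across a whole -v9 run rather than by eye,
something like the following Python sketch might help. The line format
is assumed from the two sample output lines in this message only (other
swish-e versions may print it differently), and parser_used is a
made-up helper name, not part of swish-e:

```python
import re

# Pattern assumed from the two sample lines in this message, e.g.
#   "1.html - Using DEFAULT (HTML2) parser - (12 words)"
#   "http://localhost/apache/test.pdf - Using HTML2 parser - (2301 words)"
PARSER_LINE = re.compile(
    r"^(?P<path>.+?) - Using (?P<default>DEFAULT \()?(?P<parser>\w+)\)?"
    r" parser - \((?P<words>\d+) words\)$"
)


def parser_used(line):
    """Return (path, parser, fell_back_to_default), or None for other lines."""
    match = PARSER_LINE.match(line.strip())
    if match is None:
        return None
    return (
        match.group("path"),
        match.group("parser"),
        match.group("default") is not None,
    )
```

Any file reported with the last field true was indexed without an
explicit type, so that is the one that needs IndexContents or
DefaultContents.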
If you are not using spider.pl or some -S prog program that passes in
the Document-Type: header then, yes, you would need to use
DefaultContents or IndexContents to set the content type.
I guess the reasoning is that StoreDescription works differently for
different types of documents, so swish needs to be told what type each
document is.
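For what it's worth, here is a rough Python sketch of a -S prog feeder
that sends the Document-Type: header itself, using the same four
headers as the spider.pl output above. The extension table and the
function names are assumptions for illustration, not anything swish-e
ships:

```python
#!/usr/bin/env python3
"""Sketch of a -S prog feeder that tells swish-e each document's type."""
import os
import sys

# Hypothetical extension-to-parser table; adjust to your own files.
PARSER_BY_EXT = {
    ".htm": "HTML*",
    ".html": "HTML*",
    ".txt": "TXT*",
    ".xml": "XML*",
}


def format_document(path, content, mtime, doc_type):
    """Return headers, a blank line, then the body, as bytes."""
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(content)}\n"  # length of the body in bytes
        f"Last-Mtime: {int(mtime)}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    )
    return headers.encode("utf-8") + content


def main(paths):
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        doc_type = PARSER_BY_EXT.get(ext)
        if doc_type is None:
            continue  # no parser known for this extension, skip the file
        with open(path, "rb") as fh:
            content = fh.read()
        sys.stdout.buffer.write(
            format_document(path, content, os.path.getmtime(path), doc_type)
        )


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, feeder.py (a made-up name), it would be piped in the
same way as spider.pl above: python3 feeder.py *.html | swish-e -v9 -i
stdin -S prog. For a PDF the body would be whatever text your converter
produces, with the type set to match that output (TXT* for plain text).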
Here's my comment from many years ago:
http://swish-e.org/current/docs/SWISH-3.0.html#Switch_to_Content_Types
--
Bill Moseley
moseley@hank.org
Received on Thu Oct 28 07:29:38 2004