Hello there,
I've just installed Swish-e 2.4.5 on our server (FreeBSD OS). I am trying
to index over 100,000 HTML documents. Each document begins with extraneous
tags such as the following:
<DOCUMENT><FILENAME><DESCRIPTION><SEQUENCE>
I realize these are not valid HTML tags, but I didn't write these HTML docs.
Unfortunately I cannot change the original HTML docs to remove these tags
(or to insert <META> around them), so I'm looking for a way to get Swish-e
to ignore them.
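
One possible workaround, if the source files really cannot be touched, is to clean each document on the fly before the HTML parser sees it. The following is only a minimal sketch in Python, not a Swish-e feature: the function name strip_wrapper and the WRAPPER_TAGS list are hypothetical, the tag names are taken from this message and its error output, and it assumes that the wrapper lines all come before the first <HTML> tag and that any matching close tags (such as </TEXT>) may follow the document. How to hook it into indexing (for example as a filter or an external program) is not shown here.

```python
import re

# Hypothetical list of wrapper tags, taken from the post and its error output.
WRAPPER_TAGS = ("DOCUMENT", "TYPE", "SEQUENCE", "FILENAME", "DESCRIPTION", "TEXT")

# Matches the start of the real HTML document, in any letter case.
_HTML_START = re.compile(r"<html\b", re.IGNORECASE)

# Matches closing wrapper tags such as </TEXT> or </DOCUMENT> (assumed to exist).
_WRAPPER_CLOSE = re.compile(r"</(?:%s)>" % "|".join(WRAPPER_TAGS), re.IGNORECASE)


def strip_wrapper(text):
    """Drop everything before the first <html> tag and remove wrapper close tags."""
    match = _HTML_START.search(text)
    if match is None:
        # No <html> tag found; leave the document untouched.
        return text
    body = text[match.start():]
    return _WRAPPER_CLOSE.sub("", body)
```

Note that this discards the text that follows the wrapper tags (the description, sequence number and so on), so it only fits if that text does not need to be searchable.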
I've spent some quality time with the Swish-e documentation and archives,
but everything seems to reference either ignoring meta tags (and these are
not meta tags) or ignoring specific tags while using the XML parser (but I
assume I need the HTML parser).
I've tried the following in the config file (swish.conf): first with
IgnoreWords by itself, then with IgnoreMetaTags by itself, then with
UndefinedMetaTags added. I get exactly the same errors each time. I also
tried commenting out "DefaultContents HTML*" and got the same errors
(shown at the bottom of this message).
# Tell swish-e what to index
IndexDir /usr/local/apache/htdocs/documents/
# Only index HTML files
IndexOnly .htm .html
# Use the HTML parser
DefaultContents HTML*
# Ignore words list
IgnoreWords /usr/local/apache/swish-e-2.4.5/ignorewords.txt
# Ignore certain tags
IgnoreMetaTags DOCUMENT FILENAME DESCRIPTION SEQUENCE
UndefinedMetaTags ignore
I continue to get the following error messages:
/usr/local/apache/htdocs/documents/doc.htm:1: error: Tag document invalid
<DOCUMENT>
^
/usr/local/apache/htdocs/documents/doc.htm:2: error: Tag type invalid
<TYPE>Type text here
^
/usr/local/apache/htdocs/documents/doc.htm:3: error: Tag sequence invalid
<SEQUENCE>4
^
/usr/local/apache/htdocs/documents/doc.htm:4: error: Tag filename invalid
<FILENAME>doc.htm
^
/usr/local/apache/htdocs/documents/doc.htm:5: error: Tag description invalid
<DESCRIPTION>Description here
^
/usr/local/apache/htdocs/documents/doc.htm:6: error: Tag text invalid
<TEXT>
^
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <html> tag
<HTML><HEAD>
^
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <head> tag
<HTML><HEAD>
^
/usr/local/apache/htdocs/documents/doc.htm:9: error: Unexpected end tag : head
</HEAD>
^
/usr/local/apache/htdocs/documents/doc.htm:10: error: htmlParseStartTag: misplaced <body> tag
<BODY BGCOLOR="WHITE">
^
Thanks so much for any help you can provide!
Best Regards,
Kathleen
Received on Thu Mar 1 17:55:00 2007