Re: Error Message: Index file error: Could not open

From: Kaplan, Andrew H. <AHKAPLAN(at)not-real.PARTNERS.ORG>
Date: Mon Jun 07 2004 - 13:05:20 GMT

Hi there --

Here is the output I got after adding ParserWarnLevel 9 to the swish.conf
file:

ahk@radonckb:/www> sudo /usr/local/bin/swish-e -c /www/swish.conf -v 3
Parsing config file '/www/swish.conf'
Indexing Data Source: "File-System"
Indexing "/www"

Checking dir "/www"...
  Zmed Intracranial and Head  Neck Modules
4-04.pdfhttp://132.183.12.176/Zmed Intracranial and Head  Neck Modules
4-04.pdf:4: error: htmlParseStartTag: invalid element name
<</Length 6 0 R/Filter /FlateDecode>>
 ^
 - Using DEFAULT (HTML2) parser -  (15 words)
  index.swish-e.temp - Using DEFAULT (HTML2) parser -  (3 words)
  image1.jpg - Using DEFAULT (HTML2) parser -  (1 words)
  image2.jpg - Using DEFAULT (HTML2) parser -  (1 words)
  Mass General Zmed SAPIC quote 5-23-04 -68.pdfhttp://132.183.12.176/Mass
General Zmed SAPIC quote 5-23-04 -68.pdf:4: error: htmlParseStartTag:
invalid element name
<</Length 6 0 R/Filter /FlateDecode>>
 ^
http://132.183.12.176/Mass General Zmed SAPIC quote 5-23-04 -68.pdf:6:
error: htmlParseEntityRef: expecting ';'
xí]¶}¯àCª"We         äMcKò&¿º¬h#ë[ù1ÿ?(
                                                 ^
 - Using DEFAULT (HTML2) parser -  (22 words)
  index.html.enhttp://132.183.12.176/index.html.en:1: error:
htmlParseStartTag: invalid element name
<?xml version="1.0" encoding="iso-8859-1"?>
^
http://132.183.12.176/index.html.en:2: error: Misplaced DOCTYPE declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.o
^
 - Using DEFAULT (HTML2) parser -  (175 words)
  Zmed SonArray Plus 5-04.pdfhttp://132.183.12.176/Zmed SonArray Plus
5-04.pdf:4: error: htmlParseStartTag: invalid element name
<</Length 6 0 R/Filter /FlateDecode>>
 ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error:
htmlParseEntityRef: no name
ïù9ê%V«èì?XPÿl¹
                               3é$ü
                                       k£Nâx~¨#¥½^
 
^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error: Tag  invalid
fZTp{< ¸ÿ¼6\kñR$ì^·QS]èc}½b¦ð
            ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error: Couldn't find
end of Start Tag 
fZTp{< ¸ÿ¼6\kñR$ì^·QS]èc}½b¦ð
            ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:13: error:
htmlParseStartTag: invalid element name
<</Type/Page/MediaBox [0 0 612 792]
 ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:24: error:
htmlParseStartTag: invalid element name
<< /Type /Pages /Kids [
 ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:30: error:
htmlParseStartTag: invalid element name
<</Type /Catalog /Pages 3 0 R
 ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:34: error:
htmlParseStartTag: invalid element name
<</Type/ExtGState/Name/R9/TR/Identity/BG 7 0 R/UCR 8 0 R/OPM 1/SM 0.02>>
 ^
http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:37: error:
htmlParseStartTag: invalid element name
<</Subtype/Image
 ^
 - Using DEFAULT (HTML2) parser -  (72 words)
  howtopage.htmhttp://132.183.12.176/howtopage.htm:1: error:
htmlParseStartTag: invalid element name
<?xml version="1.0" encoding="iso-8859-1"?>
^
http://132.183.12.176/howtopage.htm:2: error: Misplaced DOCTYPE declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.o
^
 - Using DEFAULT (HTML2) parser -  (191 words)
  radonckbmain.htmhttp://132.183.12.176/radonckbmain.htm:1: error:
htmlParseStartTag: invalid element name
<?xml version="1.0" encoding="iso-8859-1"?>
^
http://132.183.12.176/radonckbmain.htm:2: error: Misplaced DOCTYPE
declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.o
^
 - Using DEFAULT (HTML2) parser -  (175 words)
  index.swish-e.prop.temp - Using DEFAULT (HTML2) parser -  (2 words)
  tmi03final.pdfhttp://132.183.12.176/tmi03final.pdf:2: error:
htmlParseStartTag: invalid element name
3 0 obj <<
         ^
 - Using DEFAULT (HTML2) parser -  (13 words)
  swish.confhttp://132.183.12.176/swish.conf:2: error: htmlParseStartTag:
misplaced <body> tag
StoreDescription HTML* <body> 200000
                            ^
 - Using DEFAULT (HTML2) parser -  (18 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 291 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
291 unique words indexed.
5 properties sorted.
12 files indexed.  5,756,864 total bytes.  803 total words.
Elapsed time: 00:00:02 CPU time: 00:00:00
Indexing done!
ahk@radonckb:/www>

-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Bill Moseley
Sent: Sunday, June 06, 2004 12:12 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Error Message: Index file error: Could not open


On Sun, Jun 06, 2004 at 11:56:53AM -0400, Kaplan, Andrew H. wrote:
> Hi there --
> 
> I'm sorry for sounding stupid, but could you elaborate on making sure
> that "Head" is in the index? Also, aside from the cgi script, what is
> the command syntax I would use to search the index? Thanks.

So, the situation is you index some files and then you search for "head"
and it says "no results" but you are sure it should be found because you
know it's in the file "body_parts.html".

So then you run swish like this:

    swish-e -c myconfig -i body_parts.html -T indexed_words | grep head

and you see something like:

       Adding:[1:swishdefault(1)]   'head'   Pos:24  Stuct:0x9 ( BODY FILE )

which says the word "head" was indexed in file number 1 under metaname
"swishdefault" at word position number 24 and is in the BODY of the
document.
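
The fields of that debug line can also be pulled apart mechanically. This is
only a minimal sketch, assuming the line has exactly the shape shown above;
the function name parse_adding_line is hypothetical and not part of swish-e.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): split one "Adding:" debug line
# into file number, metaname, word, and word position.
# Assumes the line format shown in the example above; prints nothing
# for lines that do not match.
parse_adding_line() {
    printf '%s\n' "$1" | sed -n \
        "s/^ *Adding:\[\([0-9][0-9]*\):\([^(]*\)([0-9]*)] *'\([^']*\)' *Pos:\([0-9][0-9]*\).*/file=\1 metaname=\2 word=\3 pos=\4/p"
}

parse_adding_line "       Adding:[1:swishdefault(1)]   'head'   Pos:24  Stuct:0x9 ( BODY FILE )"
```

For the sample line this prints the file number, metaname, word, and position
as key=value pairs, which is easier to scan when many words are listed.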

Then you know you can do:

    swish-e -w head
or
    swish-e -w swishdefault=(head)

and swish-e will find it.

Now, if you don't see "head" in the output you then look at why it's not
getting indexed.  What I'd likely do is run it without grep:

    swish-e -c myconfig -i body_parts.html -T indexed_words | less

and then look for words that you know are around "head" in the document
and that might give you an idea what to look for.
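
If the output is long, saving it to a file and searching for a neighbouring
word with a few lines of context may be quicker than paging. A rough sketch,
assuming the debug output has been redirected to a file; the helper name
context_for_word, the file name words.txt, and the word "neck" are all
hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): print the debug lines for a word
# together with two lines of context on either side.
#   $1 = file holding saved "-T indexed_words" output
#   $2 = word that appears near the missing one
# Assumes a grep that supports -C (GNU and BSD grep both do).
context_for_word() {
    grep -n -i -C 2 -e "'$2'" "$1"
}

# Typical use (needs swish-e and a real config, so left commented out):
#   swish-e -c myconfig -i body_parts.html -T indexed_words > words.txt
#   context_for_word words.txt neck
```

The single quotes in the pattern match the quoting swish-e puts around each
indexed word, so a search for "neck" does not also hit longer words that
merely contain it.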

Maybe you have a format error in body_parts.html?  Adding to your swish
config file:

    ParserWarnLevel 9

might generate some warnings about the structure of your document.

Maybe "head" is in an HTML comment?  Then you need to enable indexing of
comments.

Maybe the above all works fine, but when spidering, the file is skipped?
If that's the case then you need to figure out why.  spider.pl has
debugging features to tell you why a file is skipped.

The answer is divide et impera.



-- 
Bill Moseley
moseley@hank.org
Received on Mon Jun 7 06:05:20 2004