Re: Error Message: Index file error: Could not open

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Jun 07 2004 - 14:31:52 GMT
On Mon, Jun 07, 2004 at 09:03:37AM -0400, Kaplan, Andrew H. wrote:
> Hi there --
> 
> Here is the output that I encountered when ParserWarnLevel 9 was added to
> the swish.conf file:

What do you see when you look at those messages?


 
> ahk@radonckb:/www> sudo /usr/local/bin/swish-e -c /www/swish.conf -v 3

Is that root?  Please don't run as root.

> Checking dir "/www"...
>   Zmed Intracranial and Head  Neck Modules
> 4-04.pdfhttp://132.183.12.176/Zmed Intracranial and Head  Neck Modules
> 4-04.pdf:4: error: htmlParseStartTag: invalid element name
> <</Length 6 0 R/Filter /FlateDecode>>
>  ^
>  - Using DEFAULT (HTML2) parser -  (15 words)
>   index.swish-e.temp - Using DEFAULT (HTML2) parser -  (3 words)
>   image1.jpg - Using DEFAULT (HTML2) parser -  (1 words)
>   image2.jpg - Using DEFAULT (HTML2) parser -  (1 words)

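Well, last time I checked jpegs were not text.  Nor are pdfs.  The HTML
parser is being handed binary data, which is why you see those errors.
Files like that have to be converted to text by a filter before they are
indexed.  You can set that up within Swish, but in your case I'd just run
the spider and let it do the filtering: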

    /usr/local/lib/swish-e/spider.pl your.spider.config.file > output.txt

Then look at output.txt.  Your path to spider.pl might differ from the
one I showed there.
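
If eyeballing a large output.txt is awkward, a couple of small shell
helpers can summarize it.  This is only a sketch: the function names are
made up for illustration, and it assumes output.txt uses the -S prog
stream layout, where each document starts with a "Path-Name:" header
line.  It also relies on grep's -a option (treat binary data as text),
which GNU and BSD grep provide.

```shell
# Hypothetical helper: print the path of every document in the spider output.
# Assumes each document begins with a "Path-Name:" header line.
list_spidered_paths() {
    grep -a '^Path-Name:' "$1" | sed 's/^Path-Name: *//'
}

# Hypothetical helper: count lines that start with the PDF magic string.
# A non-zero count suggests a PDF reached the output without being converted.
# "|| true" keeps the exit status at 0 when grep finds no matches.
count_raw_pdfs() {
    grep -a -c '^%PDF-' "$1" || true
}
```

If count_raw_pdfs output.txt prints anything other than 0, the PDF filter
probably did not run for at least one file.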

You can then index that content by:

    swish-e -c my_config -S prog -i stdin < output.txt

or if you like:

    cat output.txt | swish-e -c my_config -S prog -i stdin
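
To test the indexing half without the spider at all, you can build a
one-document stream by hand.  A minimal sketch, assuming the -S prog
input format is a block of header lines (Path-Name: and Content-Length:,
the latter in bytes), then a blank line, then the content.  emit_doc is a
made-up name, not part of swish-e.

```shell
# Hypothetical helper: wrap one file in the headers that "-S prog" input
# is assumed to need.
#   $1 = name to record for the document (for example its URL)
#   $2 = local file holding the already-converted text or HTML
emit_doc() {
    # wc -c counts bytes; tr strips the padding some wc versions add.
    size=$(wc -c < "$2" | tr -d ' ')
    printf 'Path-Name: %s\n' "$1"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'          # blank line separates headers from content
    cat "$2"
}
```

Usage would look something like:

    emit_doc http://localhost/test.html test.html | swish-e -c my_config -S prog -i stdin

If that indexes cleanly but the spider output does not, the problem is in
the fetch-and-filter stage rather than in the indexing.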

Anyway, you need to start working a bit more systematically.  I can help
explain how the parts fit together (or maybe I already did), but then
you are going to have to figure out what's not working yourself.

Get the spider to fetch and filter (convert from pdf) a single test file
first.  Use the spider config options to limit it to maybe one or two
files.  Then test as I explained before.





>   Mass General Zmed SAPIC quote 5-23-04 -68.pdfhttp://132.183.12.176/Mass
> General Zmed SAPIC quote 5-23-04 -68.pdf:4: error: htmlParseStartTag:
> invalid element name
> <</Length 6 0 R/Filter /FlateDecode>>
>  ^
> http://132.183.12.176/Mass General Zmed SAPIC quote 5-23-04 -68.pdf:6:
> error: htmlParseEntityRef: expecting ';'
> xí]¶}¯àCª"We         äMcKò&¿º¬h#ë[ù1ÿ?(
>                                                  ^
>  - Using DEFAULT (HTML2) parser -  (22 words)
>   index.html.enhttp://132.183.12.176/index.html.en:1: error:
> htmlParseStartTag: invalid element name
> <?xml version="1.0" encoding="iso-8859-1"?>
> ^
> http://132.183.12.176/index.html.en:2: error: Misplaced DOCTYPE declaration
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.o
> ^
>  - Using DEFAULT (HTML2) parser -  (175 words)
>   Zmed SonArray Plus 5-04.pdfhttp://132.183.12.176/Zmed SonArray Plus
> 5-04.pdf:4: error: htmlParseStartTag: invalid element name
> <</Length 6 0 R/Filter /FlateDecode>>
>  ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error:
> htmlParseEntityRef: no name
> ïù9ê%V«èì?XPÿl¹
>                                3é$ü
>                                        k£Nâx~¨#¥½^
>  
> ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error: Tag  invalid
> fZTp{< ¸ÿ¼6\kñR$ì^·QS]èc}½b¦ð
>             ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:7: error: Couldn't find
> end of Start Tag 
> fZTp{< ¸ÿ¼6\kñR$ì^·QS]èc}½b¦ð
>             ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:13: error:
> htmlParseStartTag: invalid element name
> <</Type/Page/MediaBox [0 0 612 792]
>  ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:24: error:
> htmlParseStartTag: invalid element name
> << /Type /Pages /Kids [
>  ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:30: error:
> htmlParseStartTag: invalid element name
> <</Type /Catalog /Pages 3 0 R
>  ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:34: error:
> htmlParseStartTag: invalid element name
> <</Type/ExtGState/Name/R9/TR/Identity/BG 7 0 R/UCR 8 0 R/OPM 1/SM 0.02>>
>  ^
> http://132.183.12.176/Zmed SonArray Plus 5-04.pdf:37: error:
> htmlParseStartTag: invalid element name
> <</Subtype/Image
>  ^
>  - Using DEFAULT (HTML2) parser -  (72 words)
>   howtopage.htmhttp://132.183.12.176/howtopage.htm:1: error:
> htmlParseStartTag: invalid element name
> <?xml version="1.0" encoding="iso-8859-1"?>
> ^
> http://132.183.12.176/howtopage.htm:2: error: Misplaced DOCTYPE declaration
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.o
> ^
>  - Using DEFAULT (HTML2) parser -  (191 words)
>   radonckbmain.htmhttp://132.183.12.176/radonckbmain.htm:1: error:
> htmlParseStartTag: invalid element name
> <?xml version="1.0" encoding="iso-8859-1"?>
> ^
> http://132.183.12.176/radonckbmain.htm:2: error: Misplaced DOCTYPE
> declaration
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.o
> ^
>  - Using DEFAULT (HTML2) parser -  (175 words)
>   index.swish-e.prop.temp - Using DEFAULT (HTML2) parser -  (2 words)
>   tmi03final.pdfhttp://132.183.12.176/tmi03final.pdf:2: error:
> htmlParseStartTag: invalid element name
> 3 0 obj <<
>          ^
>  - Using DEFAULT (HTML2) parser -  (13 words)
>   swish.confhttp://132.183.12.176/swish.conf:2: error: htmlParseStartTag:
> misplaced <body> tag
> StoreDescription HTML* <body> 200000
>                             ^
>  - Using DEFAULT (HTML2) parser -  (18 words)
> 
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 291 words alphabetically
> Writing header ...
> Writing index entries ...
>   Writing word text: Complete
>   Writing word hash: Complete
>   Writing word data: Complete
> 291 unique words indexed.
> 5 properties sorted.
> 12 files indexed.  5,756,864 total bytes.  803 total words.
> Elapsed time: 00:00:02 CPU time: 00:00:00
> Indexing done!
> ahk@radonckb:/www>
> 
> -----Original Message-----
> From: swish-e@sunsite.berkeley.edu
> [mailto:swish-e@sunsite.berkeley.edu]On Behalf Of Bill Moseley
> Sent: Sunday, June 06, 2004 12:12 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Error Message: Index file error: Could not open
> 
> 
> On Sun, Jun 06, 2004 at 11:56:53AM -0400, Kaplan, Andrew H. wrote:
> > Hi there --
> > 
> > I'm sorry for sounding stupid, but could you elaborate on making sure
> > that "Head" is in the index? Also, aside from the cgi script, what is
> > the command syntax I would use to search the index? Thanks.
> 
> So, the situation is you index some files and then you search for "head"
> and it says "no results" but you are sure it should be found because you
> know it's in the file "body_parts.html".
> 
> So then you run swish like this:
> 
>     swish-e -c myconfig -i body_parts.html -T indexed_words | grep head
> 
> and you see something like:
> 
>        Adding:[1:swishdefault(1)]   'head'   Pos:24  Stuct:0x9 ( BODY FILE )
> 
> which says the word "head" was indexed in file number 1 under metaname
> "swishdefault" at word position number 24 and is in the BODY of the
> document.
> 
> Then you know you can do:
> 
>     swish-e -w head
> or
>     swish-e -w swishdefault=(head)
> 
> and swish-e will find it.
> 
> Now, if you don't see "head" in the output you then look at why it's not
> getting indexed.  What I'd likely do is run without grep
> 
>     swish-e -c myconfig -i body_parts.html -T indexed_words | less
> 
> and then look for words that you know are around "head" in the document
> and that might give you an idea what to look for.
> 
> Maybe you have a format error in body_parts.html?  Adding to your swish
> config file:
> 
>     ParserWarnLevel 9
> 
> might generate some warnings about the structure of your document.
> 
> Maybe "head" is in an HTML comment?  Then you need to enable indexing of
> comments.
> 
> Maybe the above all works fine, but when spidering the file is skipped?
> If that's the case then you need to figure out why.  spider.pl has
> debugging features to tell you why a file is skipped.
> 
> The answer is divide et impera.
> 
> 
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 

-- 
Bill Moseley
moseley@hank.org
Received on Mon Jun 7 07:31:53 2004