Re: ampersand issue and PDF syntax highlighting

From: Bill Moseley <moseley(at)>
Date: Sun Aug 06 2006 - 04:14:18 GMT

On Sat, Aug 05, 2006 at 03:01:03PM -0700, Eric Jobidon wrote:
> I guess I could grep the output of xpdf and replace "&" with "&amp;", but I
> am thinking there is a much simpler answer to this, either as a config
> directive in .xpdfrc (replacing "&" with "&amp;") or in the swish-e config
> file (maybe treating "&" as a stop word). Any thoughts? Anyone had the same
> issue? How would you fix it?

Can you get xpdf to produce correct HTML in the first place?

Is this the error?

    error: htmlParseEntityRef: no name

That message comes from libxml2, and setting this in your swish-e
config might suppress it:

    ParserWarnLevel 0
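
If you do end up filtering the converter output yourself, something
along these lines might work. It is only a sketch: the function name
is made up for illustration, and it assumes that any "&" not already
starting a named or numeric entity should become "&amp;".

```python
import re

# Matches an "&" that does NOT already begin an entity such as
# "&amp;", "&#38;" or "&#x26;". (Hypothetical helper, not part of
# swish-e or xpdf.)
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html_text):
    """Return html_text with bare ampersands rewritten as &amp;."""
    return _BARE_AMP.sub("&amp;", html_text)
```

Existing entities are left alone, so running the filter twice should
not double-escape anything.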

> 2- (This question pertains to a different environment, on a unix box) I want
> to offer syntax highlighting in the search results page. All the indexed
> docs are PDF files. And they are all fairly large PDF files (some are over
> 100MB in size, and splitting them up is not an option).

Splitting them up while indexing is probably your best, if not only,
option.  Then your search results could be, for example, targeted to
a specific page or chapter of the pdf file, assuming you can figure
out boundaries.
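
As a rough sketch of the per-page idea: pdftotext separates pages
with a form feed character, so the extracted text can be cut into one
record per page. The function name and the "#page=N" URL fragment
are assumptions for illustration; whether a viewer honors that
fragment depends on the PDF reader.

```python
from typing import List, Tuple


def split_pages(text: str, pdf_url: str) -> List[Tuple[str, str]]:
    """Split extracted PDF text into (url, page_text) records.

    Assumes pages are separated by form feeds ("\f"). Empty pages
    are skipped, but page numbers still follow the original PDF.
    """
    records = []
    for number, body in enumerate(text.split("\f"), start=1):
        body = body.strip()
        if body:
            # Hypothetical URL scheme pointing at one page of the PDF.
            records.append((f"{pdf_url}#page={number}", body))
    return records
```

Each record could then be fed to the indexer as its own document, so
a hit points at one page rather than the whole 100MB file.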

> One avenue I am considering is to save the output from xpdf in a html file
> and indexing only that html file (ignoring the PDF altogether for the
> indexing). On a search, the cgi script would then parse the html file,
> highlight the content and display it to the user. The displayed document URL
> could then simply be transformed from ".html" to ".PDF". Think this would
> work? Any other avenue I should explore?

Typically, the content gets stored as a property in the swish-e index
and that's what is displayed for highlighting.  So the format of the
source document doesn't really matter.
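
To illustrate, here is a minimal sketch of highlighting query terms
in the stored property text. It is not the highlighting code that
ships with swish-e; the function name is made up, and it assumes
plain whole-word, case-insensitive matching with no stemming or
phrase handling.

```python
import html
import re


def highlight(text, terms):
    """Return HTML-escaped text with whole-word matches in <b> tags."""
    words = [re.escape(t) for t in terms if t]
    if not words:
        return html.escape(text)
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    out = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        out.append(html.escape(text[pos:match.start()]))
        out.append("<b>" + html.escape(match.group(0)) + "</b>")
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Escaping happens piece by piece, so a stray "&" in the stored text
comes out as "&amp;" and never ends up inside a highlight tag.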

Bill Moseley

Received on Sat Aug 5 21:14:28 2006