ampersand issue and PDF syntax highlighting

From: Eric Jobidon <eric(at)>
Date: Sat Aug 05 2006 - 22:05:37 GMT
Hi folks,
I am using swish-e to index and search PDF files and have the following 2
questions:
1- In a specific setup, I am running a minimalist environment (win32/no
PERL/no web server) and would like to have the "ampersand" character handled
correctly when indexing PDF docs. I am using the "HTML*" filter to process
the PDF files and am telling xpdf to output HTML instead of raw text (with the
FileFilter .PDF pdftotext.exe '-htmlmeta "%p" -'
command in the config file). Everything indexes fine, but if the file
happens to contain a "&" symbol (not just in a URL, but anywhere in a
sentence), I get an error message to the effect that the parser is expecting
"&amp;" instead of "&". This is the parser being a good citizen, and
informing me that "&" on its own is not an HTML entity. Great, there is
probably no side effect to this, but I still want to fix this.
I guess I could grep the output of xpdf and replace "&" with "&amp;", but I
am thinking there is a much simpler answer to this, either as a config
directive in .xpdfrc (replacing "&" with "&amp;") or in the swish-e config
file (maybe treating "&" as a stop word). Any thoughts? Anyone had the same
issue? How would you fix it?
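For illustration, the "replace & with &amp;" idea from question 1 could be sketched roughly as below. This is only a sketch under assumptions: it presumes Python is available on the indexing machine (the setup described above has no Perl, so this may not hold), and the function name escape_bare_ampersands is made up for the example. It escapes only an "&" that does not already begin an entity, so existing "&amp;" sequences are not double-escaped.

```python
import re

# Matches "&" that does NOT already start a named or numeric entity
# such as &amp;, &#38; or &#x26;
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(html_text):
    """Return html_text with every bare "&" rewritten as "&amp;".

    Hypothetical helper: it would be applied to the pdftotext output
    before the HTML parser sees it.
    """
    return BARE_AMP.sub("&amp;", html_text)
```

A string like "AT&T;" would be left alone because "&T;" looks like an entity, so this is a heuristic rather than a full HTML fix. How the helper gets wired between pdftotext and the indexer (for example through a wrapper script) is not shown here and would need testing.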
2- (This question pertains to a different environment, on a unix box) I want
to offer syntax highlighting in the search results page. All the indexed
docs are PDF files. And they are all fairly large PDF files (some are over
100MB in size, and splitting them up is not an option). Parsing each file
takes 3~5 seconds with xpdf, so dynamic highlighting with
swish.cgi/search.cgi would be painfully slow if all the search results files
would need to be parsed with the cgi file. How can I offer syntax
highlighting in the search results page when the files are large PDF docs?
Has anyone encountered a similar situation? How was it handled? Any new
One avenue I am considering is to save the output from xpdf in a html file
and indexing only that html file (ignoring the PDF altogether for the
indexing). On a search, the cgi script would then parse the html file,
highlight the content and display it to the user. The displayed document URL
could then simply be transformed from ".html" to ".PDF". Think this would
work? Any other avenue I should explore?
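
The avenue described above could be sketched roughly as follows. This is a sketch only, assuming the text has already been read back from the saved html file at search time; the function names pdf_url_for and highlight are hypothetical, and this is not how swish.cgi or search.cgi do their highlighting.

```python
import html
import re
from pathlib import PurePosixPath

def pdf_url_for(html_path):
    """Map the indexed .html file name back to the original .PDF name."""
    return str(PurePosixPath(html_path).with_suffix(".PDF"))

def highlight(text, terms):
    """Escape plain text for HTML and wrap each matched term in <b> tags."""
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(text)
    # One capturing group, so re.split returns the matches at odd indexes
    pattern = re.compile(
        "(" + "|".join(re.escape(t) for t in terms) + ")", re.IGNORECASE
    )
    pieces = []
    for i, part in enumerate(pattern.split(text)):
        escaped = html.escape(part)
        pieces.append("<b>" + escaped + "</b>" if i % 2 else escaped)
    return "".join(pieces)
```

With this layout the PDF is parsed by xpdf once, at indexing time, and each search only touches the much cheaper html copy. The matching here is plain case-insensitive substring matching, which will not necessarily agree with how the index stems or splits words.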

Received on Sat Aug 5 15:05:42 2006