[swish-e] Hit-highlighting of PDF files

From: Ostermayr Richard Dr. <Richard.Ostermayr(at)not-real.dpma.de>
Date: Wed Oct 22 2008 - 16:06:10 GMT

Hi,

I'd like to present a small piece of demo CGI code for hit-highlighting
of PDF files.

To make it easier to follow, the code is kept very simple, but it is
nevertheless fully functional.

The highlighting is achieved by appending to the PDF file a small piece
of JavaScript, which performs a query similar to the SWISH-E query every
time the PDF file is opened.

For this purpose, the SWISH-E query is transformed to fit the
requirements of a PDF query.
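
As a rough illustration of that transformation, here is a minimal sketch for the plain word search. The helper names `js_single_quoted` and `word_search_js` are hypothetical and not part of the script below; the `search.*` properties are the ones the script itself uses. The sketch additionally escapes quotes and backslashes so that the query cannot break out of the JavaScript string literal.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use inside a single-quoted
# JavaScript literal (backslashes first, then quotes, then line breaks).
sub js_single_quoted {
    my ($text) = @_;
    $text =~ s/\\/\\\\/g;
    $text =~ s/'/\\'/g;
    $text =~ s/\r?\n/ /g;
    return "'$text'";
}

# Hypothetical helper: build the Acrobat JavaScript for a word search.
sub word_search_js {
    my ($query) = @_;

    # A truncation character (* or ?) after a word character turns
    # whole-word matching off, as in the script below.
    my $whole_word = ($query =~ /\w[*?]/) ? 'false' : 'true';

    # The truncation characters themselves are not passed on to the PDF.
    (my $terms = $query) =~ s/[*?]//g;

    return join '',
        "search.matchWholeWord = $whole_word;",
        "search.wordMatching = 'MatchAnyWord';",
        'search.query(' . js_single_quoted($terms) . ", 'ActiveDoc');";
}

print word_search_js('swish* index'), "\n";
```

The string returned by `word_search_js` is what would end up in the `JS` entry of the PDF OpenAction.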

 

The advantages of this approach are:

1. The exploitation of the proximity search feature described in the
   Acrobat JavaScript Scripting Guide (as far as I know, a proximity
   search is not possible in the Adobe Reader search menu).

2. The shift of the CPU load caused by the PDF search to the client.

3. The persistence of the generated query (it is a PDF OpenAction).
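
The proximity case from point 1 can be sketched as follows. This is only an illustration under stated assumptions: `parse_near_query` is a hypothetical helper, it accepts exactly one pair of terms in the form `term NEARn term` (the same shape the script below analyzes), and, unlike the script, it anchors the pattern and ignores case.

```perl
use strict;
use warnings;

# Hypothetical helper: split "term1 NEARn term2" into its parts.
# Returns an empty list when the query does not have that shape.
sub parse_near_query {
    my ($query) = @_;
    if ($query =~ /^\s*(\w+)\s+NEAR([0-9]+)\s+(\w+)\s*$/i) {
        return (terms => "$1 $3", range => $2);
    }
    return;
}

my %near = parse_near_query('patent NEAR5 office');
if (%near) {
    # Same search.* settings as in the script below.
    print "search.wordMatching = 'MatchAllWords';",
          "search.proximity = true;",
          "search.proximityRange = '$near{range}';",
          "search.query('$near{terms}', 'ActiveDoc');\n";
}
```

Because the terms can only contain word characters and a single space, they need no further escaping before being placed in the JavaScript string.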

 

Since the script works with IO::String, it is not yet well suited for
huge files, but this is easy to overcome by using a temporary file
instead. On our server this technique works for PDFs > 100 MB.
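
A minimal sketch of the temporary-file variant, using only core modules: `spool_to_tempfile` is a hypothetical helper that copies a binary stream to disk in chunks instead of holding it in one string. The resulting path could then be handed to `Text::PDF::File->open($path, 1)` in place of the IO::String handle, on the assumption that `open` accepts a file name as well as a handle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy a binary stream to a temporary file in
# fixed-size chunks, so the whole PDF never has to sit in memory.
sub spool_to_tempfile {
    my ($in) = @_;

    # UNLINK => 1 removes the file again when the program exits.
    my ($out, $path) = tempfile(SUFFIX => '.pdf', UNLINK => 1);

    # PDF data is binary; switch off any newline translation.
    binmode $_ for $in, $out;

    my $buf;
    while (read($in, $buf, 65536)) {
        print {$out} $buf;
    }
    close $out or die "Cannot close $path: $!";
    return $path;
}

# Demonstration with an in-memory handle standing in for the download.
open my $in, '<', \"%PDF-1.4\n(demo bytes)\n" or die "Cannot open: $!";
my $path = spool_to_tempfile($in);
print "spooled ", (-s $path), " bytes to $path\n";
```

In the CGI script the input handle would be the HTTP response body rather than an in-memory string; how that stream is obtained is left open here.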

 

Since PDF files form a major part of every intranet, it may be
advantageous to incorporate this technique into the SWISH-E package.

 

Any comments are welcome.

 

Best regards

Richard Ostermayr

#!D:\perl\perl.exe -w

########################################################################
# Skeleton CGI script for transferring SWISH-E queries into PDF files. #
#                                                                      #
# CGI params:                                                          #
# url= URL of PDF file                                                 #
# query = SWISH-E query                                                #
# type =  search type (word, near/proximity or phrase)                 #
#                                                                      #
# Copyright 2008 Richard Ostermayr - All rights reserved.              #
#                                                                      #
#    This program is free software; you can redistribute it and/or     #
#    modify it under the terms of the GNU General Public License       #
#    as published by the Free Software Foundation; either version      #
#    2 of the License, or (at your option) any later version.          #
#                                                                      #
#    This program is distributed in the hope that it will be useful,   #
#    but WITHOUT ANY WARRANTY; without even the implied warranty of    #
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the     #
#    GNU General Public License for more details.                      #
#                                                                      #
#    The above lines must remain at the top of this program            #
########################################################################

use strict;
use LWP::UserAgent;
use HTTP::Request;
use Text::PDF::Utils;
use Text::PDF::Page;
use Text::PDF::Pages;
use Text::PDF::File;
use IO::String;
use CGI;

my $q = CGI->new;
my ($query_dist, $query_phrase, $dist, $pdfstr);

my $url   = $q->param('url');
my $query = uc($q->param('query'));
my $type  = $q->param('type');

# Use of truncation: a (single) truncated search term in the query turns
# matchWholeWord off
if ($query =~ /\w[\*\?]/) {
    $query =~ s/[\*\?]//g;
    $pdfstr = "search.matchWholeWord = false;";
} else {
    $pdfstr = "search.matchWholeWord = true;";
}

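# Search type WORD: MatchAnyWord is turned on
if ($type eq 'word') {
    # Escape quotes and backslashes so the query cannot break the JavaScript string
    (my $js_query = $query) =~ s/(['\\])/\\$1/g;
    $pdfstr .= "search.wordMatching = 'MatchAnyWord';search.query('$js_query', 'ActiveDoc');";
}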

# Search type NEAR (Proximity): MatchAllWords is turned on
# Only 1 pair of search terms is analyzed
# (the match is part of the condition so that stale $1..$3 are never used)
if ($type eq 'near' && $query =~ /(\w+)\s+NEAR([0-9]+)\s+(\w+)/) {
    $query_dist = "$1 $3";
    $dist = $2;
    $pdfstr .= "search.wordMatching = 'MatchAllWords';search.proximity = true;" .
               "search.proximityRange = '$dist';search.query('$query_dist', 'ActiveDoc');";
}

# Search type PHRASE: MatchPhrase is turned on
# (the match is part of the condition so that a stale $1 is never used)
if ($type eq 'phrase' && $query =~ /"\s*((\w+\s+)*\w+)\s*"/) {
    $query_phrase = $1;
    $pdfstr .= "search.wordMatching = 'MatchPhrase';search.query('$query_phrase', 'ActiveDoc');";
}

# Load PDF from URL
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new('GET', $url);
my $response = $ua->request($req);
die "Could not fetch $url: " . $response->status_line . "\n"
    unless $response->is_success;

# Create PDF JavaScript search object
my $res = $response->content();
my $io = IO::String->new($res);          # handle is attached to $res: writes change $res
my $p = Text::PDF::File->open($io, 1);   # 1 = write
my $r = $p->read_obj($p->{'Root'});
my $t = $p->read_obj($r->{'Pages'});

# Build the JavaScript action and register it as the document's OpenAction
my $x = PDFDict();
$x->{'Type'} = PDFName('Action');
$x->{'S'} = PDFName('JavaScript');
$x->{'JS'} = PDFStr($pdfstr);
$p->new_obj($x);
$r->{'OpenAction'} = $x;
$p->out_obj($r);
$p->append_file();

$p->release();

# Print modified PDF
print "Content-type: application/pdf\n\n";
binmode STDOUT;   # no newline translation for the binary PDF data (matters on Windows)
print $res;

Received on Wed Oct 22 12:06:12 2008