Re: Q: Swish-E foreign language character support

From: Kati Gäbler <katigaebler(at)not-real.topmail.de>
Date: Mon Feb 05 2001 - 00:42:38 GMT
On Sunday 04 February 2001 11:57, Rainer.Scherg@rexroth.de wrote:
> Which version of swish did you try?
>
> If you tried the 1.3x version, please switch to the 2.0.x version.

Thanks! The new version also has a much better install process. I 
just got Swish 2.04 up and running, searching a test index using a Perl 
script I found in my hosting provider's free CGI library. It now works with 
the non-English characters, brilliant!

Only one more detail: I would like some advice on how to include a meta 
description of the page in the search results. Currently, this is what the 
result page looks like when using the CGI script as it came:

-- example page --

Search Results

Keywords: word1

1000 <link>"Page title"</link> (1321 bytes)

--- end of page --

To enable the description, I added this to my Swish config file:
PropertyNames description
The spider then picks up the info from an HTML page:
<meta name="description" content="this is a page description..">
and I can now return it using the -p option on the command line.

But unfortunately I'm not much of a Perl wizard, so I can't quite figure 
out how to add it to the Perl script. Could someone kindly have a look at the 
script and let me know how to use -p with it, so that the description can be 
placed under the link of each search result?

The script comes in three parts:
1) search.pl, it's the part I can't fix.
2) util.pl, header and footer stuff.
3) search.html, the HTML form. 

But I guess it would only be necessary to change something in the first part, 
below:

#!/usr/bin/perl
#
# search.pl
#
# simple interface to SWISH
#

require 'util.pl';

$| = 1;  # unbuffer the data

# whereis swish
$swishexec = "/home/kati/public_html/swish/src/swish-e";
unless (-e $swishexec) {
  &print_header_info("Cannot open $swishexec");
  print <<ENDERROR;
  <h2><u>Cannot open $swishexec</u></h2>
  Cannot open \"$swishexec\".  File not found or permission denied.
  <p>
ENDERROR
  &print_footer_info();
  exit(0);
}

# get the form data
&parse_form_data(*array);

# required variable in the html form:
#    --swishindex, keywords
# optional fields:
#    --maxresults 
if ($array{'swishindex'} eq "") {
  # not happy crappy
  &print_header_info("Swishindex Variable Not Specified");
  print <<ENDERROR;
  <h2><u>Form Incomplete</u></h2>
  The form is incomplete.... no \"swishindex\" variable is
  available.  The \"swishindex\" variable specifies the
  pathname to the swish index.  
  <p>
ENDERROR
  &print_footer_info();
  exit(0);
}
if ($array{'keywords'} eq "") {
    # not happy crappy
  &print_header_info("Data Incomplete");
  print <<ENDERROR;
  <h2><u>Data Incomplete</u></h2>
  Your request to search has been
  rejected due to insufficient information.  To properly send
  your search request, please provide one or more keywords.  
  <p>
  search example 1: john and doe or jane<br>
  search example 2: john and (doe or jane)<br>
  search example 3: not (john or jane) and doe<br>
  search example 4: j* and doe<br>
  <p>
ENDERROR
  &print_footer_info();
  exit(0);
}

# everything is happy, open up a pipe to the swish executable
$command = "$swishexec -f $array{'swishindex'} -w \"$array{'keywords'}\""; 
if ($array{'maxresults'} ne "") {
  $command .= " -m $array{'maxresults'}";
}
&print_header_info("Search Results");
print "<h2>Search Results</h2>\n";
print "Keywords: <b>$array{'keywords'}</b>\n<p>\n";
open(SWISH, "$command|");
while (<SWISH>) {
  # results of swish can be-
  #         line beginning with "#"
  #         line beginning with "."
  #         line beginning with "err"
  #         line beginning with "search words:"
  #         line beginning with relevance rank [0-9]
  if (/^\./) {
    last;
  }
  elsif (/^err:/) {
    print "$_";
    last;
  }
  elsif (/^[0-9]/) {
    chomp;
    # can't simply split because spaces can exist in title
    $firstspace = index("$_", "\ ", 0); 
    if ($firstspace == -1) {
      next;
    }
    $secondspace = index("$_", "\ ", ($firstspace+1)); 
    if ($secondspace == -1) {
      next;
    }
    $lastspace = rindex("$_", "\ ");
    if ($lastspace == -1) {
      next;
    }
    $rank = substr($_, 0, $firstspace);
    $url = substr($_, ($firstspace+1), ($secondspace-$firstspace-1));
    $title = substr($_, ($secondspace+1), ($lastspace-$secondspace-1));
    $numbytes = substr($_, ($lastspace+1));
    print "$rank <a href=\"$url\">$title</a> ($numbytes bytes)<br>\n";
  }
}
close(SWISH);
if ($ENV{'PATH_INFO'} ne "") {
  print <<RETURNURL;
  <p>
  <a href=\"$ENV{'PATH_INFO'}\">Back to search form</a>
  <p>
RETURNURL
}
print "<p>\n";
&print_footer_info();

##############################################################################
# eof search.pl
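
A minimal sketch of one way the description could be shown under each link,
not a tested drop-in fix. It assumes that with "-p description" swish-e
appends the property value in double quotes after the byte count, i.e.
rank url "title" bytes "description"; compare that against real command-line
output before relying on it. The helper names parse_result_line, escape_html,
format_result and swish_args are hypothetical and are not part of search.pl
or util.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one swish-e result line into its fields.
# ASSUMED format with "-p description":
#   rank url "title" bytes "description"
# Without -p the trailing quoted field is absent and description is ''.
sub parse_result_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^(\d+)\s+(\S+)\s+"(.*?)"\s+(\d+)(?:\s+"(.*)")?\s*$/) {
        return {
            rank        => $1,
            url         => $2,
            title       => $3,
            bytes       => $4,
            description => defined $5 ? $5 : '',
        };
    }
    return;    # header, "err:" or "." lines are not results
}

# Escape text before it is placed into the HTML result page.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build the HTML for one result, with the description under the link.
sub format_result {
    my ($r) = @_;
    my $html = sprintf qq{%s <a href="%s">%s</a> (%s bytes)<br>\n},
        $r->{rank}, escape_html($r->{url}), escape_html($r->{title}),
        $r->{bytes};
    $html .= escape_html($r->{description}) . "<br>\n"
        if $r->{description} ne '';
    return $html;
}

# Build the swish-e argument list, including "-p description".
# -m is only added when maxresults is a plain number.
sub swish_args {
    my ($index, $keywords, $max) = @_;
    my @args = ('-f', $index, '-w', $keywords, '-p', 'description');
    push @args, '-m', $max if defined $max && $max =~ /^\d+$/;
    return @args;
}

# Demonstration on a made-up result line in the assumed format.
my $sample = qq{1000 http://www.example.com/page.html "Page title" 1321 }
           . qq{"this is a page description.."\n};
if (my $r = parse_result_line($sample)) {
    print format_result($r);
}
```

In search.pl, the body of the elsif (/^[0-9]/) branch could then be reduced to
a call to parse_result_line($_) followed by print format_result(...). The pipe
could be opened with open(SWISH, '-|', $swishexec, swish_args(...)), which
needs Perl 5.8 or later and keeps the keywords from passing through a shell.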

--------------------------------------------------------------------


The part below just prints a header and footer. It isn't of much use, but the 
above script doesn't run if I exclude it.

#
# util.pl
#
# utilities file with common subroutines
# used by pretty much all of the library CGI scripts
#

##############################################################################
# common subroutines 
##############################################################################

################################################
# get the variables by calling parse_form_data
# for example, "&parse_form_data(*array)"
# thanks Stacey  :)
sub parse_form_data
{
  local (*FORM_DATA) = @_;
  local ($request_method, $query_string, @key_value_pairs,
         $key_value, $key, $value);

  $request_method = $ENV{'REQUEST_METHOD'};

  if ($request_method eq "GET") {
    $query_string = $ENV{'QUERY_STRING'};
  } elsif ($request_method eq "POST") {
    read(STDIN, $query_string, $ENV{'CONTENT_LENGTH'});
  } else {   # neither POST nor GET
    $query_string = $ENV{'QUERY_STRING'};
  }

  @key_value_pairs = split(/&/, $query_string);

  foreach $key_value (@key_value_pairs) {
    ($key, $value) = split (/=/, $key_value);
    $key =~ tr/+/ /;
    $value =~ tr/+/ /;
    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex($1))/eg;

    if (defined($FORM_DATA{$key})) {
      $FORM_DATA{$key} = join("|", $FORM_DATA{$key}, $value);
    } else {
      $FORM_DATA{$key} = $value;
    }
  }
}

################################################
# print the footer information
sub print_footer_info
{ 
  print "</td>\n";
  print "</tr>\n";
  print "</table>\n";

  # print out the copyright footer
  # NOTE TO RESELLERS/CLIENTS: this is a library specific file,
  # delete or comment out for your own use
  if (-e "/www/htdocs/includes/copyright.txt") {
    open(COPYRIGHT, "/www/htdocs/includes/copyright.txt");
    while (<COPYRIGHT>) {
      print $_;
    }
    close(COPYRIGHT);
  }
  # print out the colorstrip footer
  # NOTE TO RESELLERS/CLIENTS: this is a library specific file,
  # delete or comment out for your own use
  if (-e "/www/htdocs/includes/colorstrip.txt") {
    open(COLORSTRIP, "/www/htdocs/includes/colorstrip.txt");
    while (<COLORSTRIP>) {
      print $_;
    }
    close(COLORSTRIP);
  }

  print "</body>\n";

  # close it out
  print "</html>\n"; 
} 
  
################################################
# print the header information
sub print_header_info
{
  local ($title) = @_;

  print "Content-type: text/html\n\n";

  # print out the title
  print "<html>\n";
  print "<head> \n";
  print "<title>$title</title>\n";
  if (-e "/www/htdocs/includes/javascript/main.js") {
    print "<script Language=\"JavaScript\">\n";
    print "<!--\n";
    open(JS, "/www/htdocs/includes/javascript/main.js");
    while (<JS>) {
      print $_;
    }
    close(JS);
    print "//-->\n";
    print "</script>\n";
  }
  print "</head> \n";
   
  # print out the header, which should include a <body> tag
  if (-e "/www/htdocs/includes/body.txt") {
    open(BODY, "/www/htdocs/includes/body.txt");
    while (<BODY>) {
      print $_;
    }
    close(BODY);  
  }
  else {
    print "<body bgcolor=\"#ffffff\">\n";
  }

  # print out the toolbar
  # NOTE TO RESELLERS/CLIENTS: this is a library specific file,
  # delete or comment out for your own use
  if (-e "/www/htdocs/includes/toolstrip/support_sub.txt") {
    open(TOOLBAR, "/www/htdocs/includes/toolstrip/support_sub.txt");
    while (<TOOLBAR>) {
      print $_;
    }
    close(TOOLBAR);  
  }
  print <<ENDHEADER;
  <table>
  <tr>
  <td width=600>
ENDHEADER
}

################################################
# print an error
sub return_error
{
  local ($message) = @_;

  print <<ENDERROR;
  &#160;<br>
  <h2><u>Unknown Error</u></h2>
  An unknown error has been encountered.
  The error message is listed below:
  <p>
  <ul>
  <b>$message</b>
  </ul>
  <p>
ENDERROR
  &print_footer_info();
  exit(1);
}

##############################################################################
# eof util.pl

1;


----------------------------------------

Below is the HTML form. It has the Swish index defined in one of its hidden 
input fields, which is handy for modifying later with JavaScript when using 
different indexes.

<html>
<head>
<title>Search Swish-E Index</title>
</head>
<body>
<h1>Search Swish-E Index</h1>

<form method="GET" action="cgi-bin/search.pl">

<input type="hidden" name="swishindex" 
value="/home/kati/public_html/swish/test.index">

<b>Search for the following keywords:</b><br>
<input name="keywords" size=40 maxlength=512>
<p>

<b>Maximum number of results:</b><br>
<input name="maxresults" size=5 value=40 maxlength=64>
<p>

<input type="submit" value="Search"> <input type="reset" value="Reset">
<p>
__________________________________________<p>
search example 1: john and doe or jane<br>
search example 2: john and (doe or jane)<br>
search example 3: not (john or jane) and doe<br>
search example 4: j* and doe<br>
<p>

</form>

</body>
</html>


--------------------------

Also, does anyone have any CGIs that I could test for the form and 
processing? I couldn't find much of that on the Swish-E site.

Thanks!
Kati

"I'm prepared for all emergencies but totally unprepared for everyday life."





>
>
> Development of 2.0.x is done at http://swishe.sourceforge.net  (docs, etc).
> Download of latest source is also available at http://www.boe.es/swish-e/
> or via links from http://sunsite.berkeley.edu/SWISH-E/
>
>
> tschuess... Rainer
>
> > -----Original Message-----
> > From: Kati Gäbler [mailto:katigaebler@topmail.de]
> > Sent: Sunday, February 04, 2001 10:25 AM
> > To: Multiple recipients of list
> > Subject: [SWISH-E] Q: Swish-E foreign language character support
> >
> >
> > Hello Swish-E users,
> >
> > I just set up Swish-E for the first time, and I got it working successfully
> > using the HTTP method, with both command-line and CGI-form searching.
> >
> > The only feature I'm missing is support for some of the
> > foreign language
> > characters. For example, commonly used German characters are:
> > . And
> > various Spanish, Italian, and French characters are: 
> >   etc., they
> > don't get returned in the search results, I guess because they're not
> > indexed, as it states in the Swish-E config file, only use these ones:
> >
> > abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
> >
> > In any case, I tested adding the characters  in the config file and in the
> > HTML files, and re-indexed, but as I expected, it didn't work. However, I
> > also tested using German characters () in the META tags ("PropertyNames"),
> > and notably in that situation they worked when I did a command line search
> > with the -p option.
> >
> > I have a site in Spanish, Italian, French, German and English, so for my
> > purposes it's important to make these foreign characters work. Does anyone
> > know a fix for this, or would much of the Swish-E code need to be rebuilt to
> > make it work? If so, maybe it would be better for me to find another search
> > engine. Any advice on this foreign-character problem would be appreciated!
> >
> > Regards,
> > Kati
> >
> > PS: I just joined the list, so I'm not sure if it's working. Please include a
> > CC of any replies to me at katigaebler@topmail.de. Thanks.
> >
> > --
> >
> > Rules:
> >         (1)  The boss is always right.
> >         (2)  When the boss is wrong, refer to rule 1.
> >
> >
> > -----------------------------------------------------------
> > This Mail has been checked for Viruses
> > Attention: Encrypted Mails can NOT be checked !
> >
> > ***
> >
> > Diese Mail wurde auf Viren ueberprueft
> > Hinweis: Verschluesselte Mails koennen NICHT geprueft werden!
> > ------------------------------------------------------------
Received on Mon Feb 5 00:46:20 2001