Indexing (test_url)

From: Richard Vaillancourt <Richard.Vaillancourt(at)not-real.sit.ulaval.ca>
Date: Fri Jul 29 2005 - 15:19:30 GMT
We currently use the Swish-e search engine, version 2.2.3, on Red Hat Linux to index pages in our CMS, Jahia 4.05 (www.jahia.org).

When a user authenticates on a Web page, the same page becomes reachable under a different URL, because Jahia appends a jsessionid parameter to the end of the URL. Here is an example:

For the following Web page:
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659

If users are authenticated for this page, we find several versions of it in the Jahia Web cache:

http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C4
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C3
etc.

If I add a filter in Swish-e to exclude pages whose URL contains the string ";jsessionid", the page http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659 itself is no longer indexed.

How can I avoid indexing all the URLs that contain the ";jsessionid" string without losing the single page without parameters (which is never requested on its own, without the parameter)?

One solution would be to modify the URI variable in the test_url subroutine of the Perl script "SwishSpiderConfigSit.pl" by adding the following line:

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;

    # ***************************************************
    # My new line
    $uri->path =~ s/\;jses.+//;
    # ***************************************************

    # ignore any common image files
    return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;

}

The problem is that the URI variable is read-only. Does anybody have ideas or solutions to help me?

Thanks.
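
A possible way around the read-only problem, as a minimal sketch: `$uri->path` returns a copy of the path string, so the substitution has nothing to write back to. The URI module does, however, accept a new value as an argument to `path`. The helper name `strip_jsessionid` is hypothetical and not part of spider.pl. Whether the spider then stores and follows the rewritten URL depends on the spider.pl version in use, so that part is an assumption to verify.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of spider.pl): remove a ";jsessionid=..."
# path parameter from a URI object, modifying the object in place.
sub strip_jsessionid {
    my ($uri) = @_;
    my $path = $uri->path;                   # getter: returns a copy of the path
    $path =~ s{;jsessionid=[^/?#]*}{}i;      # edit the copy
    $uri->path($path);                       # setter: write the new path back
    return $uri;
}

sub test_url {
    my ( $uri, $server ) = @_;

    # Normalize the URL first, so every session maps to the same page.
    strip_jsessionid($uri);

    # ignore any common image files
    return $uri->path !~ /\.(gif|jpg|jpeg|png)$/;
}

# Standalone check, outside the spider:
my $uri = URI->new(
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659'
    . ';jsessionid=F1DDB0B8E010D86DA452AA25B85749C5'
);
my $ok = test_url( $uri, {} );
print "$uri\n";
```

With this approach the three session URLs in the example above are all reduced to the same path before the image test runs, instead of being rejected outright.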

-------------------

French (translated):

We use the Swish-e search engine, version 2.2.3, on Red Hat Linux to index the pages of our CMS, Jahia 4.05 (www.jahia.org).

However, during indexing, several URLs end with ;jsessionid=..., which has to be removed from the URL. How can we remove it while indexing?

Example:
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659  <-- OK

We tried to modify the test_url routine of the Perl program SwishSpiderConfigSit.pl, without success:

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;
    # Added the following line to remove the ;jsessionid=... (without success)
    $uri->path =~ s/\;jses.+//;
    # ignore any common image files

    return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;

}

How can we index the URLs with the ;jsessionid= part removed?

Thanks in advance
 
Richard Vaillancourt
SIT, Division des systèmes
Pavillon Casault, Université Laval, Ste-Foy, Canada, G1K 7P4
Richard.Vaillancourt@sit.ulaval.ca
Tél: 418-656-2131 poste 6280,  Télécopieur: 418-656-7305
www: http://www.sit.ulaval.ca/pp/rva/rva.html
Received on Fri Jul 29 08:19:42 2005