[swish-e] Returning document title rather than file name in search results?

From: Greg Keith <Greg.Keith(at)not-real.noaa.gov>
Date: Wed Mar 11 2009 - 21:14:33 GMT
 
Hi all -

We're using swish-e 2.4.5 on a RHEL 5.3 i686 32-bit Linux box with Perl
5.8.8. We have basic search up and running with swish.cgi as the
search on our intranet; the index is built using the file system method.

My current issue is fairly minor - I am getting search results back
with the file name as the first element in the result, as here:

"Search the Intranet

Limit search to: Title & Body Title Document Path
Sort by: Reverse Sort
 Results for forms   1 to 15 of 35 results.     Run time: 0.009
seconds | Search time: 0.000 seconds   
 Page:1 2 3 Next 15

1 CAM_Chapter_13.pdf -- rank: 1000
    Commerce Acquisition Manual 1313.301 April 5, 2000 (Supercedes CAM
Chapter 13-1, dated July 1996) Current through CAM Notice 02-02
Department of Commerce Purchase Card Procedures Section 1  Purchase
Card Program Overview 1.1 1.2 1.3 1.4 Program Introduction and Purpose
Policy Definitions Roles and Responsibilities Section 2 - Obtaining
and Maintaining a ...
    Last Modified Date:    2009-02-26 16:58:34 MST
    Document Size:    223048
    Document Path:   
https://intranet/admin/bankcard/pdf/CAM_Chapter_13.pdf

2 STC_Travel_Policy.pdf -- rank: 927
    ... location and on the STC Intranet at www.stcintranet.com under
Policies & Forms. If an employee cannot locate this manual or this
Policy ...
    Last Modified Date:    2009-02-26 16:58:34 MST
    Document Size:    216697
    Document Path:   
https://intranet/admin/travel/pdf/STC_Travel_Policy.pdf

3 401(k)_Summary_Plan_Description.pdf -- rank: 645
    Deferred Salary Profit-Sharing Thrift Plan for Employees of
Science & Technology Corporation Table of Contents Introduction
Important Information about the Plan Joining the Plan Contributions to
the Plan Managing Your Account Ownership of Your Account (Vesting)
Loans Benefits Taxes on Distributions Distribution Claim Procedures
Legal Rights Additional Information 3 4 5 6 ...
    Last Modified Date:    2009-02-26 16:58:34 MST
    Document Size:    198125
    Document Path:   
https://intranet/admin/benefits/pdf/401(k)_Summary_Plan_Description.pdf

4 OF306.pdf -- rank: 556
    ... false statement on any part of this declaration or attached
forms or sheets may be grounds for not hiring you, or ...
    Last Modified Date:    2009-02-26 16:58:36 MST
    Document Size:    99898
    Document Path:    https://intranet/facilities/security/pdf/OF306.pdf"

...but I want the document title returned as the first link, if there
is one. Most of the documents I'm indexing are HTML, so there should
be a <title> tag for most of them. I am not clear on how to do this.
It looks like it should be the proper combination of specifying
title_property in swish.cgi and the MetaNames directive in my
swish.conf, but I don't know what that combination is. I tried
removing the MetaNames directive from swish.conf and setting
title_property to "title" rather than "swishtitle", but this just
produces a "(null)" result for each document found. My swish.conf and
swish.cgi are below.
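
One assumption worth checking before touching title_property is whether
the HTML files really do carry a <title> tag. Below is a minimal shell
sketch, assuming GNU find and grep are available. The function name
list_untitled is made up for illustration, and the extensions simply
mirror the HTML types listed in IndexOnly:

```shell
#!/bin/sh
# Hypothetical helper: list HTML-like files under a directory that
# contain no <title> tag (case-insensitive match on "<title").
list_untitled() {
    dir="$1"
    # grep -L prints the names of files in which the pattern was NOT found.
    find "$dir" -type f \
        \( -name '*.htm' -o -name '*.html' -o -name '*.shtml' \) \
        -exec grep -L -i '<title' {} +
}

# Example (path taken from IndexDir in swish.conf):
#   list_untitled /intranet
```

Any file this prints has no <title> tag. The check says nothing about
the .pdf and .doc files, which swish.conf sends through the text parser.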

Can anyone enlighten me?

Thanks!

Greg


===============

(swish.conf)

# Directory to index
IndexDir /intranet

# What to index
IndexOnly .htm .html .doc .pdf  .shtml .txt .pro

# Exclude all files of these types
FileRules filename contains \.(ppt|xls|gif|jpg|asp|php|png|css|js)$

# Don't index the RCS directories
FileRules dirname contains RCS

# Filters for non-native content
FileFilter .pdf /usr/local/bin/pdftotext   "'%p' -"
FileFilter .doc /usr/bin/catdoc "-s8859-1 -d8859-1 '%p'"

# Tell Swish-e that .txt, .pro, .pdf and .doc files are to use the text parser.
IndexContents TXT* .txt .pro .pdf .doc

# Otherwise, use the HTML parser
DefaultContents HTML*

# The first "n" number of characters to store as the file description in search results
#StoreDescription HTML* <title> 200
StoreDescription HTML* <body> 1000
StoreDescription TXT* 1000

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9

# Specify which meta names to include in the index
MetaNames title body description swishtitle swishdescription swishdocpath

# These are the characters that are allowed in a "word".
# i.e. words are split on any character NOT found in WordCharacters
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-

# We allow a period and a dash within words, but strip them
# from the beginning or end of a word.  This is done after
# WordCharacters above is used to split words.
IgnoreFirstChar .-
IgnoreLastChar  .-

# Finally, resulting words must begin/end with one
# of the characters listed here
BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789
EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789

# Perform a file system search, but return the results as URLs
ReplaceRules prepend "https://intranet/"
ReplaceRules remove "/intranet/"
 
================================

(swish.cgi)

#!/usr/bin/perl -w
package SwishSearch;
use strict;

# This is set to where Swish-e's "make install" installed the helper modules.
use lib ( '/usr/local/lib/swish-e/perl' );


my $DEFAULT_CONFIG_FILE = '.swishcgi.conf';

# This is written this way so the script can be used as a CGI script or a mod_perl
# module without any code changes.

# use CGI ();  # might not be needed if using Apache::Request


#=================================================================================
#   CGI entry point
#
#=================================================================================


use vars '$speedy_config';  # Global for caching in persistent environment such as SpeedyCGI

# Run the script -- entry point if running as a CGI script

    unless ( $ENV{MOD_PERL} ) {
        if ( !$speedy_config ) {
            $speedy_config = default_config();

            # Merge with disk config file.
            $speedy_config = merge_read_config( $speedy_config );
        }

        process_request( $speedy_config );

    }


sub default_config {



    ##### Configuration Parameters #####
   

    return {
        title            => 'Search the PSD Intranet',  # Title of your choice.  Displays on the search page
        swish_binary     => '/usr/local/bin/swish-e',  # Location of swish-e binary
        config_file      => $DEFAULT_CONFIG_FILE,    # Default config file
        swish_index      => '/usr/local/lib/swish-e/conf/index.swish-e',    # Location of your index file
        page_size        => 15,                 # Number of results per page  - default 15
        #link_property   => 'swishdocpath',

        ## Display properties ##
        title_property   => 'swishtitle',
        description_prop => 'swishdescription',
        display_props    => [qw/swishlastmodified swishdocsize swishdocpath/],
        sorts            => [qw/swishrank swishlastmodified swishtitle/],
        secondary_sort   => [qw/swishlastmodified desc/],
        metanames        => [qw/ swishdefault swishtitle swishdocpath /],

        meta_groups => {
            all =>  [qw/swishdefault swishtitle swishdocpath/],
        },

        name_labels => {
            swishdefault        => 'Title & Body',
            swishtitle          => 'Title',
            swishrank           => 'Rank',
            swishlastmodified   => 'Last Modified Date',
            swishdocpath        => 'Document Path',
            swishdocsize        => 'Document Size',
            all                 => 'All',              # group of metanames
            subject             => 'Message Subject',  # other examples
            name                => "Poster's Name",
            email               => "Poster's Email",
            sent                => 'Message Date',
        },


        timeout         => 10,    # limit time used by swish when fetching results - DoS protection.
        max_query_length => 100,  # limit length of query string.  Swish also has a limit (default is 40)
        max_chars       => 500,   # Limits the size of the description_prop if it is not highlighted

        # This structure defines term highlighting, and what type of highlighting to use
        # If you are using metanames in your searches and they map to properties that you
        # will display, you may need to adjust the "meta_to_prop_map".

        highlight       => {

            package         => 'SWISH::PhraseHighlight',
            show_words      => 10,    # Number of "swish words" words to show around highlighted word
            max_words       => 100,   # If no words are found to highlight then show this many words
            occurrences     => 6,     # Limit number of occurrences of highlighted words
            #highlight_on   => '<b>', # HTML highlighting codes
            #highlight_off  => '</b>',
            highlight_on    => '<font style="background:#FFFF99">',
            highlight_off   => '</font>',

            meta_to_prop_map => {
                swishdefault    => [ qw/swishtitle swishdescription/ ],
                swishtitle      => [ qw/swishtitle/ ],
                swishdocpath    => [ qw/swishdocpath/ ],
            },
        },

        Xselect_indexes  => {
            # pick radio_group, popup_menu, or checkbox_group
            method  => 'checkbox_group',
            #method => 'radio_group',
            #method => 'popup_menu',

            columns => 3,
            # labels must match up one-to-one with elements in "swish_index"
            labels  => [ 'Main Index', 'Other Index', qw/ two three four/ ],
            description => 'Select Site: ',
            default_index => '',
        },

        Xselect_by_meta  => {
            #method      => 'radio_group',  # pick: radio_group, popup_menu, or checkbox_group
            method      => 'checkbox_group',
            #method      => 'popup_menu',
            columns     => 3,
            metaname    => 'site',     # Can't be a metaname used elsewhere!
            values      => [qw/misc mod vhosts other/],
            labels  => {
                misc    => 'General Apache docs',
                mod     => 'Apache Modules',
                vhosts  => 'Virtual hosts',
            },
            description => 'Limit search to these areas: ',
        },

        xtemplate => {
            package     => 'SWISH::TemplateDefault',
        },

        xtemplate => {
            package     => 'SWISH::TemplateDumper',
        },

        xtemplate => {
            package         => 'SWISH::TemplateToolkit',
            file            => 'swish.tt',
            options         => {
                INCLUDE_PATH    => '/usr/local/share/swish-e',
                #PRE_PROCESS     => 'config',
            },
        },

        xtemplate => {
            package         => 'SWISH::TemplateHTMLTemplate',
            options         => {
                filename            => 'swish.tmpl',
                path                => '/usr/local/share/swish-e',
                die_on_bad_params   => 0,
                loop_context_vars   => 1,
                cache               => 1,
            },
        },


        on_intranet => 1,
        no_first_page_navigation   => 0,
        no_last_page_navigation    => 0,
        num_pages_to_show          => 12,  # number of pages to offer
        limit_procs     => 0,  # max number of swish process to run (zero to not limit)
        ps_prog         => '/bin/ps -Unobody -ocommand',  # command to list number of swish binaries

    };

}

#^^^^^^^^^^^^^^^^^^^^^^^^^ end of user config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Received on Wed Mar 11 17:14:32 2009