Re: swish.cgi results no path in title

From: Aaron Bazar <aaronb(at)not-real.spamcop.net>
Date: Mon Sep 15 2003 - 00:33:49 GMT
Wow, your process was able to narrow it down!

OK. I have a slightly tweaked CGI script, so I started with the default
swish.cgi that comes with the installation. It still did not work. I swapped
my template.tmpl file for the one that came with the download; it still did
not work. Then I turned off "use_library" (in other words, I used the binary
instead of the Perl API) and it worked!

Just to make sure it was not something else, I went to my new installation
(newest release of swish) and made a fresh index of the document like you
did. I performed the query with the newest version of swish.cgi with, and
without, use_library=1 ... When using the binary, the listing comes up fine,
just like yours. When using the perl API, it comes up without the name of
the document.

The next test I did was to create a document without a title tag, but not as
huge as the page where I found the issue. The same thing happens on a
normal-sized page: if there is no title tag, the docpath is still not used
as the title when going through the API.

Finally, I reindexed the file without using spider.pl ... I did it directly
from the file system. The issue still happened when using swish.cgi. So, it
would seem, something is going on with the API interface to the index as
opposed to the swish-e binary.

I found this in the cgi script:

        my %props;

        for my $prop ( @props ) {
            # Note, we use ResultPropertyStr instead since this is a general purpose
            # script (it converts dates to a string, for example).
            # $result->Property is a faster method and does not convert dates and
            # numbers to strings.
            #my $value = $result->Property( $prop );
            my $value = $result->ResultPropertyStr( $prop );
            next unless $value;  # ??

            $props{$prop} = $value;
        }

        $hit_count++;

        $self->add_result_to_list( \%props );

        last unless --$page_size;
    }


and changed it to this:


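            my $value = $result->ResultPropertyStr( $prop );
            next unless $value;  # ??

            $props{$prop} = $value;

            ### MY CHANGE ###
            # String comparison (eq), not numeric (==); also covers an unset title.
            if ( !defined $props{"swishtitle"} || $props{"swishtitle"} eq "" ) {
                $props{"swishtitle"} = "Untitled";
            }
            ### END MY CHANGE ###
        }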
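
For reference, Perl's `==` compares numerically while `eq` compares as strings, which matters for an empty-title test like the one above: a non-numeric title counts as 0 under `==` and so looks "equal" to the empty string. A small standalone sketch of the difference (the `compare_both` name is made up for this illustration and is not part of swish.cgi):

```perl
use strict;
use warnings;

# Compare a title against the empty string both ways.
sub compare_both {
    my ($title) = @_;
    no warnings 'numeric';    # silence "isn't numeric" warnings for the demo
    my $numeric = ( $title == "" ) ? 1 : 0;    # non-numeric strings count as 0
    my $string  = ( $title eq "" ) ? 1 : 0;    # true only for an empty string
    return ( $numeric, $string );
}

for my $title ( "My Page", "", "42 Tips" ) {
    my ( $numeric, $string ) = compare_both($title);
    printf "%-10s ==: %d  eq: %d\n", "'$title'", $numeric, $string;
}
```

Only the `eq` column singles out the genuinely empty title.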






That seemed to do the trick for me. Now, "Untitled" comes up when there is no
title. I suspect there is a better way to do this, but I would actually
rather have "Untitled" instead of the document path (the document path
can get quite long sometimes and mess up my tables).
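
One possible middle ground, sketched here only as an illustration: use the title when present, otherwise a shortened form of the document path, and "Untitled" as a last resort. The `display_title` helper and its 40-character default are assumptions made for this example, not anything in swish.cgi; it also assumes the properties hash carries `swishtitle` and `swishdocpath`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swish.cgi): choose link text for a hit.
# Assumes the hash holds the "swishtitle" and "swishdocpath" properties.
sub display_title {
    my ( $props, $max_len ) = @_;
    $max_len = 40 unless defined $max_len;    # assumed default width

    my $title = $props->{swishtitle};
    return $title if defined $title && length $title;

    my $path = $props->{swishdocpath};
    return "Untitled" unless defined $path && length $path;

    # Keep only the last path segment, e.g. ".../export.html" -> "export.html".
    ( my $name = $path ) =~ s{.*/}{};
    $name = $path unless length $name;    # path ended in "/": keep it whole

    # Truncate long names so they do not stretch table cells.
    $name = substr( $name, 0, $max_len - 3 ) . "..." if length $name > $max_len;
    return $name;
}

print display_title( { swishdocpath => "http://localhost/apache/export.html" } ), "\n";
print display_title( { swishtitle => "Weight Loss", swishdocpath => "/a/b.html" } ), "\n";
print display_title( {} ), "\n";
```

A helper like this could be applied to the properties hash just before it is handed to the result list, so the template always has something to link.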


Thanks for your help!

Best regards,


Aaron Bazar
http://www.topiasearch.com




Quoting moseley@hank.org:

> On Sat, Sep 13, 2003 at 10:59:31AM -0700, Aaron Bazar wrote:
> > I am not quite sure what you mean. Perhaps I was not clear.
>
> Right.  What I mean is that you provide details so I can reproduce the
> problem.  It's not very efficient otherwise -- you have sent two emails
> and your problem still isn't solved, and I just spent 45 minutes trying
> various things and still couldn't reproduce your problem.
>
> > I have an index with thousands of documents that I use swish.cgi to
> > search.
> >
> > When results are returned, most show up fine. However, if the original
> > HTML document did not have a title, then it shows up in the results
> > list without a title... so there is nothing to "click-on"
> >
> > Here is an example:
> >
> > http://www.healthfind.org/health/weight+loss
>
> Yes, I see that.  It's odd.  (And a 2.8M web page is a bit long, I'll
> note.)
>
> Since you didn't send a way to reproduce it easily, I tried it myself:
> I used "view source" and I could see the URL of the original page.  I
> fetched it with:
>
> moseley@laptop:~/apache$ wget www.megafitness.com/export.html
>
> Then indexed it:
>
> moseley@laptop:~/apache$ cat c
> Defaultcontents HTML*
> StoreDescription HTML* <body> 100000
> SwishProgParameters default http://localhost/apache/export.html
>
> moseley@laptop:~/apache$ swish-e -S prog -i spider.pl -c c
> (geeze, takes a minute and a half to index that one page on my laptop!)
>
> Now search:
>
> moseley@laptop:~/apache$ GET http://localhost/apache/swish.cgi?query=word | grep rank:
>         <dt>1 <a href="http://localhost/apache/export.html">export.html</a> <small>-- rank: <b>1000</b></small></dt>
>                                                              ^^^^^^^^^^^
> And there's the path name used as the title -------------------^
>
>
> So maybe something weird with spidering directly from that site.  So
> just to be sure I then used this config:
>
> moseley@laptop:~/apache$ cat c
> Defaultcontents HTML*
> StoreDescription HTML* <body> 100000
> #SwishProgParameters default http://localhost/apache/export.html
> SwishProgParameters default http://www.megafitness.com/export.html
>
> And started indexing.  After a few minutes I sent spider.pl a SIGHUP to
> tell it to quit spidering:
>
> moseley@laptop:~/apache$ kill -HUP 6556
>
> And then searched as above and the title was there.
>
>
> So what's different?  I have no idea.
>
> Did you test to see which program is not returning the title (swish-e or
> swish.cgi)?
>
> Are you using some other configuration than I'm using?
>
> Are you using something other than the default swish.cgi template
> setting?  I tried all the templates that come with swish.cgi and they
> all worked.
>
> Again, if you want help you need to provide an easy way for me to see
> the problem and, hopefully, reproduce it on my machine.
>
> Or better, since I provided all my steps above, try that, and if that
> works then see how your configuration is different.
>
>
>
>
> >
> > The second result is what I am talking about.
> >
> > Thanks!
> >
> > Aaron Bazar
> >
> >
> >
> > -----Original Message-----
> > From: swish-e@sunsite.berkeley.edu
> > [mailto:swish-e@sunsite.berkeley.edu]On Behalf Of moseley@hank.org
> > Sent: Saturday, September 13, 2003 1:23 PM
> > To: Multiple recipients of list
> > Subject: [SWISH-E] Re: swish.cgi results no path in title
> >
> >
> > On Sat, Sep 13, 2003 at 06:39:34AM -0700, Aaron Bazar wrote:
> > > Hi,
> > >
> > > I have run into an issue with the swish.cgi in version 2.4... Some html
> > > pages that I index do not have a <title> tag .. as far as I know, if
> > > there is no title then swish is supposed to use the docpath as the
> > > title. However, this is not happening. I end up with nothing in the
> > > title... consequently there is no link- just the rank and description.
> > > I have been trying to find where in the perl code this is, with no
> > > luck. Basically, if there is no swishtitle, I would like to put in a
> > > default like "Untitled" (or even the docpath like it is supposed to
> > > work)
> >
> > Try and support what you are saying with examples.  Like this:
> >
> > moseley@laptop:~$ cat 1.html
> > <html>
> > <head>
> > <title></title>
> > </head>
> > <body>
> > bodyword
> > </body>
> >
> > moseley@laptop:~$ swish-e -i 1.html -v0
> > moseley@laptop:~$ swish-e -w bodyword
> > # SWISH format: 2.4.0-pr1
> > # Search words: bodyword
> > # Removed stopwords:
> > # Number of hits: 1
> > # Search time: 0.003 seconds
> > # Run time: 0.087 seconds
> > 1000 1.html "1.html" 63
> > .
> >
> >
> > --
> > Bill Moseley
> > moseley@hank.org
> >
>
> --
> Bill Moseley
> moseley@hank.org
>
Received on Mon Sep 15 00:33:57 2003