Re: [swish-e] spider.pl - modified for bad dynamic pages;

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 08 2007 - 14:25:29 GMT
On Wed, Aug 08, 2007 at 04:12:08PM +0200, Han-Kwang Nienhuys wrote:
> 1. Spider traps
> 
> I encountered a couple of cgi/php scripts that generated nearly
> infinite numbers of unique URIs. I first tried filtering the URLs with
> regexps, but I added a feature that URIs with more than 2
> (user-definable) CGI parameters are counted and after a certain
> user-definable number of similar URLs, the spider stops fetching them.
> 
>    http://example.com/foo.php?a=1&b=2&c=3
> 
> is indexed, but counted as
> 
>    http://example.com/foo.php
> 
> and after 10 (user-definable) times the spider stops following links.

You mean it just strips the query parameters.  That might be useful for
some sites, but not all of them.  It might make a nice option.
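
To make the counting scheme described above concrete, here is a minimal
sketch in Python (not the actual spider.pl change, which is Perl and is
not shown in this thread). The class name TrapGuard and the parameter
names max_params and max_similar are hypothetical. The defaults of 2
and 10 are the values mentioned in the quoted text. The sketch assumes
that "stops" means the spider no longer fetches further similar URLs.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


class TrapGuard:
    """Hypothetical helper: limits fetches of URLs that differ only in their query string."""

    def __init__(self, max_params=2, max_similar=10):
        self.max_params = max_params      # URLs with more parameters than this are counted
        self.max_similar = max_similar    # allowed number of counted URLs per script
        self.seen = Counter()             # query-less URL -> number of counted fetches

    def allow(self, url):
        parts = urlsplit(url)
        params = parse_qsl(parts.query, keep_blank_values=True)
        if len(params) <= self.max_params:
            return True                   # few parameters: never counted
        # Count the URL under its query-less form, e.g. http://example.com/foo.php
        key = f"{parts.scheme}://{parts.netloc}{parts.path}"
        self.seen[key] += 1
        return self.seen[key] <= self.max_similar


guard = TrapGuard()
print(guard.allow("http://example.com/foo.php?a=1&b=2&c=3"))
```

A spider would call allow() before queuing each link. Keying on the
query-less URL is what makes the limit apply per script rather than per
unique URI.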

> 2. Bad LaTeX-generated PDF. Some LaTeX installations generate PDFs
> with a nonstandard font encoding, which are transformed by
> pdftotext into loads of garbage. I try to catch them with a rather
> ad-hoc regexp which seems to work - not really distribution-quality
> code. :-)
> 
> 3. One of our intranet servers delivers everything, including PDFs, as
> content-type text/something. I'm filtering that as well. Also
> questionable code for general use.

Yes, another approach is to download everything and then use something
like file magic numbers to determine the content type.  But, that's
not without problems, either.
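
A minimal sketch of the magic-number approach, in Python for
illustration only. The function name sniff_content_type and the
signature table are assumptions, not part of spider.pl. The table lists
only a few well-known file signatures, and the sketch falls back to the
server's declared type when nothing matches.

```python
# A few well-known leading byte signatures and the types they indicate.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_content_type(body, declared):
    """Return a type guessed from the first bytes of body, else the declared type."""
    for magic, mime in MAGIC_SIGNATURES:
        if body.startswith(magic):
            return mime
    return declared


# A PDF served with a text content type is recognised by its header.
print(sniff_content_type(b"%PDF-1.4\n", "text/plain"))
```

The zip entry shows one of the problems: several container formats
(OpenDocument and Office Open XML files, for example) begin with the
same bytes, so the signature alone does not identify them.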


> 
> 4. If I enable a metagroup 'all' in swish.cgi in order to search for
> keywords that are either in the title/body or in the URL, it doesn't
> work as expected. The reason is that a query "a b" is expanded to
> something like
> 
>   swishdefault=(a b) OR swishtitle=(a b) OR swishdocpath=(a b)
> 
> but it won't find anything. I replaced it by
>                         
>   swishdefault=(a OR b) OR swishtitle=(a OR b) OR swishdocpath=(a OR b)

But that's wrong.  If you search "a b" you are searching for a AND b.
That "all" feature is just supposed to search multiple metanames.
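
The difference between the two expansions can be shown with a toy model
in Python. This is not swish-e's query evaluator. The function names
and the representation of a document as a dict of metaname to word set
are assumptions made purely for illustration.

```python
def expand_metagroup(query, metanames):
    """Build the expansion quoted above: the whole query repeated per metaname."""
    return " OR ".join(f"{name}=({query})" for name in metanames)


def match_all_words_in_one_field(doc, words):
    """Model of name=(a b): some single metaname must contain every word."""
    return any(all(w in field for w in words) for field in doc.values())


def match_any_word_in_any_field(doc, words):
    """Model of name=(a OR b): one word in one metaname is enough."""
    return any(any(w in field for w in words) for field in doc.values())


names = ["swishdefault", "swishtitle", "swishdocpath"]
print(expand_metagroup("a b", names))

# "a" occurs only in the body, "b" only in the path.
split_doc = {"swishdefault": {"a"}, "swishtitle": set(), "swishdocpath": {"b"}}
# "a" occurs in the body, "b" occurs nowhere.
partial_doc = {"swishdefault": {"a"}, "swishtitle": set(), "swishdocpath": {"x"}}

print(match_all_words_in_one_field(split_doc, ["a", "b"]))
print(match_any_word_in_any_field(partial_doc, ["a", "b"]))
```

In this model the first form misses split_doc, which is the behaviour
reported above, while the OR form also accepts partial_doc, which lacks
"b" entirely and so is no longer an a AND b search.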


> 
> but the ranking algorithm doesn't seem to give a bonus to documents
> that contain both a and b somewhere. To really fix this, the indexer
> should be made able to create a metaname database column for words
> that are in any of swishdefault and swishdocpath. However, I couldn't
> find any suitable configuration options and I'm not sure I'm willing
> to invest the time to figure out how to modify the source code myself.

Peter, shouldn't the ranking algorithm rank multiple matches higher?


> 5. I added the minus sign "-" as an alias for the NOT operator in CGI
> queries, so that people used to Google don't have to remember a
> different syntax.

That's nice.
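
One way such an alias could be implemented is to rewrite the query
string before it reaches the search engine. The sketch below is Python
for illustration, not the change made to swish.cgi. The function name
minus_to_not is hypothetical, and the sketch makes no attempt to handle
quoted phrases.

```python
import re

# A minus that starts a token (start of string or after whitespace)
# and is immediately followed by a word character.
_LEADING_MINUS = re.compile(r"(?<!\S)-(?=\w)")


def minus_to_not(query):
    """Rewrite Google-style '-word' as 'NOT word'; hyphens inside words are left alone."""
    return _LEADING_MINUS.sub("NOT ", query)


print(minus_to_not("spider -trap"))
```

The lookbehind keeps hyphenated terms such as "well-known" intact,
because the minus there is preceded by a non-space character.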

Do you have a patch against svn?

Thanks,

-- 
Bill Moseley
moseley@hank.org

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Wed Aug 8 10:25:30 2007