Re: [swish-e] spider.pl - modified for bad dynamic pages;

From: Han-Kwang Nienhuys <h.nienhuys(at)not-real.amolf.nl>
Date: Wed Aug 08 2007 - 15:11:17 GMT

On Wed, 08 Aug 2007, Bill Moseley wrote:

> >    http://example.com/foo.php?a=1&b=2&c=3
> > is indexed, but counted as
> >    http://example.com/foo.php
> > and after 10 (user-definable) times the spider stops following links.
> You mean it just strips the query parameter.  Might be useful for some
> sites, but not every.  Might be a nice option.

No, I mean that this happens:

http://example.com/foo.php?a=1&b=2&c=0 - indexed
http://example.com/foo.php?a=1&b=2&c=1 - indexed
http://example.com/foo.php?a=1&b=2&c=2 - indexed
http://example.com/foo.php?a=1&b=2&c=3 - indexed
http://example.com/foo.php?a=1&b=2&c=4 - indexed
http://example.com/foo.php?a=1&b=2&c=5 - indexed
http://example.com/foo.php?a=1&b=2&c=6 - indexed
http://example.com/foo.php?a=1&b=2&c=7 - indexed
http://example.com/foo.php?a=1&b=2&c=8 - indexed
http://example.com/foo.php?a=1&b=2&c=9 - indexed
http://example.com/foo.php?a=1&b=2&c=10 - not indexed, already 10 similar URLs
http://example.com/bar.php?a=1&b=2&c=1 - indexed
http://example.com/baz.php?a=10 - indexed unconditionally because only
   a single CGI parameter
http://example.com/foo.php?a=11 - also indexed unconditionally
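
The counting rule in the list above can be sketched roughly as follows.
This is an illustration in Python, not the actual change to spider.pl
(which is Perl); the names make_url_filter and should_index are made up,
and the "single CGI parameter" exemption is assumed to mean "at most one
parameter".

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

MAX_SIMILAR = 10  # the user-definable limit


def make_url_filter(max_similar=MAX_SIMILAR):
    """Return a should_index(url) function with its own counters."""
    seen = Counter()  # accepted URLs, keyed by the URL without its query

    def should_index(url):
        parts = urlsplit(url)
        params = parse_qsl(parts.query, keep_blank_values=True)
        # Assumption: zero or one CGI parameter is always accepted.
        if len(params) <= 1:
            return True
        key = (parts.scheme, parts.netloc, parts.path)
        if seen[key] >= max_similar:
            return False  # already max_similar similar URLs
        seen[key] += 1
        return True

    return should_index


if __name__ == "__main__":
    should_index = make_url_filter()
    for c in range(12):
        url = f"http://example.com/foo.php?a=1&b=2&c={c}"
        print(url, "indexed" if should_index(url) else "not indexed")
```

Counting per script path rather than per full URL is what stops a
script that generates endless parameter combinations, while leaving
other scripts on the same server unaffected.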

I already got stuck in three spider traps on three different servers
within our domain, and I'd prefer to not manually interrupt spider
sessions and filter certain URLs by hand every time someone puts up a
buggy php script without thinking of robots.txt or robots meta tags. I
know that Google does not like URLs with an excessive number of CGI
parameters, especially if the parameters are long numbers (possibly
sessionids).

> >   swishdefault=(a b) OR swishtitle=(a b) OR swishdocpath=(a b)
> > but it won't find anything. I replaced it by
> >   swishdefault=(a OR b) OR swishtitle=(a OR b) OR swishdocpath=(a OR b)
> But that's wrong.  If you search "a b" you are searching for a AND b.
> That "all" feature is just supposed to search multiple metanames.

Yes, I agree, but the first is even more wrong IMO. I have a collection
of pages with URIs like "http://example.com/publications/pub1234.html",
each containing in its body an author name such as "john" and a title
such as "About search engines".

It really took me a long time (playing with the configuration
files, re-indexing for half an hour, and finally checking the cgi
parser code) before I understood why a search for

  publications john

wouldn't give any publications by john, even though

  publications pub1234

and

  john search engine

would both return the right page.
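
To make the difference between the two expansions concrete, here is a
toy model in Python. It is not the swish.cgi code; expand_query, matches
and the sample document are invented for illustration, the contents of
swishtitle are a guess, and the matcher ignores stemming and ranking.

```python
FIELDS = ("swishdefault", "swishtitle", "swishdocpath")


def expand_query(words, fields=FIELDS, inner_op=None):
    """Build the multi-metaname query string for a list of words."""
    joiner = f" {inner_op} " if inner_op else " "
    group = "(" + joiner.join(words) + ")"
    return " OR ".join(f"{field}={group}" for field in fields)


def matches(doc, words, all_in_one_field):
    """Toy matcher: doc maps a metaname to its set of indexed words."""
    if all_in_one_field:
        # field=(a b): every word must occur in the same field
        return any(all(w in doc[f] for w in words) for f in doc)
    # field=(a OR b): any word in any field is enough
    return any(w in doc[f] for f in doc for w in words)


# Hypothetical index entry for one publication page.
page = {
    "swishdefault": {"john", "about", "search", "engines"},
    "swishtitle": {"about", "search", "engines"},
    "swishdocpath": {"http", "example", "com", "publications",
                     "pub1234", "html"},
}

if __name__ == "__main__":
    print(expand_query(["a", "b"]))
    print(expand_query(["a", "b"], inner_op="OR"))
    for q in (["publications", "pub1234"], ["john", "search"],
              ["publications", "john"]):
        print(q, matches(page, q, True), matches(page, q, False))
```

In this model "publications john" fails under the first expansion
because no single metaname contains both words, which is the behaviour
described above.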

> Peter, shouldn't the ranking be ranking multiple matches higher?

Well, I didn't dive into the code, but with the OR search,
i.e. swishdefault=(john OR publications) OR swishdocpath=(john OR
publications), all the top results had either john or publications
multiple times.

> > 5. I added the minus sign "-" as an alias for the NOT operator in CGI
> Do you have a patch against svn?

I'm very green regarding version control systems and patch
files. Should I give you a "diff -u" from the original 2.4.5 version
to the patched version?
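
For reference, the idea behind the "-" alias from item 5 is roughly the
following. This is only a Python sketch of the concept, not the patch
itself; expand_minus is a made-up name, and it assumes the alias applies
to whitespace-separated words that start with a minus sign.

```python
def expand_minus(query):
    """Rewrite a leading "-" on a word as the NOT operator."""
    out = []
    for tok in query.split():
        if tok.startswith("-") and len(tok) > 1:
            out.extend(["NOT", tok[1:]])  # "-word" becomes "NOT word"
        else:
            out.append(tok)  # a bare "-" or ordinary word is kept as-is
    return " ".join(out)


if __name__ == "__main__":
    print(expand_minus("john -publications"))
```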

I now remember that I made another modification in the spider:

6. allow the spider to follow links from one server to another within
our intranet domain, i.e. start at www.example.com and also index
links to wiki.example.com, caesrv.example.com, and so on, without
having to list the servers one by one. I did it by replacing the
server check with a regexp check against our domain name. This feature
doesn't integrate well with the concept of a per-server configuration,
but it works for me.
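
The domain check amounts to something like this, shown in Python rather
than the Perl of spider.pl. The pattern and the name same_domain are
illustrative assumptions, with example.com standing in for the real
domain.

```python
import re
from urllib.parse import urlsplit

# Matches example.com itself and any host ending in .example.com.
# Anchoring on "start or dot" keeps notexample.com from matching.
DOMAIN_RE = re.compile(r"(^|\.)example\.com$", re.IGNORECASE)


def same_domain(url):
    """True if the URL's host lies within the intranet domain."""
    host = urlsplit(url).hostname or ""
    return bool(DOMAIN_RE.search(host))


if __name__ == "__main__":
    for url in ("http://www.example.com/",
                "http://wiki.example.com/page",
                "http://notexample.com/",
                "http://example.com.evil.org/"):
        print(url, same_domain(url))
```

Matching against the parsed host name rather than the whole URL avoids
accepting links that merely mention the domain in a path or query.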

Han-Kwang
Received on Wed Aug 8 11:11:23 2007