[swish-e] spider.pl - modified for bad dynamic pages; swish.cgi modified

From: Han-Kwang Nienhuys <h.nienhuys(at)not-real.amolf.nl>
Date: Wed Aug 08 2007 - 14:12:08 GMT
Hi,

I've been playing with swish-e 2.4.5 for a couple of days (for
indexing an intranet with around 8000 pages and PDFs) and have been
modifying the spider scripts a bit. Not all mods are suitable for the
rest of the world, but some are and I wonder whether it would be
appreciated if I clean them up for general use.

1. Spider traps

I encountered a couple of cgi/php scripts that generated nearly
infinite numbers of unique URIs. I first tried filtering the URLs with
regexps, but ended up adding a feature instead: URIs with more than 2
(user-definable) CGI parameters are counted per script, and after a
user-definable number of similar URLs the spider stops fetching them.

   http://example.com/foo.php?a=1&b=2&c=3

is indexed, but counted as

   http://example.com/foo.php

and after 10 (user-definable) times the spider stops following links.
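
To make the counting rule concrete, here is a minimal sketch in Python
(not the actual spider.pl change, which is Perl). The names MAX_PARAMS,
MAX_SIMILAR and should_fetch are made up for illustration; the limits
of 2 and 10 are the defaults mentioned above.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 2    # URIs with more than this many CGI parameters are counted
MAX_SIMILAR = 10  # stop fetching after this many similar URIs

seen = Counter()  # hypothetical per-crawl counter, keyed by URI without query


def should_fetch(url, counts=seen):
    """Return True if the spider should still fetch this URL."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if len(params) <= MAX_PARAMS:
        return True  # few parameters: never counted
    # Count the URL under its script, ignoring the query string.
    key = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if counts[key] >= MAX_SIMILAR:
        return False  # looks like a spider trap, stop here
    counts[key] += 1
    return True
```

With these settings the first 10 many-parameter URLs for foo.php are
fetched and any further ones are skipped, while URLs with 2 or fewer
parameters are unaffected.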

2. Bad LaTeX-generated PDFs. Some LaTeX installations generate PDFs
with a nonstandard font encoding, which pdftotext turns into loads of
garbage. I try to catch them with a rather ad-hoc regexp which seems
to work, though it's not really distribution-quality code. :-)
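
For illustration only, one possible heuristic of this kind is sketched
below in Python. It is not the regexp I actually use; the function name
looks_like_garbage and the 0.5 threshold are assumptions, and the
letters-only test would misfire on text that is mostly numbers or
written in a non-Latin script.

```python
import re

# A token counts as "word-like" if it is two or more ASCII letters.
WORD = re.compile(r"[A-Za-z]{2,}")


def looks_like_garbage(text, min_word_ratio=0.5):
    """Guess whether extracted text is mostly non-words."""
    tokens = text.split()
    if not tokens:
        return False  # nothing extracted; not the garbage case
    wordlike = sum(
        1 for t in tokens if WORD.fullmatch(t.strip(".,;:!?()\"'"))
    )
    return wordlike / len(tokens) < min_word_ratio
```

A document flagged this way could be skipped or indexed by title and
URL only.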

3. One of our intranet servers delivers everything, including PDFs, as
content-type text/something. I'm filtering that as well, but this too
is questionable code for general use.
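
One way to work around a server like this is to look at the first
bytes of the body instead of trusting the header. The Python sketch
below only illustrates that idea and is not my filter; the function
name sniff_content_type is hypothetical. It relies on PDF files
starting with the marker "%PDF-".

```python
def sniff_content_type(declared_type, body):
    """Return a corrected content type based on the body's first bytes.

    declared_type: the Content-Type value sent by the server (str)
    body: the raw response body (bytes)
    """
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    return declared_type  # no evidence against the header; keep it
```

The same approach extends to other formats with a fixed signature, at
the cost of maintaining a list of them.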

4. If I enable a metagroup 'all' in swish.cgi in order to search for
keywords that are either in the title/body or in the URL, it doesn't
work as expected. The reason is that a query "a b" is expanded to
something like

  swishdefault=(a b) OR swishtitle=(a b) OR swishdocpath=(a b)

but it won't find anything. I replaced it with

  swishdefault=(a OR b) OR swishtitle=(a OR b) OR swishdocpath=(a OR b)

but the ranking algorithm doesn't seem to give a bonus to documents
that contain both a and b somewhere. To really fix this, the indexer
should be made able to create a metaname database column for words
that are in any of swishdefault and swishdocpath. However, I couldn't
find any suitable configuration options and I'm not sure I'm willing
to invest the time to figure out how to modify the source code myself.
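
The two expansions can be reproduced with a few lines of Python. This
is only a sketch of the string rewriting, not the swish.cgi code; the
function name expand_metagroup is made up. Presumably the first form
fails because each clause requires all words to occur in the same
metaname.

```python
def expand_metagroup(query, metanames, operator=None):
    """Expand a query over several metanames, joined with OR.

    operator: optional word placed between the query words,
              e.g. "OR"; None keeps the words as typed.
    """
    words = query.split()
    sep = f" {operator} " if operator else " "
    inner = sep.join(words)
    return " OR ".join(f"{name}=({inner})" for name in metanames)


names = ["swishdefault", "swishtitle", "swishdocpath"]
print(expand_metagroup("a b", names))        # original expansion
print(expand_metagroup("a b", names, "OR"))  # my replacement
```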

5. I added the minus sign "-" as an alias for the NOT operator in CGI
queries, so that people used to Google don't have to remember a
different syntax.
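
As a rough Python sketch of that rewrite (again not the actual
swish.cgi change; translate_minus is a hypothetical name): a minus
sign that starts a word becomes "NOT ", while hyphens inside words are
left alone. The sketch does not treat quoted phrases specially.

```python
import re

# A "-" at the start of the query or after whitespace,
# directly followed by a word character.
_LEADING_MINUS = re.compile(r"(?<!\S)-(?=\w)")


def translate_minus(query):
    """Rewrite Google-style '-word' as 'NOT word'."""
    return _LEADING_MINUS.sub("NOT ", query)
```

For example, "swish -spider" becomes "swish NOT spider", and "swish-e"
stays as it is.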

Han-Kwang

Received on Wed Aug 8 10:12:09 2007