strange indexing order in swish-e 2.2 rc1

From: Trond Nilsen <t.nilsen(at)not-real.alchemy.co.nz>
Date: Mon Sep 02 2002 - 22:08:11 GMT

Hey.

I've been playing about with Swish-E 2.2 rc1 a bit for the last day or two, 
mostly using the HTTP method.

For part of this time, I was convinced that it was arbitrarily labelling files 
as 'already indexed' during spidering.

A simple case demonstrates the strangeness.

I'm indexing a site with three HTML files:

foo.html, containing links to bar.html and bas.html
bar.html, containing links to foo.html and bas.html
bas.html, containing links to foo.html and bar.html

Pointing swish-e at foo.html as the starting URL:


1  D:\temp\swish22test>d:\progra~1\swish-e22\swish-e -c test.cfg -S http
2
3  Indexing Data Source: "HTTP-Crawler"
4  Indexing "http://localhost/foo.html"
5  Returned 0
6  retrieving http://localhost/foo.html (0)...
7  Returned 0
8   - Using DEFAULT (HTML) parser -  (4 words)
9  retrieving http://localhost/bar.html (1)...
10 Returned 0
11  - Using DEFAULT (HTML) parser -  (4 words)
12 Skipping http://localhost/foo.html:  Already indexed.
13 Skipping http://localhost/bas.html:  Already indexed.
14 retrieving http://localhost/bas.html (1)...
15 Returned 0
16  - Using DEFAULT (HTML) parser -  (4 words)
17 Skipping http://localhost/foo.html:  Already indexed.
18 Skipping http://localhost/bar.html:  Already indexed.

   <snip rest of output>

Note that by the time swish has reached line 13, it has indexed 'foo.html' 
and 'bar.html', but not 'bas.html'. Nevertheless, it incorrectly reports 
'bas.html' as 'Already indexed'; the page is not actually retrieved until 
line 14.

I presume that what's happening is that swish builds up a list of pages to 
index and issues the 'Already indexed' message whenever a page already exists 
in that list, when it should really just be ignoring duplicates. 
'Already indexed' should be reserved for files that have indeed been indexed 
already.
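
To illustrate the distinction, here is a minimal Python sketch of a crawler 
that tracks "queued" and "indexed" pages separately. It is not swish-e's 
actual spider code; the SITE dictionary, the crawl() function, and the 
'Already queued' message are all hypothetical, made up to mirror the 
three-page test site above.

```python
from collections import deque

# Hypothetical link graph mirroring the three-page test site.
SITE = {
    "foo.html": ["bar.html", "bas.html"],
    "bar.html": ["foo.html", "bas.html"],
    "bas.html": ["foo.html", "bar.html"],
}


def crawl(start, links):
    """Breadth-first crawl that reports queued and indexed pages separately."""
    queued = {start}          # every page ever placed on the to-do list
    indexed = set()           # pages whose content has actually been indexed
    pending = deque([start])  # to-do list, processed in discovery order
    log = []

    while pending:
        page = pending.popleft()
        log.append(f"retrieving {page}")
        indexed.add(page)     # stand-in for fetching and indexing the page

        for target in links.get(page, []):
            if target in indexed:
                # The page really has been indexed already.
                log.append(f"Skipping {target}: Already indexed.")
            elif target in queued:
                # Only a duplicate on the to-do list, not yet indexed.
                log.append(f"Skipping {target}: Already queued.")
            else:
                queued.add(target)
                pending.append(target)
    return log


if __name__ == "__main__":
    for line in crawl("foo.html", SITE):
        print(line)
```

With this bookkeeping, the link from bar.html to bas.html is reported as 
'Already queued' rather than 'Already indexed', because bas.html is on the 
list but has not been retrieved at that point.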

It took me a while to isolate this - indexing a real site, I got large 
quantities of 'Already indexed' messages which hid what was going on, and 
confused the hell out of me.

Is this a known problem? Or is it some sort of idiosyncrasy I've managed to 
introduce in my config? I've got the test site and my config file here if 
anyone wants to take a look. Also, I'm using the Swish-E 2.2 rc1 Windows 
installed version.

If I have time this afternoon, and I've not heard anything else, I might go 
bug hunting myself.

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Trond Nilsen                                                   Alchemy Group
Software Engineer                                   http://www.alchemy.co.nz
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Received on Mon Sep 2 22:11:42 2002