Indexing largish document sets from filesystem with obeyRobotsNoIndex set causes core dump

From: Peter Farmer <peter.farmer(at)>
Date: Mon Jan 13 2003 - 10:41:28 GMT

We have been happily trialling swish-e on our client's HP-UX 10.20 systems
for several months now, on sets of between 20 and 10,000 files. After
realizing that we needed to exclude a small number of files in arbitrary
places from being indexed, I turned on obeyRobotsNoIndex (as the files were
already set up to block indexing from the web).

The indexing process then consistently core dumped on our largest data set
(at the same file every time) unless I forced it to use HTML parsing. Of
course, obeyRobotsNoIndex then has no useful effect, as it requires the HTML2
parser (which I want to use for all indexing anyway). Not wanting to end up
maintaining a lot of arbitrary FileRules entries unnecessarily, I have
attempted to debug the problem. What I have found so far suggests that a
problem reported last year may still be present in the 2.2 code base.
From message 4541 in the archive:

Bill Moseley wrote:
> There was a bug in the code that handled removing files (when that no index
> meta tag is found swish has to back-out the additions to the index up to
> that point for the current file). But that should have been fixed. Maybe
> there's still another problem.

I have only been able to reproduce this on the large data set (9k+ files).
In my testing so far, the segmentation violation occurs only if some files
with the
<meta name="robots" content="noindex"> tag
have been encountered earlier in the indexing process (skipped "due to Robots
Exclusion Rule in meta tag").

The segv occurs in the CompressCurrentLocEntry routine (compress.c) as
swish-e is indexing the next indexable file.
The exact point of failure is line 594: next = l->next

At that point the currentChunkLocation instance pointer (l) is null, having
been set to null by a prior pass through the hash list walker (a 'for' loop
that only terminates when the entry->currentlocation marker matches 'l').
It appears that either the 'for' loop is lacking an extra loop termination
test or, more likely, a prior modification to the currentChunkLocationList
for the 'ENTRY' instance has failed to set the correct link when terminating
the currentChunkLocation chain. My guess is that this could have happened
during a remove_last_file_from_list() call made during processing of the
robot-excluded files.
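
The failure mode described above can be illustrated with a minimal,
self-contained C sketch. It is not the swish-e source: the `loc` and `entry`
types and the functions `walk_to_marker` and `remove_file_locations` are
hypothetical stand-ins, assuming a singly linked chain of locations and a
marker pointer at which the walker stops. It shows a walker that tests for
the end of the chain instead of dereferencing a null pointer, and a back-out
routine that keeps the marker reachable when it unlinks a file's entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the location structures; the real
 * definitions in swish-e 2.2 are different. */
typedef struct loc {
    struct loc *next;
    int filenum;          /* file that produced this location */
} loc;

typedef struct {
    loc *chunk_list;      /* head of the current chunk's location chain */
    loc *current;         /* marker at which the walker is meant to stop */
} entry;

/* Walk from the head until the marker is reached.
 * Returns the number of nodes visited, or -1 if the chain ends before
 * the marker is found. Without the NULL test, that second case would
 * dereference a null pointer, as in "next = l->next". */
int walk_to_marker(const entry *e)
{
    assert(e != NULL);
    int visited = 0;
    const loc *l = e->chunk_list;
    while (l != e->current) {
        if (l == NULL)
            return -1;    /* marker is not reachable from the head */
        l = l->next;
        visited++;
    }
    return visited;
}

/* Back out every location recorded for filenum and repair the marker
 * if it pointed at a removed node. Nodes are only unlinked, not freed;
 * in this sketch the caller owns their storage. */
void remove_file_locations(entry *e, int filenum)
{
    assert(e != NULL);
    loc **link = &e->chunk_list;
    while (*link != NULL) {
        loc *node = *link;
        if (node->filenum == filenum) {
            *link = node->next;              /* unlink the node */
            if (e->current == node)
                e->current = node->next;     /* keep the marker reachable */
            node->next = NULL;
        } else {
            link = &node->next;
        }
    }
}
```

With a guard like this, a chain that ends before the marker is reported as an
error rather than crashing, which would at least confirm whether backing out
a robots-excluded file is what leaves the marker unreachable.
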
I have run out of time to follow this further on my own. Can I register this
as a bug on SourceForge, or is this posting sufficient to have someone more
familiar with the code investigate?

Some relevant info from gdb:

gdb /local/bin/swish-e-2.2.2 core
GNU gdb 5.2
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "hppa2.0-hp-hpux10.20"...
Core was generated by `swish-e-2.2.2'.
Program terminated with signal 11, Segmentation fault.

warning: The shared libraries were not privately mapped; setting a
breakpoint in a shared library will not work until you rerun the program.

Reading symbols from /usr/local/bin/swish-e...done.
Reading symbols from /usr/local/lib/
Reading symbols from /usr/lib/libM.1...done.
Reading symbols from /usr/local/lib/
Reading symbols from /usr/lib/libc.1...done.
Reading symbols from /usr/lib/libdld.1...done.
#0  CompressCurrentLocEntry (sw=0x40048be8, indexf=0x400e6530, e=0x401e89d4)
    at compress.c:594
594             next = l->next;

(gdb) bt
#0  0x000492ec in CompressCurrentLocEntry (sw=0x40048be8, indexf=0x400e6530, 
    e=0x401e89d4) at compress.c:594
#1  0x000129d0 in _dmatherr () at index.c:940
#2  0x00034d84 in printfile (sw=0x40048be8, 
    filename=0x400d9b00 "/data/WWW/cwco/index.html") at fs.c:601
#3  0x00034ea8 in printfiles (sw=0x40048be8, e=0x400d98d0) at fs.c:642
#4  0x00034914 in indexadir (sw=0x40048be8, 
    dir=0x400ef920 "/data/WWW/cwco") at fs.c:445
#5  0x00034ff4 in printdirs (sw=0x40048be8, e=0x400ef6d0) at fs.c:680
#6  0x00034924 in indexadir (sw=0x40048be8, dir=0x400f1300 
    at fs.c:446
#7  0x00035220 in fs_indexpath (sw=0x40048be8, path=0x400f1300 "/data/WWW/")
    at fs.c:733
#8  0x00029a1c in indexpath (sw=0x40048be8, path=0x400f1300 
    at file.c:193
#9  0x00010024 in cmd_index (sw=0x40048be8, params=0x400e5cf0) at swish.c:1121
#10 0x0000db3c in y1 () at swish.c:179


Peter Farmer            |  Custom XML software   | Internet Engineering
Zveno Pty Ltd           | Website XML Solutions  | Training & Seminars
                        |   Open Source Tools    |   - XML XSL Tcl
                        +------------------------+---------------------
Ph. +61 8 92036380      | Mobile +61 417 906 851 | Fax +61 8 92036380
Received on Mon Jan 13 10:41:46 2003