Indexing stops early

From: David Chisholm <david(at)not-real.lonecrow.net>
Date: Sat Nov 19 2005 - 16:51:23 GMT
Hi, I have used swish-e to crawl and index a few collections of sites 
with little or no trouble, but I have just run into something that has 
me stumped.

I used spider.pl to crawl about 800 sites to a text file and now I am 
trying to index it via the prog method. It seems to give up after ~500 
pages indexed even though I am sure there are many thousands. It's as if 
it's interpreting something as a stop directive, but I am not sure what. 
Is there something I can look for in my prog output?

I used this same method on a different set of about 400 sites and it 
worked fine, so I don't think it's either of my config files.

There is no error message; it just gives me "Indexing done!" after 516 
files processed.

The text file I am feeding it is half a GB and probably contains 30,000 
to 40,000 web pages. What character does swish-e look for to mark EOF, 
or to decide where to end indexing?
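
One way to look for the spot where the input stops making sense is to walk the file record by record. The sketch below is only that, a sketch: it assumes each document in the spider output starts with `Name: value` header lines that include `Path-Name` and `Content-Length`, followed by a blank line and then exactly `Content-Length` bytes of content. The function name `scan_prog_stream` is made up for illustration, and the real swish-e reader may be stricter or looser than this.

```python
import io


def scan_prog_stream(stream):
    """Walk a prog-style stream opened in binary mode.

    Returns (records, problem):
      records: list of (byte_offset, path_name, content_length)
      problem: None on a clean end of input, else (byte_offset, message)
    Assumes Path-Name and Content-Length headers, then a blank line,
    then exactly Content-Length bytes of body.
    """
    records = []
    while True:
        start = stream.tell()
        headers = {}
        line = stream.readline()
        if not line:
            return records, None  # clean end of input
        # Collect header lines until the blank line that ends them.
        while line not in (b"\n", b"\r\n"):
            if not line:
                return records, (start, "input ended inside a header block")
            name, sep, value = line.partition(b":")
            if not sep:
                return records, (start, "not a header line: %r" % line[:60])
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        if b"path-name" not in headers or b"content-length" not in headers:
            return records, (start, "missing Path-Name or Content-Length")
        try:
            length = int(headers[b"content-length"])
        except ValueError:
            return records, (start, "Content-Length is not a number")
        body = stream.read(length)
        if len(body) < length:
            return records, (start, "body shorter than Content-Length")
        records.append((start, headers[b"path-name"].decode("latin-1"), length))
```

Opening crawledlinks.txt with `open("crawledlinks.txt", "rb")` and passing it in would give the number of well-formed records and, if something is off, the byte offset of the first record that does not parse. A `Content-Length` that disagrees with the real body size tends to show up as a "not a header line" problem at the start of the following record.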

Here is some relevant information about how I am running it.

SWISH-E CONFIG:
	DefaultContents HTML2
	StoreDescription HTML2 <description> 10000
	IndexContents HTML* .htm .html .shtml .asp .aspx .cfm .php .cgi
	IndexOnly .htm .html .shtml .asp .aspx .cfm .php .cgi
	HTMLLinksMetaName swishdefault

SPIDERCONFIG.PL: (relevant portions; the actual file contains 762 sites)

   my %site[site1] = (
	base_url   => '[site1]',
	email      => 'admin@somedomain.com',
	agent      => 'somedomain.com/bot/',
	link_tags           => [qw/ a frame area /],
	keep_alive          => 1,
	test_url            => sub { $_[0]->path !~ /\.(?:gif|jpeg|png|mov|avi|mp3|pdf)$/i },
	test_response => sub {
	my $content_type = $_[2]->content_type;
	    return $content_type =~ m!text/html!;
	},
	use_head_requests   => 1,  # Due to the response sub
	delay_sec   		=> 0,
	max_indexed   		=> 200,
	max_time		=> 2,
	remove_leading_dots => 1,
	credential_timeout 	=> undef,
..
@servers = (\%site[site1],...

CRAWLED WITH:
	spider.pl spiderconfig.pl > crawledlinks.txt

INDEXING WITH:
	swish-e -e -c swishconfig.cfg -S prog -i stdin < crawledlinks.txt -f index.links


RESULT of the crawl: a text file of ~516MB.
RESULTS of indexing:
	Indexing Data Source: "External-Program"
	Indexing "stdin"
	Removing very common words...
	no words removed.
	Writing main index...
	Sorting words ...
	Sorting 13,037 words alphabetically
	Writing header ...
	Writing index entries ...
	  Writing word text: Complete
	  Writing word hash: Complete
	  Writing word data: Complete
	13,037 unique words indexed.
	5 properties sorted.
	516 files indexed.  5,308,261 total bytes.  244,662 total words.
	Elapsed time: 00:00:02 CPU time: 00:00:01
	Indexing done!

Q1) What should I be looking for in the output that's making it stop?

Q2) When it reports 5,308,261 total bytes, is that the position in the 
source text file including headers? Or is that simply the size of the 
portions actually indexed? E.g., if I went to that position in my source 
text, would I find the offending string that is causing it to stop?
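
One way to tell the two readings apart would be to compute both numbers for the first 516 documents and see which one, if either, matches the reported total. This is a rough sketch under the same assumption about the input layout (header lines including `Content-Length`, a blank line, then that many bytes of content); `totals_after` is a hypothetical helper name, and nothing here says how swish-e itself counts bytes.

```python
import io


def totals_after(stream, n_docs):
    """Return (content_bytes, file_offset) after the first n_docs documents.

    content_bytes sums only the document bodies; file_offset is the
    position in the file, so it also counts the header lines.
    The stream must be opened in binary mode.
    """
    content_bytes = 0
    for _ in range(n_docs):
        length = None
        # Read header lines up to the blank line (or end of input).
        while True:
            line = stream.readline()
            if line in (b"", b"\n", b"\r\n"):
                break
            name, sep, value = line.partition(b":")
            if sep and name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            break  # no more parsable documents
        body = stream.read(length)
        content_bytes += len(body)
    return content_bytes, stream.tell()
```

If `totals_after(f, 516)` returned a `content_bytes` of 5,308,261, the reported figure would be the size of the indexed content, and the returned `file_offset` would be the place in crawledlinks.txt to start looking for whatever comes next.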

Thanks for any advice or pointers.
DAve





Received on Sat Nov 19 08:51:42 2005