
Funky, unknown errors

From: Alan Ivey <ai4891(at)not-real.yahoo.com>
Date: Fri Jul 02 2004 - 12:26:51 GMT
I'm using the internal spider because, being inexperienced at Perl, I found it easier to use FileFilters. I ran some short tests (MaxDepth at 2 or 3) and everything worked great. Wednesday afternoon I started an index of ten websites, and I came back this morning to the error messages below. I hope I have supplied enough information.

-------------------------

At the command line, I ran...
$ swish-e -c httpfilefilter.conf -S http -v 2 -e

-------------------------

# httpfilefilter.conf
These are all my comments, obviously. And the websites listed below don't exist, so don't try them... I changed the names to protect the innocent :)

# General Information
IndexName		"NMIS Index"
IndexDescription	"Some NMIS sites." 
IndexPointer		http://www.nmis.tk/pointer.html
IndexAdmin		admin

# Index file destination
IndexFile		/swishdb/nmis.index

# This links to the list of web sites
IncludeConfigFile	sitelist.conf

# Internal HTTP Access-only directives
#MaxDepth		2
#Delay 			1
TmpDir			/swishdb/tmp

# "DefaultContents should be used over IndexContents
for using the internal spider"
# - too bad it make the results page on swish.cgi look
terrible! It shows the HTML tags!
IndexContents 		HTML* .htm .html .shtml .php .asp .cfm
IndexContents		TXT*  .txt .pdf .ppt .doc .rtf

# Do the translations
FileFilter 		.pdf /usr/bin/pdftotext "'%p' -"
FileFilter		.ppt /home/user/swish/myppt2html.sh
FileFilter 		.doc /usr/local/bin/catdoc "-a -s8859-1 -d8859-1 '%p'"
FileFilter 		.rtf /usr/bin/strings "'%p'"

# These file extensions will NOT have their contents indexed, but the path name will be instead
NoContents 		.jpg .gif .swf .mov .mpg .avi .mp3

# You've got to save SOME data for the swish.cgi search results page!
StoreDescription 	HTML* <body> 50000
StoreDescription	TXT* 50000
PropCompressionLevel	9

# No one searches for these words, so don't index them
IgnoreWords		www http a an the of and or

# Make the bullets on swish.cgi work! swishdocpath is the default; make sure these are in swishcgi.conf
MetaNames		swishdocpath swishtitle swishdescription swishlastmodified

--------------------

# sitelist.conf

IndexDir	http://code200.nmis.tk
IndexDir	http://aar.nmis.tk/
IndexDir	http://aribh.nmis.tk/
IndexDir	http://atm.nmis.tk/
IndexDir	http://deppjhu.nmis.tk/
IndexDir	http://exp.gbfy.nmis.tk/
IndexDir	http://etth.nmis.tk/
IndexDir	http://iss2www.cdcsc.nmis.tk/
IndexDir	http://lgis.nmis.tk/
IndexDir	http://scifi.bbnc.nmis.tk/

--------------------

I wrote my own FileFilter. Like I said, I'm not a Perl expert by any means, so the only way I could get the results I wanted was to duplicate what I had done while testing at the command line. Since you can't do anything like "FileFilter .ppt command | command" (a pipe), I wrote a simple bash script to run the pipe for me. The Perl script then strips the HTML tags from the ppthtml output, since I really just want ppt-to-text. This is my humble attempt at it :)

## myppt2html.sh
#!/bin/sh

# In swish-e's conf file you can't run pipes, so this script runs the pipe instead!

ppthtml "$1" | perl myppt2html.pl

## myppt2html.pl
#! /usr/bin/perl -w
use strict;

# ==========================================================
# == removes HTML tags from search results off .ppt files ==
# ==========================================================

my $temp;
my $count = 0;
while ($temp=<STDIN>){
    $temp =~ s/<\!DOCTYPE HTML PUBLIC \"-\/\/W3C\/\/DTD HTML 4.0\/\/EN\">//g;
    $temp =~ s/<br>//g;
    $temp =~ s/<BR>//g;
    $temp =~ s/<HR>//g;
    $temp =~ s/<HTML>//g;
    $temp =~ s/<TITLE>//g;
    $temp =~ s/<HEAD>//g;
    $temp =~ s/<BODY>//g;
    $temp =~ s/<\/HTML>//g;
    $temp =~ s/<\/TITLE>//g;
    $temp =~ s/<\/HEAD>//g;
    $temp =~ s/<\/BODY>//g;
    
    # skip the second line of ppthtml's output
    if ($count != 1){
        print $temp;
    }
    $count++;
}
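
For what it's worth, here is the single-script version I have been meaning to try, which would replace both files above. Treat it as an untested sketch: it assumes ppthtml is on the PATH and that the FileFilter hands the file path over as the first argument, the same way myppt2html.sh receives it now. Since .ppt is listed under TXT* in IndexContents, plain text output is really all I need, so I dropped the line-two skip and just strip every tag.

## myppt2text.pl (sketch, not in use yet)
#!/usr/bin/perl -w
use strict;

# Run ppthtml on the file swish-e hands us and strip every HTML tag,
# so only plain text reaches the indexer.
my $ppt = shift or die "usage: myppt2text.pl file.ppt\n";

open my $in, '-|', 'ppthtml', $ppt
    or die "can't run ppthtml on $ppt: $!\n";

while ( my $line = <$in> ) {
    $line =~ s/<[^>]*>//g;   # strips the DOCTYPE and every other tag in one pass
    print $line;
}
close $in;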

-------------------------

Now, on to the errors. First of all, the spider didn't complete. The last messages displayed were...

retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/7AStageONS1-4-01.doc (5)...
retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/Inc2postfltrpt_020504.doc (5)...
Bad BBD entry!
Segmentation fault


Another error I saw, which I'm assuming came from the FileFilter itself, NOT SWISH-E, but I'll post it anyway...

retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/BackgroundCheckForm531.pdf (5)...
Error (1064): Missing 'endstream'


Aside from the first one, this is the error that alarms me the most, especially since I had been using the -e option when I ran SWISH-E...

error : Memory allocation failed : growing buffer

I have gnome-terminal set to save 1000 lines, and this error message starts at the top and repeats at least 3/4 of the way down the scrollbar, so I'm guessing it showed around 750 times. After that, it moved on to more files from (I'm assuming) the same site it had been spidering.

-------------------------

I am aware of the ability to merge indices, whereby I could spider individual sites and then merge the results, but it's more convenient to just run a list. My computer might be the problem: it's a PIII-500MHz with 512MB RAM. Disk space shouldn't be a problem, as seen with $ df. Of course, this is after the failed spider, with all of the temporary files sitting on the hda1 partition...

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda1              9470424   6070360   3400064  65% /
/dev/hda2              2252788    207976   2044812  10% /home
/dev/hda5               307056     87156    219900  29% /usr/local
/dev/hda3              1228884    167916   1060968  14% /var

-------------------------

For what it's worth, here's the directory listing for the index's destination folder. I find it odd that the properties file is dramatically larger than the index itself...

$ ls -l
total 10196
-rw-rw-r--  1 user user 10025082 Jul  1 11:36 nmis.index.prop.temp
-rw-rw-r--  1 user user   401084 Jun 30 15:38 nmis.index.temp
drwxrwxr-x  2 user user    12208 Jul  1 11:31 tmp

The tmp folder is odd too. Of course there are
hundreds of swtmploc files in there because the index
was interrupted, but the first two files show...

$ ls -l
total 231020
-rw-rw-r--  1 user user 98409984 Jul  1 11:31 swishspider@20051.contents
-rw-rw-r--  1 user user       34 Jul  1 11:31 swishspider@20051.response
-rw-------  1 user user   262144 Jul  1 10:47 swtmploc0a75KB
-rw-------  1 user user   131072 Jul  1 05:29 swtmploc0okXWz

-------------------------

I'm new to SWISH-E and fairly new to Linux (I've been using it heavily for about three months), which explains why I don't understand these errors. I tried the external spider.pl, but I couldn't figure out how to write a module to convert .ppt files. Also, when I ran spider.pl, the search results pages for .doc files showed weird, random characters that looked like little squares with four characters inside them.
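
In case it helps anyone point me in the right direction, this is the sort of filter_content callback I was trying to put together for spider.pl. I'm going from my (possibly shaky) reading of the spider.pl docs, so the config keys I use here (base_url, email, filter_content), the callback argument order, and the idea that the URI stringifies in a regex are all guesses on my part, not something I have working:

## SwishSpiderConfig.pl (sketch only -- never got this running)
use strict;
use File::Temp qw(tempfile);

our @servers = (
    {
        base_url       => 'http://code200.nmis.tk/',
        email          => 'admin',
        filter_content => \&convert_ppt,
    },
);

# My understanding is that spider.pl calls this for every fetched page,
# passing the URI, the server config hash, the HTTP::Response object,
# and a reference to the raw content -- that's an assumption on my part.
sub convert_ppt {
    my ( $uri, $server, $response, $content_ref ) = @_;

    return 1 unless $uri =~ /\.ppt$/i;      # leave everything else alone

    # ppthtml only reads from a file, so write the fetched bytes out first.
    my ( $fh, $tmp ) = tempfile( SUFFIX => '.ppt' );
    binmode $fh;
    print $fh $$content_ref;
    close $fh;

    my $html   = `ppthtml '$tmp'`;
    my $failed = $?;
    unlink $tmp;
    return 0 if $failed;                    # skip the document if ppthtml failed

    $$content_ref = $html;                  # hand the HTML back to be indexed
    return 1;                               # true means "index this document"
}

1;  # the config is loaded with require/do, so it needs a true value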

I would greatly appreciate any help! I know this is a pretty long message, so I apologize, but I thought it better to be thorough than to fall short. If there's a better way to do what I'm trying to do, I am very much open to suggestions! Thank you for your time!


		
Received on Fri Jul 2 05:27:22 2004