I'm using the internal spider because I (inexperienced
at Perl) found it easier to use FileFilters. I ran
some short tests (Max_Level at 2 or 3) and everything
worked great. Wednesday afternoon I wanted to make an
index of ten websites and I came back this morning to
the following error messages. I hope I have supplied
enough information.
-------------------------
At the command line, I passed...
$ swish-e -c httpfilefilter.conf -S http -v 2 -e
-------------------------
# httpfilefilter.conf
These are all my comments, obviously. And the website
listed below doesn't exist, so don't try it... I
changed the names to protect the innocent :)
# General Information
IndexName "NMIS Index"
IndexDescription "Some NMIS sites."
IndexPointer http://www.nmis.tk/pointer.html
IndexAdmin admin
# Index file destination
IndexFile /swishdb/nmis.index
# This links to the list of web sites
IncludeConfigFile sitelist.conf
# Internal HTTP Access-only directives
#MaxDepth 2
#Delay 1
TmpDir /swishdb/tmp
# "DefaultContents should be used over IndexContents
#  for using the internal spider"
# - too bad it makes the results page on swish.cgi look
#   terrible! It shows the HTML tags!
IndexContents HTML* .htm .html .shtml .php .asp .cfm
IndexContents TXT* .txt .pdf .ppt .doc .rtf
# Do the translations
FileFilter .pdf /usr/bin/pdftotext "'%p' -"
FileFilter .ppt /home/user/swish/myppt2html.sh
FileFilter .doc /usr/local/bin/catdoc "-a -s8859-1 -d8859-1 '%p'"
FileFilter .rtf /usr/bin/strings "'%p'"
# These file extensions will NOT have their contents
# indexed, but the path name will be instead
NoContents .jpg .gif .swf .mov .mpg .avi .mp3
# You've got to save SOME data! for .cgi search
# results page
StoreDescription HTML* <body> 50000
StoreDescription TXT* 50000
PropCompressionLevel 9
# No one searches for these words so, don't index them
IgnoreWords www http a an the of and or
# Make the bullets on swish.cgi work! swishdocpath is
# default; make sure these are in swishcgi.conf
MetaNames swishdocpath swishtitle swishdescription swishlastmodified
--------------------
# sitelist.conf
IndexDir http://code200.nmis.tk
IndexDir http://aar.nmis.tk/
IndexDir http://aribh.nmis.tk/
IndexDir http://atm.nmis.tk/
IndexDir http://deppjhu.nmis.tk/
IndexDir http://exp.gbfy.nmis.tk/
IndexDir http://etth.nmis.tk/
IndexDir http://iss2www.cdcsc.nmis.tk/
IndexDir http://lgis.nmis.tk/
IndexDir http://scifi.bbnc.nmis.tk/
--------------------
I wrote my own FileFilter. Like I said, I'm not a Perl
expert by any means, so the only way I was able to get
the results I wanted was to duplicate what I had done
while testing with the command line. And since you
can't do anything like "FileFilter .ppt command |
command" (pipe), I wrote a simple shell script to do
it. The Perl script was written to remove the HTML
tags from the ppthtml, since I really just want
ppt-to-text. This is my humble attempt at it :)
## myppt2html.sh
#!/bin/sh
# In swish-e's conf file, you can't run pipes, so
# running this script runs the pipe!
ppthtml "$1" | perl myppt2html.pl
## myppt2html.pl
#! /usr/bin/perl -w
use strict;
# ==========================================================
# == removes HTML tags from search results off .ppt files ==
# ==========================================================
my $temp;
my $count = 0;
while ($temp = <STDIN>) {
    $temp =~ s/<\!DOCTYPE HTML PUBLIC \"-\/\/W3C\/\/DTD HTML 4.0\/\/EN\">//g;
    $temp =~ s/<br>//g;
    $temp =~ s/<BR>//g;
    $temp =~ s/<HR>//g;
    $temp =~ s/<HTML>//g;
    $temp =~ s/<TITLE>//g;
    $temp =~ s/<HEAD>//g;
    $temp =~ s/<BODY>//g;
    $temp =~ s/<\/HTML>//g;
    $temp =~ s/<\/TITLE>//g;
    $temp =~ s/<\/HEAD>//g;
    $temp =~ s/<\/BODY>//g;
    # Print every line except the second one (line index 1)
    if ($count != 1) {
        print $temp;
    }
    $count++;
}
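For comparison, the wrapper and the tag stripping could live in one
shell script. This is only a sketch: it assumes ppthtml writes HTML to
stdout (as it does in the pipe above), and the function name strip_tags
is made up. Unlike myppt2html.pl, it removes every tag on a line and
does not drop the second line of output. Tags that span a line break
would slip through.

```shell
#!/bin/sh
# strip_tags (hypothetical name): delete anything that looks like
# an HTML tag from each line of stdin.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Only run the conversion when a file name was passed in,
# which is how FileFilter would call this script.
if [ "$#" -ge 1 ]; then
    ppthtml "$1" | strip_tags
fi
```

It would be hooked up the same way as before, with the FileFilter .ppt
line pointing at this script instead.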
-------------------------
Now, onto the errors. The spider didn't complete,
first of all. The last messages that were displayed
were...
retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/7AStageONS1-4-01.doc (5)...
retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/Inc2postfltrpt_020504.doc (5)...
Bad BBD entry!
Segmentation fault
Another error I saw, which I'm just assuming was the
FileFilter itself, NOT SWISH-E, but I'll post it
anyway...
retrieving http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/BackgroundCheckForm531.pdf (5)...
Error (1064): Missing 'endstream'
Aside from the first one, this other one is the most
alarming to me, especially since I had been using the
-e option when I ran SWISH-E...
error : Memory allocation failed : growing buffer
I have gnome-terminal set to save 1000 lines, and this
error message starts at the top and then repeats at
least three quarters of the way down the scrollbar, so
I'm thinking it showed around 750 times. After that it
started with more files from (I'm assuming) the same
site it had been previously spidering.
-------------------------
I am aware of the ability to merge indices, so I
could spider individual sites and then merge the
results, but it's more convenient to just run a list.
My computer might be the problem: it's a PIII-500MHz
with 512MB RAM. Disk space shouldn't be a problem, as
seen with $ df. Of course, this is after the failed
spider with all of the temporary files located in the
hda1 partition...
Filesystem  1K-blocks     Used  Available  Use%  Mounted on
/dev/hda1     9470424  6070360    3400064   65%  /
/dev/hda2     2252788   207976    2044812   10%  /home
/dev/hda5      307056    87156     219900   29%  /usr/local
/dev/hda3     1228884   167916    1060968   14%  /var
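The merge route mentioned above might look like the following. It is a
sketch, not something that was run for this report. It assumes that -i
and -f override IndexDir and IndexFile from the config, and that -M
takes the input indexes followed by the output file, which is how I
read the SWISH-E docs. The helper names and the per-site file names are
made up. By default it only prints the commands; set RUN=1 to execute
them.

```shell
#!/bin/sh
# Sketch: index each site into its own file, then merge the pieces.
# CONF and OUT_DIR come from the post; per-site names are hypothetical.
CONF=httpfilefilter.conf
OUT_DIR=/swishdb

# Turn a site URL into a file-name-safe index name.
index_name() {
    printf '%s\n' "$1" |
        sed -e 's|^http://||' -e 's|/*$||' -e 's|[^A-Za-z0-9.-]|_|g'
}

# Print the command unless RUN=1 is set, in which case execute it.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Index every site given as an argument, then merge the results.
index_sites() {
    parts=""
    for site in "$@"; do
        out="$OUT_DIR/$(index_name "$site").index"
        run swish-e -c "$CONF" -S http -i "$site" -f "$out" -e
        parts="$parts $out"
    done
    # $parts is left unquoted on purpose so that it splits into
    # one argument per index file.
    run swish-e -M $parts "$OUT_DIR/nmis.index"
}
```

Calling index_sites http://aar.nmis.tk/ http://atm.nmis.tk/ would then
print two indexing commands and one merge command. A crash on one site
would only cost that site's index rather than the whole run.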
-------------------------
For what it's worth, here's the directory information
for the destination folder for the index. I find it
odd that the properties file is dramatically larger
than the index itself...
$ ls -l
total 10196
-rw-rw-r-- 1 user user 10025082 Jul 1 11:36 nmis.index.prop.temp
-rw-rw-r-- 1 user user 401084 Jun 30 15:38 nmis.index.temp
drwxrwxr-x 2 user user 12208 Jul 1 11:31 tmp
The tmp folder is odd too. Of course there are
hundreds of swtmploc files in there because the index
was interrupted, but the first two files show...
$ ls -l
total 231020
-rw-rw-r-- 1 user user 98409984 Jul 1 11:31 swishspider@20051.contents
-rw-rw-r-- 1 user user 34 Jul 1 11:31 swishspider@20051.response
-rw------- 1 user user 262144 Jul 1 10:47 swtmploc0a75KB
-rw------- 1 user user 131072 Jul 1 05:29 swtmploc0okXWz
-------------------------
I'm new to SWISH-E, and fairly new to Linux (I've been
using it extensively for about 3 months), which explains
why I don't understand these errors. I tried the
external spider.pl, but I couldn't figure out how to
write a module to convert .ppt files. Also when I ran
spider.pl, there were these weird, random characters
that would display on the search results pages for
doc files that looked like squares with 4 characters
in them.
I would greatly appreciate any help! I know this is a
pretty long document, so I apologize, but I thought
it's better to be extensive than to fall short. If
there's a better way to do what I'm trying to do, I am
very much open to suggestions! Thank you for your time!
Received on Fri Jul 2 05:27:22 2004