
Description and Title not being parsed properly

From: Patrick Krug <patrick_krug(at)>
Date: Thu Jan 17 2002 - 00:01:48 GMT

I have been playing with Swish-e for a week now and have tried various
things to get it to index my site.  I have checked the HTML on my site
for various problems, corrected some pages, and tested swish-e with them,
but it still does not grab the title or the description.   One problem I
have is that not all the content on the site is mine.  It might look like
it is from the site, but it is grabbed by various means and put on the
site, so I cannot fix all the pages even if there are obvious errors on
them.   I have tried to compile the code on my machine but was never able
to find the iconv.h file (the link on one of the websites is bad).

I have looked at the new examples and used those to index my site.  The only
thing I can get from a page is the title (sometimes), and never a
description of the page.
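For reference, here is a minimal well-formed test page (hypothetical content throughout) that the HTML2 parser should be able to pull a title from, with body text available for a StoreDescription directive to capture:

```
<html>
<head>
<title>Test Page</title>
</head>
<body>
Some descriptive body text that a StoreDescription directive
on the body tag could capture as the page description.
</body>
</html>
```

If a page this simple still indexes without a title or description, the problem is more likely in the configuration than in the site's HTML.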

I have also gotten an exe from Dave Norris that uses the HTML2 parser, but
this has not helped either.

I am attaching my configuration files for inspection.   I have attached a
test configuration file which will only index one page.  If this page is
indexed properly, then I believe a big portion of the site will be indexed
correctly as well.

Thanks in advance.

Patrick Krug


Attachment: newtest.config

# ----- Example 7 - Spider using "http" method -------
#  Please see the swish-e documentation for
#  information on configuration directives.
#  Documentation is included with the swish-e
#  distribution, and also can be found on-line
#  at
#  This example demonstrates how to use
#  the "http" method of spidering.
#  Indexing (spidering) is started with the following
#  command issued from the "conf" directory:
#     swish-e -S http -c example7.config
#  Note: You should have the current Bundle::LWP bundle
#  of perl modules installed.  This was tested with:
#     libwww-perl-5.53
#  ** Do not spider a web server without permission **

# Include our site-wide configuration settings:

IncludeConfigFile bh_allsites.config
IndexContents HTML2 .htm .html .shtml
StoreDescription HTML2 <body> 20
StoreDescription HTML2 <title> 40
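# A hedged aside on the two StoreDescription lines above: as far as I
# understand, swish-e keeps only one stored description per parser type,
# so the second directive may simply replace the first.  A single
# directive over the body is the more common setup (sketch; the length
# of 200 characters is an arbitrary example value):

```
# sketch: store up to 200 characters of <body> text as the description
StoreDescription HTML2 <body> 200
```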
# Specify the name of the index file to create:
IndexFile newtest.index
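# One thing that stands out: with the "http" access method, the starting
# URL (or URLs) to spider is normally given with IndexDir, and no such
# directive appears in this file.  A sketch, with www.example.com standing
# in for the real host:

```
# hypothetical starting point for the spider (replace with the real URL)
IndexDir http://www.example.com/index.html
```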

# If a server goes by more than one name you can use this directive:

#EquivalentServer http://c

# end of example

Attachment: bh_allsites.config

# This defines how many links the spider should
# follow before stopping.  A value of 0 configures the spider to
# traverse all links. The default is 5
# The idea is to limit spidering, but seems of questionable use
# since depth may not be related to anything useful.

MaxDepth 1

# The number of seconds to wait between issuing
# requests to a server.  The default is 60 seconds.

Delay 0

#  Swish-e 2.1 configuration file
#  Pkrug 1/15/2002
#  taken from example 7 - http method

# (default /var/tmp)  The location of a writeable temp directory
# on your system.  The HTTP access method tells the Perl helper to place
# its files there.  The default is defined in src/config.h and depends on
# the current OS.

#TmpDir .

# The "http" method uses a perl helper program to fetch each document
# from the web called "swishspider" and is included in the src directory of
# the swish-e distribution.
#SpiderDirectory ./src
#pkrug 01/14/2002
#FileFilter .pdf pdftotext.exe "%p -"
FileFilter .pdf prog-bin/pdf2html
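# If prog-bin/pdf2html is not found or does not write its output to
# stdout, a pdftotext-based filter is the more common setup documented
# for swish-e (sketch; assumes pdftotext is installed and on the path):

```
# sketch: "%p" expands to the path of the fetched file; "-" sends the
# extracted text to stdout, which is what FileFilter expects
FileFilter .pdf pdftotext "'%p' -"
```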
#IndexContent HTML .pdf
#pkrug 01/14/2002 end

Received on Thu Jan 17 00:03:21 2002