Specified IndexContents HTML but swish still uses HTML2

From: Jon Sorensen <jon(at)not-real.starkmedia.com>
Date: Fri Feb 03 2006 - 16:22:15 GMT
I'm trying to fix a problem with indexing HTML entities. Since libxml2
is installed, character entities are automatically converted.
I want to preserve the entities, so I thought I could set
ConvertHTMLEntities to no and use the internal HTML parser
instead of HTML2. But when I run swish-e it still reports
"Using HTML2 parser". Also, the descriptions are now missing.

Thanks for any help on this!

#################################

IndexFile /www/mysite
IndexDir spider.pl
SwishProgParameters /www/mysite.com/cgi-bin/mysite_english.spider.config

PropertyNames description
PropertyNamesMaxLength 1000 description
MetaNames description keywords swishdocpath swishtitle category

StoreDescription HTML <body> 200000
ConvertHTMLEntities no
DefaultContents HTML
IndexContents HTML .cfm .cfml .htm .html

ExtractPath category regex !^(http://)*[^/]*/([^/]+)/.*$!$2! #get 1st directory name (dir)
IgnoreMetaTags script style

FileFilter .pdf  pdftotext   "'%p' -"
IndexContents HTML* .pdf

ReplaceRules regex  !^(.*\?)(swishlang=[^&]+&*)(.*)?!$1$3!
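
The two regex directives above (ExtractPath and ReplaceRules) can be checked outside swish-e. The following is only a rough sketch: it uses Python's `re` module as a stand-in for the regex engine swish-e uses, so edge-case behavior may differ. The helper names `extract_category` and `strip_swishlang`, and the `id=5` query parameter, are hypothetical and not part of the original setup.

```python
import re

# Patterns copied from the ExtractPath and ReplaceRules lines above.
# Assumption: Python's `re` approximates swish-e's regex handling.
EXTRACT_CATEGORY = re.compile(r"^(http://)*[^/]*/([^/]+)/.*$")
STRIP_SWISHLANG = re.compile(r"^(.*\?)(swishlang=[^&]+&*)(.*)?")


def extract_category(url: str) -> str:
    """Hypothetical helper: apply the ExtractPath substitution ($2)."""
    return EXTRACT_CATEGORY.sub(r"\2", url, count=1)


def strip_swishlang(url: str) -> str:
    """Hypothetical helper: apply the ReplaceRules substitution ($1$3)."""
    return STRIP_SWISHLANG.sub(r"\1\3", url, count=1)


if __name__ == "__main__":
    samples = (
        "http://www.mysite.com/site_map/index.cfm",
        "http://www.mysite.com/index.cfm",
        "http://www.mysite.com/index.cfm?swishlang=english",
        # "id=5" is a made-up parameter, used only for illustration.
        "http://www.mysite.com/landing.cfm?swishlang=english&id=5",
    )
    for url in samples:
        print(url)
        print("  category:", extract_category(url))
        print("  replaced:", strip_swishlang(url))
```

Under these assumptions, the ExtractPath pattern only matches URLs with at least one directory after the host, so top-level pages such as index.cfm are left unchanged by the substitution. The ReplaceRules pattern removes the swishlang parameter but keeps the `?` captured in the first group, so a URL whose only parameter is swishlang ends with a bare `?`. Whether swish-e itself behaves identically is not confirmed here.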


###############################

my %serverA = (
    base_url    => 'http://www.mysite.com/index.cfm?swishlang=english',
    same_hosts  => [ qw/mysite.com/ ],
    email       => 'name@email.com',
    keep_alive  => 0,
    use_md5     => 1,
    max_files   => 5,
    use_cookies => 1,
);

@servers = ( \%serverA, );

###############################

swish-e -c /www/mysite.com/cgi-bin/mysite_english.cfg -S prog -e  -v 3
Parsing config file '/www/mysite.com/cgi-bin/mysite_english.cfg'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from '/www/mysite.com/cgi-bin/mysite_english.spider.config'
http://www.mysite.com/index.cfm?swishlang=english - Using HTML2 parser -  (339 words)
http://www.mysite.com/index.cfm - Using HTML2 parser -  (339 words)
http://www.mysite.com/landing.cfm - Using HTML2 parser -  (264 words)
http://www.mysite.com/site_map/index.cfm - Using HTML2 parser -  (103 words)
/usr/local/lib/swish-e/spider.pl: Max files Reached

Summary for: http://www.mysite.com/index.cfm?swishlang=english
Connection: Close:      6  (0.2/sec)
       Duplicates:    175  (6.7/sec)
   Off-site links:     10  (0.4/sec)
      Total Bytes: 67,224  (2585.5/sec)
       Total Docs:      5  (0.2/sec)
      Unique URLs:      6  (0.2/sec)
http://www.mysite.com/about/index.cfm - Using HTML2 parser -  (313 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 572 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
572 unique words indexed.
6 properties sorted.
5 files indexed.  67,224 total bytes.  1,420 total words.
Elapsed time: 00:00:26 CPU time: 00:00:00
Indexing done!
Received on Fri Feb 3 08:22:19 2006