RE: Problem on Parser with TXT/HTML and Spider.pl

From: Robert Keith <Robert(at)not-real.Technolords.com>
Date: Wed Apr 30 2003 - 03:07:25 GMT
Hi, Thanks for the reply.

I am running spider.pl version:
  # $Id: spider.pl,v 1.43 2002/09/11 00:53:44 whmoseley Exp $

This was the version distributed with Swish:  swish-e-2.2.3.tar.gz

Excuse me for not mentioning that.

------------

When running the command:
perl /fs/area/intellisearch/search/prog-bin/spider.pl prof1.pl | swish-e -S prog -c prof1 -i stdin -v3

With debug => DEBUG_HEADERS  set:

And the URLs in this order:

     1.      base_url => 'http://www.intellivence.com/index.php/',
     2.      base_url => 'http://theweb:access@www.intellivence.com/downloads/',

One of the header records was:

---HEADERS for http://www.intellivence.com/news.php ---
Connection: close
Date: Wed, 30 Apr 2003 10:37:46 GMT
Server: Apache/1.3.27 (Unix) PHP/4.1.2
Content-Language: en-us
Content-Type: text/html
Content-Type: text/html; charset=iso-8859-1
Client-Date: Wed, 30 Apr 2003 02:31:45 GMT
Client-Peer: 66.151.128.102:80
Link: <css/default.css>; rel="stylesheet"
Title: Intellivence News
X-Meta-COPYRIGHT: Copyright 2002 Intellivence Inc., All Rights Reserved.
X-Meta-DESCRIPTION: Intellivence integrates leading-edge data management, knowledge management and search and retrieval products into plug-and-play enterprise solutions.
X-Meta-KEYWORDS: KM, Knowledge management, enterprise, appliance, search, retrieval, data management, portal, enterprise portal, corporate portal, ebusiness, artificial intelligence, directory, parametric search
X-Meta-ROBOTS: ALL
X-Powered-By: PHP/4.1.2
-----END HEADERS----

http://www.intellivence.com/downloads.php - Using HTML parser -  (70 words)

And this works OK.
-------------------------------------------

When using the SwishSpiderConfig.pl (prof1.pl) config with the URLs reversed:


     1.      base_url => 'http://theweb:access@www.intellivence.com/downloads/',
     2.      base_url => 'http://www.intellivence.com/index.php/',

----HEADERS for http://www.intellivence.com/news.php ---
Connection: close
Date: Wed, 30 Apr 2003 10:42:04 GMT
Server: Apache/1.3.27 (Unix) PHP/4.1.2
Content-Language: en-us
Content-Type: text/html
Content-Type: text/html; charset=iso-8859-1
Client-Date: Wed, 30 Apr 2003 02:36:03 GMT
Client-Peer: 66.151.128.102:80
Link: <css/default.css>; rel="stylesheet"
Title: Intellivence News
X-Meta-COPYRIGHT: Copyright 2002 Intellivence Inc., All Rights Reserved.
X-Meta-DESCRIPTION: Intellivence integrates leading-edge data management, knowledge management and search and retrieval products into plug-and-play enterprise solutions.
X-Meta-KEYWORDS: KM, Knowledge management, enterprise, appliance, search, retrieval, data management, portal, enterprise portal, corporate portal, ebusiness, artificial intelligence, directory, parametric search
X-Meta-ROBOTS: ALL
X-Powered-By: PHP/4.1.2
-----END HEADERS----

http://www.intellivence.com/news.php - Using TXT parser -  (1133 words)

Which is not so good.


========================================================

The Swish Config files are:

(prof)
IncludeConfigFile /fs/area/intellisearch/conf/common.config
IndexFile /fs/area/intellisearch/indexfiles/prof1
SwishProgParameters /fs/area/intellisearch/conf/prof1.pl

-----------

And (common.config) is:

WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-
IgnoreFirstChar .-
IgnoreLastChar  .-
BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789
EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789
IndexReport 2
TranslateCharacters :ascii7:
BumpPositionCounterCharacters |.

IndexDir /fs/area/intellisearch/search/prog-bin/spider.pl

StoreDescription HTML* <body> 400
StoreDescription TXT <body> 400
StoreDescription XML <body> 400

DefaultContents HTML
IndexContents HTML .htm .html .php .jsp .doc .xls
IndexContents TXT .txt .conf
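
For reference, here is how I understand the extension-based selection that the IndexContents and DefaultContents lines above describe. This is only a rough Python sketch, not swish-e's actual code: the function name pick_parser is made up for illustration, and the idea that an explicit Document-Type header takes priority over the extension is an assumption on my part.

```python
import posixpath
from urllib.parse import urlparse

# Mirrors the IndexContents / DefaultContents lines from common.config.
INDEX_CONTENTS = {
    "HTML": [".htm", ".html", ".php", ".jsp", ".doc", ".xls"],
    "TXT": [".txt", ".conf"],
}
DEFAULT_CONTENTS = "HTML"


def pick_parser(url, document_type=None):
    """Hypothetical model of parser selection (not swish-e source code)."""
    # Assumption: a Document-Type header supplied with the document wins.
    if document_type:
        return document_type
    # Otherwise look only at the extension of the URL path.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    for parser, extensions in INDEX_CONTENTS.items():
        if ext in extensions:
            return parser
    # No extension matched: fall back to DefaultContents.
    return DEFAULT_CONTENTS
```

Under this model news.php would map to HTML regardless of the order of the base_url entries, so if the model is right, the TXT result would have to come from something other than the extension.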

Regards,
Robert Keith



> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Tuesday, April 29, 2003 4:14 PM
> To: Robert Keith
> Cc: swish-e@sunsite.berkeley.edu
> Subject: Re: [SWISH-E] Problem on Parser with TXT/HTML and Spider.pl
>
>
> On Tue Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> >
> > I am having a strange problem indexing a combination of MSWord, .txt and PHP
> > documents using spider.pl and feeding this into swish-e.  If I index the PHP
> > urls first, the documents are parsed and loaded as HTML.  If I select the
> > MSWord and other documents, which are filtered by the spider.pl filter
> > routines, the MSWord and other documents are parsed as TXT (correctly), then
> > when the subsequent PHP and HTML documents are parsed, they are parsed as
> > TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> > MSWord links, and the URL with only PHP links.
>
> Just to narrow things down, if you save the output from spider.pl to a file
> does it contain the header to set the parser type?  That is, is spider.pl
> adding a
>
>    Document-Type:
>
> header?  I think that code is new, so I'm not sure what you are using.  And if
> so can you check between the two indexing methods if they are set incorrectly?
>
> You can also turn on DEBUG_HEADERS ( debug => DEBUG_HEADERS ) in the config
> and watch what content-type is being returned.
>
> If it's not setting that header then we need to look at how swish is selecting
> the parser (which is based on extension as set by IndexContents and
> DefaultContents).
>
>
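
The check suggested above (save the spider.pl output to a file and look for a Document-Type: header on each document) could be scripted roughly as follows. This is a sketch under assumptions: it assumes the saved stream is a sequence of documents, each with "Name: value" header lines, a blank line, and then exactly Content-Length bytes of content. The function names and the header names other than Document-Type are my assumptions, not something confirmed in this thread.

```python
import sys


def iter_documents(stream):
    """Yield a dict of lower-cased headers for each document in a binary stream."""
    while True:
        line = stream.readline()
        # Skip any blank lines separating documents.
        while line and not line.strip():
            line = stream.readline()
        if not line:
            return  # end of file
        headers = {}
        # Header lines run until the first blank line.
        while line and line.strip():
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        # Skip the document body without interpreting it.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def report(stream):
    """Return (path, document type) pairs; '(none)' when the header is absent."""
    return [
        (h.get("path-name"), h.get("document-type", "(none)"))
        for h in iter_documents(stream)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_types.py <file holding the saved spider.pl output>
    with open(sys.argv[1], "rb") as saved:
        for path, doc_type in report(saved):
            print(f"{doc_type:8} {path}")
```

Running it once for each ordering of the base_url entries and comparing the two listings would show whether the type reported for news.php changes between runs.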
Received on Wed Apr 30 03:11:13 2003