Re: Multiple web sites

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Mar 23 2004 - 18:39:04 GMT
On Tue, Mar 23, 2004 at 10:08:11AM -0800, Lung.Allen wrote:

Please post to the swish-e list, not directly to me.

You are mixing configurations below.  There's a config file for swish-e
that controls how it indexes.  There's also a config file for spider.pl
that controls what gets spidered.

Things like "IndexOnly" and "IndexContents" do not go in the spider
config file.  They are swish-e directives, so they belong in the swish-e
config file.
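
As a rough sketch of that split (the file names spider.conf.pl and
swish.conf are made up for illustration, and the HTML2 parser name
assumes swish-e was built with libxml2), the spider config would hold
only Perl key/value pairs, with the indexing directives moved out:

```perl
# spider.conf.pl (hypothetical name) -- read by spider.pl only.
# Every entry inside the parentheses must be a valid Perl key => value pair.
my %main_site = (
    base_url => 'http://10.10.10.10/',
    email    => 'allen.lung@ftb.ca.gov',
);

my %news_site = (
    base_url => 'http://10.10.10.11/doc',
    email    => 'allen.lung@ftb.ca.gov',
);

# spider.pl looks for the list of servers in @servers.
@servers = ( \%main_site, \%news_site );
1;

# The indexing directives go in the swish-e config file instead
# (hypothetical name: swish.conf), which is a separate, non-Perl file:
#
#   IndexDir            spider.pl
#   SwishProgParameters spider.conf.pl
#   IndexContents       HTML2 .htm .html .asp
```

Check the swish-e config documentation for the exact directives your
version supports before relying on this layout.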

The config you have below does not do any filtering.  You will be
passing binary files to swish-e.
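
If the goal is only to index HTML pages, one option is to have the
spider skip binary files instead of fetching them.  A minimal sketch,
assuming the test_url callback described in the spider.pl documentation
(the extension list here is only a guess at what your site serves):

```perl
# Sketch only: skip URLs whose path ends in a known binary extension.
# spider.pl calls test_url with a URI object and the server hash;
# returning true means "go ahead and fetch this URL".
my $skip_binary = sub {
    my ( $uri, $server ) = @_;
    return $uri->path !~ /\.(?:doc|xls|ppt|pdf|zip|gif|jpe?g|png)$/i;
};

my %main_site = (
    base_url => 'http://10.10.10.10/',
    email    => 'allen.lung@ftb.ca.gov',
    test_url => $skip_binary,
);

@servers = ( \%main_site );
1;
```

This only avoids the binary files.  It does not convert them, so Word
documents would be left out of the index rather than filtered.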

Can you take some time and re-read the INSTALL documentation?  It might
clear some things up.  Then read the spider.pl documentation about
filtering.

Unfortunately, there's more than one way to filter, which makes things
more complicated when first starting out.  Try skipping down in
spider.pl's doc to the section about filters, and look at the
SwishSpiderConfig.pl example configuration to see how to enable
SWISH::Filter to convert your Word documents to a format that swish-e
can process.

Spider.pl will use SWISH::Filter when using its default configuration,
but when you create your own spider config file you also need to
tell it to filter.  SwishSpiderConfig.pl has examples.

> 
>  libxml2-2.6.7.tar.gz
> 
> This is the package I installed.  I'm trying this next:
> 
> my %main_site = (
>             base_url   => 'http://10.10.10.10/',
>             email      => 'allen.lung@ftb.ca.gov',
>             IndexOnly .htm .html .asp
>             IndexContents HTML
>         );
> 
> 
>         my %news_site = (
>             base_url   => 'http://10.10.10.11/doc',
>             email      => 'allen.lung@ftb.ca.gov',
>             IndexOnly .htm .html .asp
>             IndexContents HTML
>         );
> 
>         @servers = ( \%main_site, \%news_site );
>         1;
> 
> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Tuesday, March 23, 2004 9:40 AM
> To: Lung.Allen
> Cc: Multiple recipients of list
> Subject: Re: Multiple web sites
> 
> 
> On Tue, Mar 23, 2004 at 09:15:51AM -0800, Lung.Allen wrote:
> > Warning: Substituted 6741 embedded null character(s) in file 'http://10.10.10.10/ghgmemo.doc' with a newline
> 
> http://swish-e.org/Discussion/search/swish.cgi?query=embedded+null+character
> 
> Looks like you are trying to index binary data.
> 
> BTW -- are you using the libxml2 parser? (i.e. using
> DefaultContents/IndexContents HTML2?)
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 
> 

-- 
Bill Moseley
moseley@hank.org
Received on Tue Mar 23 10:39:05 2004