At 05:39 AM 01/24/02 -0800, Rich Thomas wrote:
>I've included my config file and a sample of results.
Thanks!
>Why do I get Null
>titles and no descriptions?
Titles work in my test below, so I'd need to see your CGI script to say
why you get null titles. The descriptions are missing because your document
has no <body> tag, and the basic HTML parser is not smart enough to fix
broken HTML.
> head rich.html
<title> E/E/F/8/403 University at Buffalo Libraries Web Catalog</title> <br>
<h3>
United States. Bureau of Land Management.</h3> <br>
<h3>
BLM Wyoming fishing opportunities / United States Department of the
Interior, Bureau of Land Management.</h3> <br>
<h3>
Wyoming fishing opportunities</h3> <br>
<h3>
Title within map border: Fishing opportunities [place] Wyoming</h3> <br>
> cat rich.conf
StoreDescription HTML <body> 5000
> ./swish-e -c rich.conf -i rich.html -v 0
Indexing Data Source: "File-System"
Indexing done!
> ./swish-e -w horn -p swishdescription -H0
1000 rich.html "E/E/F/8/403 University at Buffalo Libraries Web Catalog"
1630 ""
.
See, we are getting the title, but not the description, because there's no
<body> tag. Now let's get some help from libxml2 (because it will attempt
to fix your HTML):
> cat rich.conf
StoreDescription HTML2 <body> 5000
DefaultContents HTML2
> ./swish-e -c rich.conf -i rich.html -v 0
Indexing Data Source: "File-System"
Indexing done!
> ./swish-e -w horn -p swishdescription -H0
1000 rich.html "E/E/F/8/403 University at Buffalo Libraries Web Catalog"
1630 "United States. Bureau of Land Management. BLM Wyoming fishing
opportunities /
..
There, libxml2 came to the rescue.
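
If you want to find files with this problem before indexing, here is a
minimal sketch in Python. It is not part of swish-e and does not reproduce
how either swish-e parser works; it only reports whether a document contains
a <body> start tag. The names BodyTagFinder and has_body_tag are made up for
this example.

```python
from html.parser import HTMLParser


class BodyTagFinder(HTMLParser):
    """Records whether a <body> start tag was seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.has_body = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> matches too.
        if tag == "body":
            self.has_body = True


def has_body_tag(html_text):
    """Return True if html_text contains a <body> start tag."""
    finder = BodyTagFinder()
    finder.feed(html_text)
    finder.close()
    return finder.has_body


# A fragment shaped like rich.html: a title and headings, but no <body>.
sample = (
    "<title> E/E/F/8/403 University at Buffalo Libraries Web Catalog</title> <br>\n"
    "<h3>\nWyoming fishing opportunities</h3> <br>\n"
)

print(has_body_tag(sample))
print(has_body_tag("<html><body><h3>ok</h3></body></html>"))
```

A file that reports False here is one where "StoreDescription HTML <body>"
would have nothing to anchor on with the basic parser.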
>How do I force swish-e not to follow all links when using the http method?
>Is this even possible?
Use robots.txt, the standard robots exclusion mechanism.
Too bad FileRules doesn't work on URLs.
I believe I've dropped hints all over the 2.1 docs that I'm not a fan of
the -S http method.
If you use -S prog with the spider.pl program you have full control over
spidering. You can use robots.txt, the <META> robots exclusion tags in
each document, Perl regular expressions, or anything else you can imagine
to control what is spidered.
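
To illustrate the kind of per-URL decision involved, here is a rough sketch.
It is written in Python, not Perl, and it is not spider.pl's code or
configuration; it just combines a robots.txt check with a regular expression
test. The robots.txt contents, the example.com URLs, the "example-spider"
agent name, and the function names are all assumptions for the example.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for the site being spidered.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

# Hypothetical extra rule: skip images and PDFs by file extension.
SKIP_PATH = re.compile(r"\.(?:gif|jpe?g|png|pdf)$", re.IGNORECASE)


def make_url_filter(robots_text, agent="example-spider"):
    """Build a function that says whether a URL should be spidered."""
    robots = RobotFileParser()
    robots.parse(robots_text.splitlines())

    def should_spider(url):
        # First honor the robots exclusion rules.
        if not robots.can_fetch(agent, url):
            return False
        # Then apply the regular expression to the URL path.
        return SKIP_PATH.search(urlparse(url).path) is None

    return should_spider


should_spider = make_url_filter(ROBOTS_TXT)

for url in (
    "http://example.com/docs/index.html",
    "http://example.com/private/notes.html",
    "http://example.com/docs/map.pdf",
):
    print(url, should_spider(url))
```

The point is only that each link gets an explicit yes or no before it is
fetched, which is the sort of control the -S http method doesn't give you.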
--
Bill Moseley
mailto:moseley@hank.org
Received on Thu Jan 24 14:28:07 2002