At 11:54 AM 11/19/2001 +0000, Julian Perry wrote:
>The thing that actually fixed the problem was
>forcing use of the libxml2 parser.
I used the HTML you posted and your config file with the built-in HTML
parser and it worked.
> ./swish-e -f jindex -w not dkdk -p description
# SWISH format: 2.1-dev-24
# Search words: not dkdk
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.005 seconds
1000 j.html "Title" 437 "UK based Wine Shop offering 1000's of wines for
delivery world-wide. Award winning Web site with wine games, quiz, and
extensive back-ground information."
.
So, perhaps there's something else in that HTML doc that's causing
problems. Please send complete examples.
>I guess I'm
>happy with the fix, but surely it should have
>worked with the other parser (in the ideal world
>that we all live in)!
If the other parser worked perfectly I wouldn't have spent time adding
libxml2 to swish. Something fun for everyone is to index your documents
using both parsers, then use swish's -T index_words_only on both indexes
and run diff to see the differences.
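
A rough sketch of that comparison, assuming the output of each
-T index_words_only run has already been saved to a text file and that the
words in it are separated by whitespace (the real dump layout may differ, so
adjust the parsing to match). The file names and function names here are made
up for illustration and are not part of swish.

```python
import sys


def load_words(path):
    """Read a saved word dump and return the set of tokens in it.

    Assumption: words are whitespace-separated; the real dump may
    carry extra header lines that need to be filtered out first.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        return set(fh.read().split())


def compare_word_lists(words_a, words_b):
    """Return (only_in_a, only_in_b) as two sorted lists."""
    set_a, set_b = set(words_a), set(words_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical usage:
#   python compare_words.py words_builtin.txt words_libxml2.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    only_a, only_b = compare_word_lists(
        load_words(sys.argv[1]), load_words(sys.argv[2])
    )
    for word in only_a:
        print("< " + word)  # indexed only by the first parser
    for word in only_b:
        print("> " + word)  # indexed only by the second parser
```

Words that show up in only one list point at markup the two parsers handle
differently, which is usually a good place to start looking.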
>I've tried that, and I've got a pretty recent
>version of perl5 (5.6.1) and I've loaded all
>the modules that seem to be required - but I
>still can't get it running:
> Name "HTML::Tagset::linkElements" used only once: possible typo at
./swish-spider.pl line 503.
>
>and then a bunch of:
> Use of uninitialized value in hash element at ./swish-spider.pl line 509.
> Use of uninitialized value in hash element at ./swish-spider.pl line 509.
> Use of uninitialized value in hash element at ./swish-spider.pl line 509.
>
>Any thoughts?
Not really. I have a different spider.pl than you seem to have. If you
can use CVS then we can use the same version and find the problem. I just
installed new everything on a new machine (including a new perl, and all
the modules) and it went without a problem. I also can't offer any help
without knowing your config, and the commands you are using. It's a lot
easier, obviously, if I can duplicate your problem.
>Another problem, and I've been looking at this
>on 2.1-dev-24 because it's been a long-standing
>problem with 1.3.2, I get a Bus Error from swish
>when building an index.
>
>Can you suggest the best set of command line
>options to help debug this? Failing that I
>guess I'll be looking at running under GDB.
>It fails towards the end, as I remember.
gdb is the way to go, I'd think. Can you put together a few documents that
demonstrate the problem?
>I used to get problems when there were invalid
>characters in HREF's - i.e. single quotes, is
>swish particularly sensitive to things like
>that?
Not by design. Unlikely with libxml2, but hard to say without seeing an
example. When I first started using libxml2 I had some problems with a few
HTML issues -- mostly with it hanging when swish tried to abort processing
in the middle of a doc. AFAIK, that's all been fixed in current versions
of libxml2.
So, see if you can get together an example document and a config file that
demonstrates the problem you are having.
http://www.swish-e.org/2.2/docs/INSTALL.html#When_posting_please_provide_the_
Thanks,
Bill Moseley
mailto:moseley@hank.org
Received on Mon Nov 19 14:26:32 2001