Re: Spidering phpBB

From: Shaffer, Chris <Chris.Shaffer(at)not-real.BellSouth.COM>
Date: Tue Aug 31 2004 - 20:05:33 GMT
Because phpBB's search engine, in my opinion, is really crummy...  For
those who don't know, here's how phpBB's search 'engine' works:

1.) When a post is made, all common language words are stripped out, as
well as all words that have already been indexed.
2.) What's left is put into a table, one record for each new word.
3.) An intersection table is then updated with a row pairing each
non-common word with the post number.

That makes phrase and context searching impossible.
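
To make that concrete, here is a rough Python sketch of the scheme as
described above. The stop-word list, the dict/set "tables" and the
function names are all made up for illustration; this is not phpBB's
actual schema or code. The point is only that the index stores
(word, post) pairs with no word positions, so a query can only ask
"which posts contain all of these words?".

```python
import re

# Hypothetical stop-word list; phpBB's real list is not reproduced here.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

word_ids = {}         # word -> word id (one record per distinct word)
word_matches = set()  # (word id, post id) pairs: the intersection table


def tokenize(text):
    """Lower-case the text and return its distinct non-common words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def index_post(post_id, text):
    for word in tokenize(text):
        # Only words not seen before get a new record in the word table.
        word_id = word_ids.setdefault(word, len(word_ids) + 1)
        # Word order and position are thrown away at this point.
        word_matches.add((word_id, post_id))


def search(query):
    """Return ids of posts containing every non-common query word."""
    results = None
    for term in tokenize(query):
        word_id = word_ids.get(term)
        posts = {p for (w, p) in word_matches if w == word_id}
        results = posts if results is None else results & posts
    return results or set()


index_post(1, "The spider crawls the forum")
index_post(2, "The forum spider is crawling in circles")
```

With those two posts indexed, search("forum spider") matches both of
them, even though the words only appear next to each other in that
order in post 2. Nothing in the tables can tell the two apart.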

And yes, I'd love to be able to search our intranet site (including
forums) from one form...  That would be sweet...

As far as my problem crawling the forums...  I think I know what is
going on...  The session_id is changing occasionally, causing the
spider to go in circles...  Is there any way I can filter out something
matching 'sid=....' from the end of the path before the spider decides
whether or not it's crawled it yet?
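
Roughly what I have in mind, sketched in Python just to show the idea
(the SWISH-E spider itself is Perl, and I am not assuming anything
about its hooks here). The names strip_sid, should_crawl and seen are
made up, and the intranet.example URLs in the note below are
placeholders: drop the 'sid' parameter, then use what's left as the
"have I been here" key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url):
    """Return the URL with any 'sid' query parameter removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sid"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


seen = set()  # normalized URLs that have already been crawled


def should_crawl(url):
    """True the first time a URL is seen, ignoring its session id."""
    key = strip_sid(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

So should_crawl("http://intranet.example/forum/viewtopic.php?t=42&sid=abc")
would be true the first time, and the same topic with a different sid
would be skipped after that.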

> The problem is, the page content changes constantly, due to a 'Members
> Online' section and a time stamp.  Also, you are correct that there
> are
BTW: why are you trying to index a phpBB2 forum?  It has its own Search

There might be something you could do at the PHP-level to tie phpBB2's
search results with those from SWISH-E (I am assuming you are trying to
have searches for both your website and the forums in one location??).


