Re: Spidering phpBB

From: Shaffer, Chris <Chris.Shaffer(at)not-real.BellSouth.com>
Date: Tue Aug 31 2004 - 15:03:04 GMT
Some interesting thoughts...  It'll take some investigation...

The problem is, the page content changes constantly, due to a 'Members
Online' section and a time stamp.  Also, you are correct that there are
many different ways to access the same data...
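
One way to keep the constantly changing parts from making every fetch
look like a new page is to strip them before comparing content. The
sketch below is only an illustration, not part of swish-e: the
"members-online" comment markers and the timestamp format are
hypothetical stand-ins, since the real phpBB markup is not shown in this
thread and would have to be checked against the actual pages.

```python
import hashlib
import re

# Hypothetical patterns (assumptions): adjust to the forum's real markup.
VOLATILE_PATTERNS = [
    # A "Members Online" block, assumed here to sit between comment markers.
    re.compile(r"<!-- members-online -->.*?<!-- /members-online -->", re.DOTALL),
    # A timestamp, assumed here to look like "2004-08-31 15:03" or "2004-08-31 15:03:04".
    re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}(?::\d{2})?"),
]


def content_fingerprint(html: str) -> str:
    """Hash the page after removing volatile sections and collapsing whitespace."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two fetches of the same topic should then produce the same fingerprint
even if the member list and time stamp differ, so a page already seen
can be skipped.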

I'll play some more with the filtering of page names, and let you know.

Thanks for the advice.

Chris Shaffer
Application Developer, BSTCAD/BSTProcess
BSTCAD Support Forums
chris.shaffer@bellsouth.com
(404) 927-1227


-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
Sent: Tuesday, August 31, 2004 10:25 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Spidering phpBB


On Tue, Aug 31, 2004 at 07:14:42AM -0700, Shaffer, Chris wrote:
> I do...  How would swish know which page to go to, though?

Maybe it's not that easy.  From a *very* quick look it seems like content
is organized by topic:

http://www.phpbb.com/phpBB/viewtopic.php?t=134922

So, assuming there's a topic table that links to articles, maybe you
could index all the articles for a given topic under that topic id. Then
search results would point to the topic.

That's all guessing, but maybe something like that would work.
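
As a rough sketch of that guess: if the posts can be read out as
(topic id, text) pairs, they could be merged into one document per
topic, keyed by the topic URL, and handed to the indexer. The function
name, the input shape and the base URL argument are all assumptions made
for illustration; the actual phpBB schema is not described here.

```python
from collections import defaultdict


def build_topic_documents(posts, base_url):
    """Merge posts into one document per topic.

    posts    -- iterable of (topic_id, post_text) pairs (assumed input shape)
    base_url -- URL of viewtopic.php without a query string
    Returns a dict mapping each topic URL to the combined text of its posts.
    """
    grouped = defaultdict(list)
    for topic_id, text in posts:
        grouped[topic_id].append(text)
    return {
        f"{base_url}?t={topic_id}": "\n\n".join(texts)
        for topic_id, texts in grouped.items()
    }
```

Search results built from these documents would then point at the topic
rather than at individual posts.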

Otherwise, I'm not sure why spidering is looping.  Do you have a small
or test phpBB setup that you can test with?  The problem may be just
that there are too many different ways to access the same data -- or
just too many dynamically created links in general and it's taking too
much time to visit all of them.  You might just need to restrict which
types of URLs you will follow when spidering, for example by making sure
there's only a "t" parameter with a numeric value and ignoring all the
other links.
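
The URL restriction described above could look something like the
following. This is a standalone sketch of the filtering logic, not the
swish-e spider's own configuration interface; the function name and the
requirement that the path end in viewtopic.php are assumptions based on
the example URL in this message.

```python
import re
from urllib.parse import parse_qs, urlsplit


def should_follow(url: str) -> bool:
    """Accept only viewtopic.php URLs whose sole query parameter is a numeric "t"."""
    parts = urlsplit(url)
    if not parts.path.endswith("/viewtopic.php"):
        return False
    params = parse_qs(parts.query, keep_blank_values=True)
    # Reject anything carrying extra parameters (session ids, sort order, ...).
    if set(params) != {"t"}:
        return False
    values = params["t"]
    return len(values) == 1 and re.fullmatch(r"[0-9]+", values[0]) is not None
```

With this check, http://www.phpbb.com/phpBB/viewtopic.php?t=134922 is
followed, while the same topic reached with extra parameters is ignored,
which cuts down the number of different URLs leading to the same data.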

-- 
Bill Moseley
moseley@hank.org

Received on Tue Aug 31 08:04:28 2004