Re: Very long time between node failure and reassigning of regions.

Su Yen Liang Mon, 26 Apr 2010 10:47:48 -0700

If you use the ZooKeeper embedded in HBase, two settings in hbase-site.xml
are related to this problem:
*zookeeper.session.timeout* and *hbase.zookeeper.property.tickTime*
I set the timeout to 4000 (milliseconds), so the session times out after 4 seconds.
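
As a rough sketch, the two properties might look like this in
hbase-site.xml. The 4000 value is the one mentioned above. The tickTime
value of 2000 is only an assumption for illustration, chosen because
ZooKeeper by default does not grant a session timeout shorter than
2 x tickTime, so a larger tickTime would likely raise the effective timeout
above 4 seconds. Check both against your own cluster before using them.

```xml
<configuration>
  <!-- Session timeout requested from ZooKeeper, in milliseconds.
       4000 is the value from this message (4 seconds). -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>4000</value>
  </property>
  <!-- tickTime passed to the HBase-managed ZooKeeper, in milliseconds.
       2000 is an assumed, illustrative value: with ZooKeeper's default
       bounds the minimum session timeout is 2 x tickTime, so 2000
       still allows a 4000 ms session. -->
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>2000</value>
  </property>
</configuration>
```

A short session timeout makes the master notice a dead region server
sooner, but it also makes a long GC pause on a healthy region server look
like a failure, so treat these numbers as a starting point only.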


In my tests, *hbase.master.lease.period* does not affect the time between
failure and reassignment.

2010/4/27 Edward Capriolo <[email protected]>

> 2010/4/26 Michał Podsiadłowski <[email protected]>
>
> > Hi hbase users,
> >
> > During our tests in the production environment we found a few really big
> > problems that stop us from using HBase. The first major problem is
> > availability: we now have 6 region servers + 2 masters + 3 ZK. When
> > we shut down one region server normally, it takes about 3-4 minutes or
> > longer, depending on previous load, until the master reassigns the missing
> > regions to the live region servers. Each region server usually has fewer
> > than 100 regions. In the master logs we can see some log splitting, then a
> > long break, and then the start of reassignment, which can also take a long
> > time, especially when the cluster is under load. This is far longer than we
> > can wait, because during that time requests to the website are not processed.
> > Another very unfortunate situation happened when my friend shut down
> > 3 out of 6 nodes: the master started to do the job, but something went
> > terribly wrong and it started to throw NPEs like mad.
> > Here is the beginning of the disaster: http://pastebin.com/1uh1x1fL . After
> > we killed this server, the second one picked up and managed to start, but
> > with only 91 out of 306 regions and only after a long time.
> > Another big problem is that table connections in some circumstances
> > hang with no error thrown. The web servers' request-processing thread pool
> > quickly runs out of threads, no requests are processed, and the watchdog
> > kills the server.
> >
> >
> > For those who want more reading: http://pastebin.com/UaEPT6nc is the
> > master log from the beginning of the test,
> > and the second master log is http://pastebin.com/shpcDWBn
> >
> >
> > Any help appreciated.
> > Thanks, Michal
> >
>
> I noticed our region failovers take around 10-30 seconds, but we did not
> have very high load at the time.
>
> As for the client, we noticed this too. If something fails in the HBase
> stack (ZooKeeper, region, etc.), the connections never seem to time out. We
> would end up with many web server threads waiting and hanging on HBase that
> were never going to recover. I think there are many cases where clients
> never time out. Sorry for a vague, unsubstantiated statement like that (with
> no stack trace).
>



-- 
Sincerely
Yen-Liang, Su
www.xpsteven.com
