If you use the ZooKeeper embedded in HBase, two settings in hbase-site.xml are related to this problem: *zookeeper.session.timeout* and *hbase.zookeeper.property.tickTime*. I set the timeout to 4000, so the session timeout is 4 seconds.
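As a rough sketch, the two properties could look like this in hbase-site.xml. The 4000 is the value mentioned above. The tickTime of 2000 is only an assumed value for illustration: by default ZooKeeper keeps the negotiated session timeout between 2x and 20x the tick time, so a larger tick would raise a requested 4000 ms timeout.

```xml
<configuration>
  <!-- Session timeout requested from ZooKeeper, in milliseconds (4 seconds). -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>4000</value>
  </property>
  <!-- ASSUMED value, not taken from the original message.
       With a 2000 ms tick the default minimum session timeout is
       2 x 2000 = 4000 ms, so the 4000 ms request is not raised. -->
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>2000</value>
  </property>
</configuration>
```

A short session timeout speeds up failure detection, but it also makes a region server more likely to lose its session during a long GC pause, so the values are worth testing under load.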
In my test, *hbase.master.lease.period* does not affect the time between failure and reassignment.

2010/4/27 Edward Capriolo <[email protected]>

> 2010/4/26 Michał Podsiadłowski <[email protected]>
>
> > Hi hbase users,
> >
> > during our tests on a production environment we found a few really big
> > problems that stop us from using hbase. The first major problem is
> > availability: we now have 6 region servers + 2 masters + 3 zk. When
> > we shut down one region server normally, it takes about 3-4 minutes or
> > longer, depending on previous load, until the master reassigns the
> > missing regions to live rs. On the region servers there are usually
> > fewer than 100 regions. In the master logs we can see some log
> > splitting, then a long break, and then the start of reassigning, which
> > can also take a long time, especially when the cluster is under load.
> > This is far longer than we can wait, because during that time requests
> > to the website are not processed.
> > An additional, very unfortunate situation happened when my friend shut
> > down 3 out of 6 nodes: the master started to do the job, but something
> > went terribly wrong and it started to throw NPEs like mad.
> > Here is the beginning of the disaster: http://pastebin.com/1uh1x1fL
> > After we killed this server, the second one picked up and managed to
> > start, but with only 91 out of 306 regions and after a long time.
> > Another big problem is that table connections in some circumstances
> > hang with no error thrown. The web servers' request-processing thread
> > pool quickly runs out of threads, no requests are processed, and the
> > watchdog kills the server.
> >
> >
> > For those who want more reading: http://pastebin.com/UaEPT6nc master
> > log from the beginning of the test
> > and the second master log: http://pastebin.com/shpcDWBn
> >
> >
> > Any help appreciated.
> > Thanks, Michal
> >
>
> I noticed our region failovers are around 10 - 30 seconds but we did not
> have very high load at the time.
>
> As for the client: we noticed this too.
> If something fails in the hbase stack (zookeeper, region, etc.), the
> connections never seemed to time out. We would end up with many webserver
> threads waiting and hanging on hbase that were never going to recover.
> I think there are many cases where clients never time out. Sorry for a
> vague, unsubstantiated statement like that (with no stack trace).
>

--
Sincerely
Yen-Liang, Su
www.xpsteven.com
