Hi,
cluster details:
hbase 0.90.2. 10 machines. 1GB switch.
use-case
M/R job that inserts about 10 million rows to hbase in the reducer, followed
by M/R that works with hdfs files.
When the first job maps finish the second job maps starts and region server
crushes.
please note, that when running the 2 jobs separately they both finish
successfully.
>From our monitoring we see that when the 2 jobs work together the network
load reaches to our max bandwidth - 1GB.
In the region server log we see these exceptions:
a.
2011-08-14 18:37:36,263 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@491fb2f4)
from 10.11.87.73:33737: output error
2011-08-14 18:37:36,264 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 24 on 8041 caught: java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1387)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
b.
2011-08-14 18:41:56,225 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception for block
blk_-8181634225601608891_579246java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)
c.
2011-08-14 18:42:02,960 WARN org.apache.hadoop.hdfs.DFSClient: Failed
recovery attempt #0 from primary datanode 10.11.87.72:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_-8181634225601608891_579246 is already commited, storedBlock == null.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4877)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:501)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.nextGenerationStamp(Unknown Source)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1577)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy9.recoverBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
Few questions:
1. Can we configure hadoop/hbase not to consume all network resources (e.g.,
to specify upper limit for map/reduce network load)?
2. Should we increase the timeout for open connections ?
3. Can we assign different IPs for data transfer and region quorum check
protocol (zookeeper) ?
Thanks,
Lior