Hi,
today, on one of our production clusters (not under extreme stress recently, and
limited in scale to only four nodes), I found that the fsimage and all of the
name node metadata are missing, causing the scheduled MR jobs to fail. However,
the name node log gives little useful troubleshooting information, nor any clue
that could lead to this disaster. Although it is easy enough to rebuild the
metadata or recover from a checkpoint, I am curious what could possibly cause
such a severe situation, since we have never seen this problem on our other
setup (identical hardware, but with less load applied).
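To catch this earlier next time, I am thinking of scripting a quick sanity check
that the name directory still looks formatted. A minimal sketch — the
VERSION/fsimage/edits layout under current/ is the Hadoop 1.x one, and the paths
here are throwaway placeholders, not our real dfs.name.dir:

```python
import os
import tempfile

def name_dir_looks_formatted(name_dir):
    """Return True if a dfs.name.dir still holds the files a formatted
    name node keeps under current/ (Hadoop 1.x layout)."""
    current = os.path.join(name_dir, "current")
    required = ("VERSION", "fsimage", "edits")
    return all(os.path.isfile(os.path.join(current, f)) for f in required)

# Demo against a throwaway directory standing in for the real name dir:
d = tempfile.mkdtemp()
print(name_dir_looks_formatted(d))   # no current/ yet -> False

os.makedirs(os.path.join(d, "current"))
for f in ("VERSION", "fsimage", "edits"):
    open(os.path.join(d, "current", f), "w").close()
print(name_dir_looks_formatted(d))   # -> True
```

Cron-ing something like this against the real dfs.name.dir would at least have
alerted us before the scheduled jobs started failing.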
Our name node runs on RAID 1, and we see no hardware alarms from the remote
console or the system log. The only warning I found is the '/getimage:
java.io.IOException: GetImage failed. java.lang.NullPointerException' exception
from two days ago; could this be related to the corruption?
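On the mitigation side, since the image lived only on the RAID 1 volume, we will
probably also mirror it: as far as I know dfs.name.dir accepts a comma-separated
list of directories and the name node writes the image to each of them. The NFS
path below is a placeholder, not our actual mount:

```xml
<!-- hdfs-site.xml: keep a second copy of the image on a remote mount -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nfs/namenode</value>
</property>
```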
Could an expert kindly shed some light on this issue? Thanks.
Cheers,
Jason
- name node fails to start:
12/07/09 10:49:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
12/07/09 10:49:57 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 INFO util.GSet: VM type = 64-bit
12/07/09 10:49:57 INFO util.GSet: 2% max memory = 2.45375 MB
12/07/09 10:49:57 INFO util.GSet: capacity = 2^18 = 262144 entries
12/07/09 10:49:57 INFO util.GSet: recommended=262144, actual=262144
12/07/09 10:49:57 INFO namenode.FSNamesystem: fsOwner=hdfs
12/07/09 10:49:57 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/09 10:49:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/09 10:49:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
12/07/09 10:49:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/09 10:49:57 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
12/07/09 10:49:57 ERROR namenode.NameNode: java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
- earlier log messages (two days before):
2012-07-07 00:03:35,463 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.70.18
2012-07-07 00:03:35,465 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: *DIR* NameNode.reportBadBlocks
2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1866785886679796114 added as corrupt on 192.168.70.19:50010 by /192.168.70.19
2012-07-07 00:05:57,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.20:50010 to replicate blk_-5834163689094324395_5939 to datanode(s) 192.168.70.19:50010 192.168.70.21:50010
2012-07-07 00:05:57,851 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.19:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:05:57,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.21:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:06:00,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.21:50010 to replicate blk_-269424213420340184_5940 to datanode(s) 192.168.70.19:50010