I was able to resolve this. Here are the details in case anyone else has a similar problem:
Note: I noticed edits were still being recorded by the journal nodes, so the data was probably always safe, but I went ahead and fixed the namenode without a restart to confirm that was doable.

Looking at the code, I learned that namenode storage directories can be marked failed, and that there's a JMX field for observing that:

$ curl -s http://my.namenode:50070/jmx | grep NameDirStatus
    "NameDirStatuses" : "{\"active\":{},\"failed\":{\"/opt/hdfs-data/dfs/name\":\"IMAGE_AND_EDITS\"}}"

That was clearly the problem. There's a command that sets an internal flag telling the namenode to attempt to reactivate failed storage directories during certain events:

$ hdfs dfsadmin -restoreFailedStorage true

I then waited an hour to see whether a checkpoint would occur and trigger an attempt to restore the failed storage directory. It never did, although the docs suggest it should have. I then noticed that the saveNamespace() code path also includes an attempt to restore failed storage directories, and that it can be triggered with "hdfs dfsadmin -saveNamespace". I put the cluster in safe mode and ran that command. Success. From the namenode log:

2021-08-10 14:25:47,384 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ...
2021-08-10 14:25:47,385 INFO org.apache.hadoop.hdfs.server.common.Storage: NNStorage.attemptRestoreRemovedStorage: check removed(failed) storage. removedStorages size = 1
2021-08-10 14:25:47,385 INFO org.apache.hadoop.hdfs.server.common.Storage: currently disabled dir /opt/hdfs-data/dfs/name; type=IMAGE_AND_EDITS ;canwrite=true
2021-08-10 14:25:47,385 INFO org.apache.hadoop.hdfs.server.common.Storage: restoring dir /opt/hdfs-data/dfs/name ...
2021-08-10 14:25:47,966 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: New namespace image has been created

And:

$ curl -s http://my.namenode:50070/jmx | grep NameDirStatus
    "NameDirStatuses" : "{\"active\":{\"/opt/hdfs-data/dfs/name\":\"IMAGE_AND_EDITS\"},\"failed\":{}}",

I was then able to bootstrap and start a standby namenode and watch the active namenode record edits to disk again.

On Mon, Aug 9, 2021 at 3:44 PM Whitney Jackson <[email protected]> wrote:

> Hi,
>
> My cluster is up and running after its two namenodes ran out of disk
> space. It's mostly happy except that the currently active namenode isn't
> recording edits to disk. I don't see any modifications to the
> edits_inprogress file, and no new fsimage files are being recorded.
>
> If I enter safe mode and try to run "hdfs dfsadmin -saveNamespace" I get:
>
> java.io.IOException: No image directories available!
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1218)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1162)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1132)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveNamespace(FSNamesystem.java:4494)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.saveNamespace(NameNodeRpcServer.java:1270)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.saveNamespace(ClientNamenodeProtocolServerSideTranslatorPB.java:873)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
>
> The standby namenode won't start, and attempts to bootstrap it fail with
> "Could not find image with txid ...".
>
> I'm concerned that if I restart the running namenode, all edits since it
> stopped recording to disk will be lost.
>
> Is there anything I can do to resolve this?
>
> Thanks,
>
> Whitney
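P.S. For anyone who wants to script a check for this condition: the "NameDirStatuses" value is itself a JSON string nested inside the JMX response JSON, so it has to be decoded on its own after you pull it out of the /jmx payload. Here's a minimal Python sketch of that parsing, using the failed-state payload from this thread as sample input (in a real check the string would come from the namenode's /jmx endpoint; the helper name is just illustrative):

import json

def failed_name_dirs(name_dir_statuses: str) -> list:
    """Return the storage directories listed as failed in a
    NameDirStatuses JMX value (itself a JSON-encoded string)."""
    statuses = json.loads(name_dir_statuses)
    return sorted(statuses.get("failed", {}))

# Sample value copied from the failed state shown above. In practice it
# would be read out of the "beans" array at http://<namenode>:50070/jmx.
sample = '{"active":{},"failed":{"/opt/hdfs-data/dfs/name":"IMAGE_AND_EDITS"}}'
print(failed_name_dirs(sample))  # -> ['/opt/hdfs-data/dfs/name']

An empty list from a check like this is the healthy state; anything else means the namenode has stopped writing its image and/or edits to that directory.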
