I have a three-node HDFS cluster, all running on Amazon EC2 instances, which I am using as the backing store for HBase. Periodically, when I start the cluster, the name node stays in safe mode because it reports that the number of live datanodes has dropped to 0:
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

The datanode logs appear normal, with no errors indicated. The dfsadmin report says that both datanodes are normal and that the name node is in contact with them:

Safe mode is ON
Configured Capacity: 16637566976 (15.49 GB)
Present Capacity: 7941234688 (7.40 GB)
DFS Remaining: 7940620288 (7.40 GB)
DFS Used: 614400 (600 KB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: 172.31.52.176:50010 (dev2)
Hostname: dev2
Decommission Status : Normal
Configured Capacity: 8318783488 (7.75 GB)
DFS Used: 307200 (300 KB)
Non DFS Used: 3257020416 (3.03 GB)
DFS Remaining: 5061455872 (4.71 GB)
DFS Used%: 0.00%
DFS Remaining%: 60.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 04 15:47:00 EDT 2016

Name: 172.31.63.188:50010 (dev1)
Hostname: dev1
Decommission Status : Normal
Configured Capacity: 8318783488 (7.75 GB)
DFS Used: 307200 (300 KB)
Non DFS Used: 5439311872 (5.07 GB)
DFS Remaining: 2879164416 (2.68 GB)
DFS Used%: 0.00%
DFS Remaining%: 34.61%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 04 15:47:00 EDT 2016

If I force the name node out of safe mode, the fsck command says that the file system is corrupt. When this happens, the only thing I've been able to do to get it back is to reformat the HDFS file system. I have not changed the configuration of the cluster; this just seems to occur at random. The system is in development, but this will be unacceptable in production. I'm using Hadoop version 2.7.3. Thank you in advance for any help.
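For reference, these are roughly the commands I run when this happens (standard HDFS CLI invocations; the exact output quoted above came from the report command):

```shell
# Generate the cluster report quoted above
hdfs dfsadmin -report

# Check safe mode status, then force the name node out of safe mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave

# After leaving safe mode, this reports the file system as corrupt
hdfs fsck /
```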
