zhangshuyan0 opened a new pull request, #5938: URL: https://github.com/apache/hadoop/pull/5938
When a datanode completes a block recovery, it calls the `commitBlockSynchronization` method to notify the NN of the new locations of the block. For an EC block group, the NN determines the block index of each storage based on its position in the `newtargets` parameter. https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L4059-L4081

If the internal blocks written by the client do not have contiguous indices, the current datanode code can cause the NN to record incorrect block metadata. For simplicity, take RS(3,2) as an example. The timeline of the problem is as follows:

1. The client plans to write internal blocks with indices [0,1,2,3,4] to datanodes [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unreachable, so the client only writes data to the remaining 4 datanodes;
2. The client crashes;
3. The NN fails over;
4. The content of `uc.getExpectedStorageLocations()` in the new ANN now depends entirely on block reports, and is `<dn0, dn2, dn3, dn4>`;
5. When the lease hard limit expires, the NN issues a block recovery command;
6. The datanode that receives the recovery command fills `DatanodeID[] newLocs` with [dn0, null, dn2, dn3, dn4]; https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L471-L480
7. The serialization process filters out null values, so the parameter passed to the NN becomes [dn0, dn2, dn3, dn4]; https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeProtocolClientSideTranslatorPB.java#L322-L328
8. The NN mistakenly believes that dn2 stores the internal block with index 1, dn3 stores the internal block with index 2, and so on.
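The index shift in steps 6–8 can be reproduced in isolation. The sketch below is illustrative only: it uses plain `String` names in place of `DatanodeID`, and `serialize` stands in for the null-dropping behavior of the protobuf translator, not the actual Hadoop code path.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexShiftDemo {
    // Hypothetical stand-in for the recovery locations the datanode builds:
    // slot i holds the replica for internal block index i, or null if that
    // index was never written (index 1 here, because dn1 was unreachable).
    static final String[] NEW_LOCS = {"dn0", null, "dn2", "dn3", "dn4"};

    // Mimics the translator's serialization step: null entries are silently
    // dropped, so the positional information is lost on the wire.
    static List<String> serialize(String[] locs) {
        List<String> out = new ArrayList<>();
        for (String loc : locs) {
            if (loc != null) {
                out.add(loc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> received = serialize(NEW_LOCS);
        // NN side: block index is derived purely from position in newtargets,
        // so every datanode after the dropped slot shifts down by one.
        for (int i = 0; i < received.size(); i++) {
            System.out.println("NN records index " + i + " -> " + received.get(i));
        }
        // dn2 actually holds internal block index 2, but the NN records it
        // as holding index 1.
    }
}
```

Running this prints `NN records index 1 -> dn2`, which is exactly the wrong metadata described in step 8.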
https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L4068-L4080

The above timeline is just one example; other situations can lead to the same error, such as a pipeline update occurring on the client side. We should fix this bug.
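To illustrate the principle behind a fix (a hedged sketch only, not the actual patch in this PR), the ambiguity disappears if each reported location carries its block index explicitly instead of relying on position. All names below (`IndexedTarget`, `report`) are hypothetical, with `String` again standing in for `DatanodeID`:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexedTarget {
    final int blockIndex;   // index of the internal block within the group
    final String datanode;  // hypothetical stand-in for DatanodeID

    IndexedTarget(int blockIndex, String datanode) {
        this.blockIndex = blockIndex;
        this.datanode = datanode;
    }

    // Build the recovery report with indices kept explicit, so dropping an
    // unwritten slot cannot shift the indices of the remaining locations.
    static List<IndexedTarget> report(String[] locsByIndex) {
        List<IndexedTarget> out = new ArrayList<>();
        for (int i = 0; i < locsByIndex.length; i++) {
            if (locsByIndex[i] != null) {
                out.add(new IndexedTarget(i, locsByIndex[i]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (IndexedTarget t : report(new String[]{"dn0", null, "dn2", "dn3", "dn4"})) {
            System.out.println("index " + t.blockIndex + " -> " + t.datanode);
        }
        // dn2 is still reported as holding index 2, even though slot 1 was
        // dropped from the message.
    }
}
```

With this shape, the receiving side never needs to infer an index from a position, so a missing internal block cannot corrupt the metadata of the blocks that were written.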
