Dear Hadoop community,
We found that DFSClient failed to read a block because its block token had expired.
Here are the details:
1. DFSClient tried to connect to a datanode, but doSaslHandshake failed with an
IOException, so DFSClient added the datanode to its deadNodes list:
2019-12-06 14:25:52,911 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.5.183:50010 for block, add to deadNodes and continue. java.io.IOException: DIGEST-MD5: IO error acquiring password
java.io.IOException: DIGEST-MD5: IO error acquiring password
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:419)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
    at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3453)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:789)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:886)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:938)
    at java.io.DataInputStream.read(DataInputStream.java:149)
2. From the datanode's log, we can see the failure was caused by InvalidToken:
2019-12-06 14:25:52,908 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: tdh185:50010:DataXceiver error processing unknown operation src: /192.168.5.181:52500 dst: /192.168.5.183:50010
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.]
    at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:594)
    at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:373)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:298)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:125)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:193)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.
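(Side note: the expiryDate in the token identifier is epoch milliseconds, and decoding it confirms the token had indeed already expired about a day before the read at 2019-12-06 14:25. A quick check:)

```java
import java.time.Instant;

public class TokenExpiryCheck {
    public static void main(String[] args) {
        // expiryDate from the block_token_identifier in the datanode log above
        Instant expiry = Instant.ofEpochMilli(1575548483208L);
        System.out.println(expiry); // 2019-12-05T12:21:23.208Z
    }
}
```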
3. Since there are 3 replicas, DFSClient also tried the other 2 datanodes and
got the same error:
2019-12-06 14:25:53,819 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 from any node: java.io.IOException: No live nodes contain block BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 after checking nodes = [DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK], DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK], DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]], ignoredNodes = null
No live nodes contain current block
Block locations:
  DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
  DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK]
  DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
Dead nodes:
  DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
  DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
  DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK]
Will get new block locations from namenode and retry...
------
As you can see, the key problem is that DFSClient cannot learn the root cause
(InvalidToken), because the datanode checks only for
InvalidEncryptionKeyException; for InvalidToken it just returns a generic SASL
error. We can see this in SaslDataTransferServer.doSaslHandshake:
try {
  ...
  localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
  ...
} catch (IOException ioe) {
  if (ioe instanceof SaslException &&
      ioe.getCause() != null &&
      ioe.getCause() instanceof InvalidEncryptionKeyException) {
    // This could just be because the client is long-lived and hasn't gotten
    // a new encryption key from the NN in a while. Upon receiving this
    // error, the client will get a new encryption key from the NN and retry
    // connecting to this DN.
    sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
  } else {
    sendGenericSaslErrorMessage(out, ioe.getMessage());
  }
  throw ioe;
}
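If the server also recognized InvalidToken as a cause, it could send a specific error instead of the generic one, and the client could react accordingly. Here is a minimal, self-contained sketch of that classification idea; the exception classes and error-code names below are stand-ins for illustration, not the real Hadoop types:

```java
// Hypothetical stand-ins for the Hadoop exception types, for illustration only.
class InvalidEncryptionKeyException extends java.io.IOException {
    InvalidEncryptionKeyException(String msg) { super(msg); }
}
class InvalidTokenException extends java.io.IOException {
    InvalidTokenException(String msg) { super(msg); }
}

public class SaslErrorDemo {
    // Classify a SASL handshake failure by its cause, so the client can react
    // specifically: refresh the encryption key, re-fetch the block token, or
    // fall back to blacklisting the datanode on an opaque failure.
    static String classify(java.io.IOException ioe) {
        Throwable cause = ioe.getCause();
        if (cause instanceof InvalidEncryptionKeyException) {
            return "ERROR_INVALID_KEY";   // client fetches a new encryption key from the NN
        } else if (cause instanceof InvalidTokenException) {
            return "ERROR_INVALID_TOKEN"; // client could re-fetch the block token from the NN
        }
        return "ERROR_GENERIC";           // opaque failure: client adds the DN to deadNodes
    }

    public static void main(String[] args) {
        // Model the failure from the logs: a SASL IOException caused by an expired token.
        java.io.IOException tokenFailure = new java.io.IOException(
            "DIGEST-MD5: IO error acquiring password",
            new InvalidTokenException("Block token is expired"));
        System.out.println(classify(tokenFailure)); // prints ERROR_INVALID_TOKEN
    }
}
```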
So could you help explain why InvalidToken is not checked? After all, if the
client knew the root cause, it could simply re-fetch the token from the
namenode and retry immediately.
Many thanks!
--
Best Regards!
Wenqi