[
https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HADOOP-16284:
-------------------------------------
Attachment: 4 kms, no KTS patch.png
> KMS Cache Miss Storm
> --------------------
>
> Key: HADOOP-16284
> URL: https://issues.apache.org/jira/browse/HADOOP-16284
> Project: Hadoop Common
> Issue Type: Bug
> Components: kms
> Affects Versions: 2.6.0
> Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
> Reporter: Wei-Chiu Chuang
> Priority: Major
> Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where it occasionally
> exhibited "No content to map" errors (this cluster ran an old version that
> doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes
> from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few
> hundred encryption keys. This is pretty unusual because most of the
> deployments known to us have at most a dozen keys. So in terms of number of
> keys, this cluster is 1-2 orders of magnitude above anyone else's.
> The high number of encryption keys increases the likelihood of key cache
> misses in KMS. In Cloudera's setup, each cache miss forces KMS to sync with
> its backend, the Cloudera Keytrustee Server. Plus, the high number of KMSes
> amplifies the latency, effectively causing a [cache miss
> storm|https://en.wikipedia.org/wiki/Cache_stampede].
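The stampede pattern above can be illustrated, along with one standard mitigation (per-key request coalescing, so that concurrent misses on the same key trigger a single backend fetch), with a minimal sketch. This is not the KMS code; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical key cache: computeIfAbsent runs the loader at most once
// per key, even under concurrent misses, so a burst of requests for the
// same key causes one backend fetch instead of a miss storm.
public class KeyCache {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> backend; // stands in for the key server round trip
    private final AtomicInteger backendCalls = new AtomicInteger(); // counts fetches, to show coalescing

    public KeyCache(Function<String, byte[]> backend) {
        this.backend = backend;
    }

    public byte[] get(String keyName) {
        // Concurrent callers missing on the same key block here while one
        // of them loads; the rest then read the cached value.
        return cache.computeIfAbsent(keyName, k -> {
            backendCalls.incrementAndGet();
            return backend.apply(k);
        });
    }

    public int getBackendCalls() {
        return backendCalls.get();
    }
}
```

With per-key coalescing, repeated lookups of one hot key cost one backend sync; without it, every miss in the burst hits the backend, which is what amplifies latency as the number of KMSes (and thus cold caches) grows.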
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will
> come up with a better name later, surely - and discovered a scalability bug
> in CKTS. The fix was then verified with the tool.
> Filing this bug so the community is aware of this issue. I don't have a
> solution in KMS for now, but we want to address this scalability problem in
> the near future because we are seeing use cases that require thousands of
> encryption keys.
> ----
> On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent
> fixes). A MapReduce job acquires at most 3 KMS delegation tokens, so in
> cases such as distcp it would fail to reach the 4th KMS on the remote
> cluster. I imagine similar issues exist for other execution engines, but I
> didn't test.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]