[
https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HADOOP-16284:
-------------------------------------
Attachment: 4 kms, no KTS patch.png
> KMS Cache Miss Storm
> --------------------
>
> Key: HADOOP-16284
> URL: https://issues.apache.org/jira/browse/HADOOP-16284
> Project: Hadoop Common
> Issue Type: Bug
> Components: kms
> Affects Versions: 2.6.0
> Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
> Reporter: Wei-Chiu Chuang
> Priority: Major
> Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where it occasionally
> exhibited "No content to map" errors (this cluster ran an old version that
> doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes
> from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few
> hundred encryption keys. This is pretty unusual because most of the
> deployments known to us have at most a dozen keys. So in terms of number of
> keys, this cluster is 1-2 orders of magnitude above anyone else's.
> The high number of encryption keys increases the likelihood of key cache
> misses in KMS. In Cloudera's setup, each cache miss forces KMS to sync with
> its backend, the Cloudera Keytrustee Server. Plus, the high number of KMSes
> amplifies the latency, effectively causing a [cache miss
> storm|https://en.wikipedia.org/wiki/Cache_stampede].
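The stampede pattern above can be illustrated, along with one standard mitigation (per-key request coalescing, so that concurrent misses on the same key trigger a single backend fetch), with a minimal sketch. This is not the KMS code; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical key cache: computeIfAbsent runs the loader at most once
// per key, even under concurrent misses, so a burst of requests for the
// same key causes one backend fetch instead of a miss storm.
public class KeyCache {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> backend; // stands in for the key server round trip
    private final AtomicInteger backendCalls = new AtomicInteger(); // counts fetches, to show coalescing

    public KeyCache(Function<String, byte[]> backend) {
        this.backend = backend;
    }

    public byte[] get(String keyName) {
        // Concurrent callers missing on the same key block here while one
        // of them loads; the rest then read the cached value.
        return cache.computeIfAbsent(keyName, k -> {
            backendCalls.incrementAndGet();
            return backend.apply(k);
        });
    }

    public int getBackendCalls() {
        return backendCalls.get();
    }
}
```

With per-key coalescing, repeated lookups of one hot key cost one backend sync; without it, every miss in the burst hits the backend, which is what amplifies latency as the number of KMSes (and thus cold caches) grows.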
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will
> come up with a better name later, surely - and discovered a scalability bug
> in CKTS. The fix was then verified with the tool.
> Filing this bug so the community is aware of this issue. I don't have a
> solution in KMS for now, but we want to address this scalability problem in
> the near future because we are seeing use cases that require thousands of
> encryption keys.
> ----
> On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent
> fixes). A MapReduce job acquires at most 3 KMS delegation tokens, so in
> cases such as distcp it would fail to reach the 4th KMS on the remote
> cluster. I imagine similar issues exist for other execution engines, but I
> didn't test.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]