[
https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835245#comment-16835245
]
Wei-Chiu Chuang edited comment on HADOOP-16284 at 5/8/19 2:00 AM:
------------------------------------------------------------------
{quote}Do you know why the number of keys is relevant? Is the key cache
evicting them due to size or the accesses for a particular key are more
distributed over time vs a few highly contended keys?
{quote}
I don't manage the KMS key provider backend (CKTS), so I'm afraid I can't offer
implementation details. IIRC, the minimum latency we observed was around 100 ms
(each KMS-to-CKTS connection involves PGP computation and other work, so they
tend to be slow). I am not sure whether the latency is proportional to the
number of encryption keys we have, but it is proportional to the number of KMS
instances, because the backend has a global write lock design and only one
request is allowed at a time.
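As a toy model of that behavior (the lock, class names, and the 100 ms figure below are illustrative assumptions on my part, not CKTS internals): when the backend serializes everything behind one global lock, each additional concurrent KMS instance queues behind the others, so the worst-case latency grows roughly linearly with the number of instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

public class GlobalLockModel {
    // Stand-in for the backend's global write lock: one request at a time.
    static final ReentrantLock globalLock = new ReentrantLock();

    // A single backend request costs ~100 ms (the minimum latency we observed).
    static void backendRequest() throws InterruptedException {
        globalLock.lock();
        try {
            Thread.sleep(100);
        } finally {
            globalLock.unlock();
        }
    }

    // Elapsed wall-clock time for n KMS instances hitting the backend at once.
    static long elapsedMillis(int kmsInstances) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(kmsInstances);
        long start = System.nanoTime();
        List<Future<?>> requests = new ArrayList<>();
        for (int i = 0; i < kmsInstances; i++) {
            requests.add(pool.submit(() -> { backendRequest(); return null; }));
        }
        for (Future<?> f : requests) {
            f.get();
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Requests serialize behind the lock, so elapsed time grows
        // roughly linearly with the number of concurrent instances.
        System.out.println("1 instance:  ~" + elapsedMillis(1) + " ms");
        System.out.println("4 instances: ~" + elapsedMillis(4) + " ms");
    }
}
```

With 4 instances the last request waits behind three 100 ms requests, which matches why going from 2 to 4 KMSes made things worse rather than better.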
We saw key provider latency going as high as 20 seconds per request during
testing with 4 KMS instances. Consider an extreme case: you start a KMS cold
and have many encryption zones/keys; it is likely to trigger multiple
consecutive cache misses immediately after the restart. In this case, we
observed a KMS outage lasting several minutes after a KMS restart. Even after
the KMS stabilizes, some encryption keys are rarely used, and when they are
used they trigger cache misses from time to time.
!4 kms, no KTS patch.png|width=512!
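To illustrate the cold-start failure mode (this is not the KMS implementation, just a hedged sketch with made-up names): on a cold cache, many concurrent requests all miss at once and each miss goes to the slow backend. Coalescing concurrent loads per key, as `ConcurrentHashMap.computeIfAbsent` does, bounds the backend load to one fetch per key rather than one per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class StampedeSketch {
    // Counts how many requests actually reach the slow backend.
    static final AtomicInteger backendCalls = new AtomicInteger();

    // Stand-in for a slow key-provider backend fetch (~100 ms per call).
    static String fetchFromBackend(String keyName) {
        backendCalls.incrementAndGet();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "material-for-" + keyName;
    }

    public static void main(String[] args) throws Exception {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(32);
        List<Future<String>> requests = new ArrayList<>();
        // 32 concurrent requests for only 4 distinct keys, all on a cold cache.
        for (int i = 0; i < 32; i++) {
            String key = "key-" + (i % 4);
            // computeIfAbsent coalesces concurrent misses on the same key:
            // at most one thread per key reaches the backend.
            requests.add(pool.submit(
                () -> cache.computeIfAbsent(key, StampedeSketch::fetchFromBackend)));
        }
        for (Future<String> f : requests) {
            f.get();
        }
        pool.shutdown();
        // A naive cache that fetched on every miss would make 32 backend
        // calls here; per-key coalescing makes at most 4.
        System.out.println("backend calls = " + backendCalls.get());
    }
}
```

Of course this only helps within one KMS process; with 4 independent KMSes a cold restart still multiplies the backend load by the instance count.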
Additionally, there's already a production workload on this KMS, and the KMS
runs out of threads easily. We actually saw the "No content to map" exception
despite very low CPU utilization, which puzzled us at first.
> KMS Cache Miss Storm
> --------------------
>
> Key: HADOOP-16284
> URL: https://issues.apache.org/jira/browse/HADOOP-16284
> Project: Hadoop Common
> Issue Type: Bug
> Components: kms
> Affects Versions: 2.6.0
> Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
> Reporter: Wei-Chiu Chuang
> Priority: Major
> Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where it occasionally
> exhibited the "No content to map" error (this cluster ran an old version that
> doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes
> from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few
> hundred encryption keys. This is pretty unusual, because most of the
> deployments known to us have at most a dozen keys. So in terms of the number
> of keys, this cluster is 1-2 orders of magnitude above anyone else.
> The high number of encryption keys increases the likelihood of key cache
> misses in KMS. In Cloudera's setup, each cache miss forces the KMS to sync
> with its backend, the Cloudera Keytrustee Server. On top of that, the high
> number of KMSes amplifies the latency, effectively causing a [cache miss
> storm|https://en.wikipedia.org/wiki/Cache_stampede].
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will
> surely come up with a better name later - and discovered a scalability bug in
> CKTS. The fix was then verified with the same tool.
> Filing this bug so the community is aware of this issue. I don't have a
> solution in KMS for now, but we want to address this scalability problem in
> the near future because we are seeing use cases that require thousands of
> encryption keys.
> ----
> On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent
> fixes). A MapReduce job acquires at most 3 KMS delegation tokens, so in some
> cases, such as distcp, it would fail to reach the 4th KMS on the remote
> cluster. I imagine similar issues exist for other execution engines, but I
> didn't test them.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)