[
https://issues.apache.org/jira/browse/KAFKA-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743308#comment-16743308
]
Guozhang Wang commented on KAFKA-7821:
--------------------------------------
[~MJarvie] Thanks for reporting the issue. My conjecture is that the cache
itself is not the root cause, but the underlying store itself had some issue
that it does not return all the sessions given a findSessions query. This issue
is hidden with a larger cache size because one can still find the session in
cache since it is big enough to not evict it. This can be verified by just
turning off cache at all and see if this issue still persists.
One thing that still bothers me is that, the bug we found is supposed to only
start in 0.11.0, but not 0.10.2 releases. So I'd suggest you to try out the fix
from trunk (it would be compatible with 2.1.0):
https://github.com/apache/kafka/pull/6134#issuecomment-453702517 and see if it
resolves the issue even without caching on top of the store (doable via
`withCaching(false)`).
> Streams: default cache size can lose session windows in high-throughput
> deployment
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-7821
> URL: https://issues.apache.org/jira/browse/KAFKA-7821
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 0.10.2.1, 2.1.0
> Reporter: Matthew Jarvie
> Priority: Major
>
> We have observed that with a default cache size, a Streams aggregator will
> sometimes fail to find existing, open session windows while handling records.
> The effect is that it starts a new session and overwrites the old one and
> events fail to aggregate together.
> Our topology is fairly simple: We consume from a Kafka topic, group by keys,
> aggregate, then produce to another topic. Our aggregator is configured to use
> a window session strategy with an inactivity gap of 10 minutes and a
> retention period of 10 minutes. The system is deployed in production and
> handles about 250k messages per thread per minute (4 threads per
> application). The cache size is left default (10 MB).
> We worked around the issue by enlarging the cache (cache.max.bytes.buffering
> configuration parameter from 10 MB to 100MB) and no longer observe the issue
> at all. While troubleshooting, we noticed that older sessions would be the
> ones lost, so it seems like the cache is an LRU cache and is evicting windows
> before their inactivity time is up.
> This was originally observed in 10.2.1. We completed an upgrade to 2.1.0 and
> still observed the issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)