[
https://issues.apache.org/jira/browse/KAFKA-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954789#comment-16954789
]
Vinoth Chandar commented on KAFKA-8994:
---------------------------------------
Closing this, and merging this scope into KAFKA-6144
> Streams should expose standby replication information & allow stale reads of
> state store
> ----------------------------------------------------------------------------------------
>
> Key: KAFKA-8994
> URL: https://issues.apache.org/jira/browse/KAFKA-8994
> Project: Kafka
> Issue Type: New Feature
> Components: streams
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: needs-kip
>
> Currently Streams interactive queries (IQ) fail during the time period where
> there is a rebalance in progress.
> Consider the following scenario in a three node Streams cluster with node A,
> node S and node R, executing a stateful sub-topology/topic group with 1
> partition and `_num.standby.replicas=1_`
> * *t0*: A is the active instance owning the partition, B is the standby that
> keeps replicating the A's state into its local disk, R just routes streams
> IQs to active instance using StreamsMetadata
> * *t1*: IQs pick node R as router, R forwards query to A, A responds back to
> R which reverse forwards back the results.
> * *t2:* Active A instance is killed and rebalance begins. IQs start failing
> to A
> * *t3*: Rebalance assignment happens and standby B is now promoted as active
> instance. IQs continue to fail
> * *t4*: B fully catches up to changelog tail and rewinds offsets to A's last
> commit position, IQs continue to fail
> * *t5*: IQs to R, get routed to B, which is now ready to serve results. IQs
> start succeeding again
>
> Depending on Kafka consumer group session/heartbeat timeouts, step t2,t3 can
> take few seconds (~10 seconds based on defaults values). Depending on how
> laggy the standby B was prior to A being killed, t4 can take few
> seconds-minutes.
> While this behavior favors consistency over availability at all times, the
> long unavailability window might be undesirable for certain classes of
> applications (e.g simple caches or dashboards).
> This issue aims to also expose information about standby B to R, during each
> rebalance such that the queries can be routed by an application to a standby
> to serve stale reads, choosing availability over consistency.
>
>
>
>
>
>
>
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)