[jira] [Commented] (KAFKA-12677) The raftCluster always send to the wrong active controller and never update

Colin McCabe (Jira) Fri, 16 Apr 2021 10:16:04 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323955#comment-17323955
 ]


Colin McCabe commented on KAFKA-12677:
--------------------------------------

[~showuon]: thanks for looking at this.  This test failure has been frustrating 
everyone for a while.  It's great that you reproduced it with DEBUG logging.

bq. The raftCluster always send to the wrong active controller and never update

Hmm... If you do a grep for "Recorded new controller" you can see that it does 
change eventually.

{code}
[cmccabe@zeratul Downloads]$ grep 'Recorded new controller' 
raftClusterTest_bug.txt 
2021-04-13T13:40:21.708+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,325] INFO [broker2_:BrokerToControllerChannelManager broker=2 
name=forwarding]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:21.711+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,376] INFO [broker1_:BrokerToControllerChannelManager broker=1 
name=alterIsr]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:21.711+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,376] INFO [broker2_:BrokerToControllerChannelManager broker=2 
name=alterIsr]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:21.711+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,376] INFO [broker0_:BrokerToControllerChannelManager broker=0 
name=alterIsr]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:21.712+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,421] INFO [broker1_:BrokerToControllerChannelManager broker=1 
name=heartbeat]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
....
2021-04-13T13:40:21.837+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,660] INFO [broker0_:BrokerToControllerChannelManager broker=0 
name=heartbeat]: Recorded new controller, from now on will use broker 
localhost:54228 (id: 3002 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:21.866+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:21,866] INFO [broker2_:BrokerToControllerChannelManager broker=2 
name=heartbeat]: Recorded new controller, from now on will use broker 
localhost:54229 (id: 3000 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:22.324+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:22,013] INFO [broker1_:BrokerToControllerChannelManager broker=1 
name=heartbeat]: Recorded new controller, from now on will use broker 
localhost:54229 (id: 3000 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
2021-04-13T13:40:22.428+0800 [DEBUG] [TestEventLogger]     [2021-04-13 
13:40:22,073] INFO [broker0_:BrokerToControllerChannelManager broker=0 
name=heartbeat]: Recorded new controller, from now on will use broker 
localhost:54229 (id: 3000 rack: null) 
(kafka.server.BrokerToControllerRequestThread:66)
{code}

Maybe the question is why it takes so long to change our view of the controller 
on the broker side? That comes directly out of the raft client on the broker. 
cc [~hachikuji]

Another question is whether we should use the information passed back by the 
standby controller to fail over quicker.  You can see this in the error that 
was returned:

{code}
"errorCode":41,"errorMessage":"The active controller appears to be node 3000"
{code}

We could potentially try node 3000 after receiving that error (although this 
might complicate things in other ways...)

> The raftCluster always send to the wrong active controller and never update
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-12677
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12677
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.8.0
>            Reporter: Luke Chen
>            Assignee: Colin McCabe
>            Priority: Blocker
>
> We introduce KIP-500 to introduce a Self-Managed Metadata Quorum. We should 
> always have 1 active controller, and all the RPC will send to the active 
> controller. But there's chances that the active controller already changed, 
> but the RPC still send to the old one.
> In the attachment log, we can see:
> {code:java}
> [Controller 3002] Becoming active at controller epoch 1. 
> ...
> [Controller 3000] Becoming active at controller epoch 2. 
> {code}
> So, the latest active controller should be 3000. But the create topic RPC are 
> all sending to controller 3002:
> {code:java}
> "errorMessage":"The active controller appears to be node 3000"
> {code}
> This bug causes the RaftClusterTests flaky.
>  
> Debug log while running testCreateClusterAndCreateListDeleteTopic test: 
> https://drive.google.com/file/d/1WVUgy1Erjx8mHyofiP9MVvQGb0LcDYt3/view?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (KAFKA-12677) The raftCluster always send to the wrong active controller and never update

Reply via email to