[ https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anilkumar Gingade updated GEODE-10056: -------------------------------------- Labels: pull-request-available (was: needsTriage pull-request-available) > Gateway-reciver connection load mantained only on one locator > ------------------------------------------------------------- > > Key: GEODE-10056 > URL: https://issues.apache.org/jira/browse/GEODE-10056 > Project: Geode > Issue Type: Bug > Reporter: Jakov Varenina > Assignee: Jakov Varenina > Priority: Major > Labels: pull-request-available > > The first problem is that servers send incorrect gateway-receiver connection > load to locators with CacheServerLoadMessage. The second problem is that the > locator doesn't refresh gateway-receivers load per server in the local map > with the load received in CacheServerLoadMessage. This seems to be a bug, as > there is already a mechanism to track and store gateway-receiver connection > load per server in the locator, but that load is never refreshed by a fault > at the reception of CacheServerLoadMessage. Currently, receiver load is only > refreshed/increased on the locator that is handling > ClientConnectionRequest\{group=__recv_group...} and ClientConnectionResponse > messages from a remote server that is trying to establish gateway sender > connection. All other locators in a cluster will never refresh the > gateway-receiver connection load in this case. When the locator that was > serving remote gateway-senders goes down then a new locator will take that > job. Problem is that the new locator will not have a correct load (it was > never refreshed) and that would in most situations result in new > gateway-sender connections being established in an unbalanced way. > Way to reproduce the issue: > Start 2 clusters, Let's call site1 the sending and site2 the receiving site, > The receiving site should have at least 2 locators. Both have 2 servers. No > regions are needed. > Cluster-1 gfsh>list members > Member Count : 3Name | Id > --------- | ------------------------------------------------------------- > locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator] > server11 | 10.0.2.15(server11:8358)<v1>:41003 > server12 | 10.0.2.15(server12:8717)<v2>:41005 > > Cluster-2 gfsh>list members > Member Count : 4Name | Id > --------- | ------------------------------------------------------------- > locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator] > locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002 > server11 | 10.0.2.15(server11:8547)<v2>:41004 > server12 | 10.0.2.15(server12:8908)<v3>:41006 > > Create GW receiver in Site2 on both servers. > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | ----------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0 | > 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 | > Create GW sender in Site1 on both servers. Use 10 dispatcher threads for > easier obervation. > Cluster-1 gfsh>list gateways > GatewaySender SectionGatewaySender Id | Member | > Remote Cluster Id | Type | Status | Queued Events | > Receiver Location > ---------------- | ---------------------------------- | ----------------- | > -------- | --------------------- | ------------- | ----------------- > senderTo2 | 10.0.2.15(server11:8358)<v1>:41003 | 2 | > Parallel | Running and Connected | 0 | 10.0.2.15:5457 > senderTo2 | 10.0.2.15(server12:8717)<v2>:41005 | 2 | > Parallel | Running and Connected | 0 | 10.0.2.15:5457 > > Observe balance in GW receiver connections in Site2. It will be perfect. > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12 | > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12 | > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:.. > > 12 connections each - 10 payload + 2 ping connections. > Now stop GW receiver in one server of site2. In Site1 do a stop/start > gateway-sender command - all connections will go to the only receiver in > site2 (as expected). Check it: > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22 | > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 | > > Now 22 in just one receiver - 20 payload + 1 ping from each sender. > Stop GW sender in one server in Site1. Connection drops in GW receiver to > half the value (also expected). > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 | > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 | > > Now 11 as one sender from Site1 is stopped. > Start the GW receiver in server of site2 (that was stopped before). It will > not receive new connections just yet. > Start GW sender in one server in Site1 (that was stopped before). All > connections will land in receiver started before so the balance is there. > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 | > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11 | > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:.. > > 11 connections in each because we have perfect mapping server11 to server11 > and server12 to server12 (i.e. there is just 1 ping connection in each > receiver). As expected - we see how balance was achieved. Stop GW sender in > same server in Site1 again. Again, no connections in receiver of Site2 we > just started (expected). > > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 | > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0 | > > Now stop one locator in Site2 - the one that was serving GW senders - it was > locator10 in my case. Start GW sender in that server of Site1 again. Check > the balance in Site2 GW receiver: > Cluster-2 gfsh>list gateways > GatewayReceiver Section Member | Port | Sender > Count | Senders Connected > ---------------------------------- | ---- | ------------ | > --------------------------------------------------------------------------------------------------------------------------------- > 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17 | > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, > 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:.. > 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6 | > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, > 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:.. > As you can see in above printout, connections aren't balanced correctly when > connection request is sent to new locator. -- This message was sent by Atlassian Jira (v8.20.1#820001)