[
https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jakov Varenina resolved GEODE-10056.
------------------------------------
Fix Version/s: 1.16.0
Resolution: Fixed
> Gateway-reciver connection load mantained only on one locator
> -------------------------------------------------------------
>
> Key: GEODE-10056
> URL: https://issues.apache.org/jira/browse/GEODE-10056
> Project: Geode
> Issue Type: Bug
> Reporter: Jakov Varenina
> Assignee: Jakov Varenina
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> The first problem is that servers send incorrect gateway-receiver connection
> load to locators with CacheServerLoadMessage. The second problem is that the
> locator doesn't refresh gateway-receivers load per server in the local map
> with the load received in CacheServerLoadMessage. This seems to be a bug, as
> there is already a mechanism to track and store gateway-receiver connection
> load per server in the locator, but that load is never refreshed by a fault
> at the reception of CacheServerLoadMessage. Currently, receiver load is only
> refreshed/increased on the locator that is handling
> ClientConnectionRequest\{group=__recv_group...} and ClientConnectionResponse
> messages from a remote server that is trying to establish gateway sender
> connection. All other locators in a cluster will never refresh the
> gateway-receiver connection load in this case. When the locator that was
> serving remote gateway-senders goes down then a new locator will take that
> job. Problem is that the new locator will not have a correct load (it was
> never refreshed) and that would in most situations result in new
> gateway-sender connections being established in an unbalanced way.
> Way to reproduce the issue:
> Start 2 clusters, Let's call site1 the sending and site2 the receiving site,
> The receiving site should have at least 2 locators. Both have 2 servers. No
> regions are needed.
> Cluster-1 gfsh>list members
> Member Count : 3Name | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
> server11 | 10.0.2.15(server11:8358)<v1>:41003
> server12 | 10.0.2.15(server12:8717)<v2>:41005
>
> Cluster-2 gfsh>list members
> Member Count : 4Name | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
> locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
> server11 | 10.0.2.15(server11:8547)<v2>:41004
> server12 | 10.0.2.15(server12:8908)<v3>:41006
>
> Create GW receiver in Site2 on both servers.
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ | -----------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0 |
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 |
> Create GW sender in Site1 on both servers. Use 10 dispatcher threads for
> easier obervation.
> Cluster-1 gfsh>list gateways
> GatewaySender SectionGatewaySender Id | Member |
> Remote Cluster Id | Type | Status | Queued Events |
> Receiver Location
> ---------------- | ---------------------------------- | ----------------- |
> -------- | --------------------- | ------------- | -----------------
> senderTo2 | 10.0.2.15(server11:8358)<v1>:41003 | 2 |
> Parallel | Running and Connected | 0 | 10.0.2.15:5457
> senderTo2 | 10.0.2.15(server12:8717)<v2>:41005 | 2 |
> Parallel | Running and Connected | 0 | 10.0.2.15:5457
>
> Observe balance in GW receiver connections in Site2. It will be perfect.
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12 |
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12 |
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005,
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>
> 12 connections each - 10 payload + 2 ping connections.
> Now stop GW receiver in one server of site2. In Site1 do a stop/start
> gateway-sender command - all connections will go to the only receiver in
> site2 (as expected). Check it:
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22 |
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 |
>
> Now 22 in just one receiver - 20 payload + 1 ping from each sender.
> Stop GW sender in one server in Site1. Connection drops in GW receiver to
> half the value (also expected).
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 |
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0 |
>
> Now 11 as one sender from Site1 is stopped.
> Start the GW receiver in server of site2 (that was stopped before). It will
> not receive new connections just yet.
> Start GW sender in one server in Site1 (that was stopped before). All
> connections will land in receiver started before so the balance is there.
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 |
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11 |
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005,
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>
> 11 connections in each because we have perfect mapping server11 to server11
> and server12 to server12 (i.e. there is just 1 ping connection in each
> receiver). As expected - we see how balance was achieved. Stop GW sender in
> same server in Site1 again. Again, no connections in receiver of Site2 we
> just started (expected).
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11 |
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0 |
>
> Now stop one locator in Site2 - the one that was serving GW senders - it was
> locator10 in my case. Start GW sender in that server of Site1 again. Check
> the balance in Site2 GW receiver:
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section Member | Port | Sender
> Count | Senders Connected
> ---------------------------------- | ---- | ------------ |
> ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17 |
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003,
> 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6 |
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005,
> 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
> As you can see in above printout, connections aren't balanced correctly when
> connection request is sent to new locator.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)