[jira] [Resolved] (GEODE-10056) Gateway-receiver connection load maintained only on one locator

Jakov Varenina (Jira) Thu, 08 Sep 2022 04:05:05 -0700


     [ 
https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jakov Varenina resolved GEODE-10056.
------------------------------------
    Fix Version/s: 1.16.0
       Resolution: Fixed

> Gateway-receiver connection load maintained only on one locator
> ---------------------------------------------------------------
>
>                 Key: GEODE-10056
>                 URL: https://issues.apache.org/jira/browse/GEODE-10056
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> The first problem is that servers send an incorrect gateway-receiver
> connection load to locators in CacheServerLoadMessage. The second problem is
> that the locator doesn't refresh the per-server gateway-receiver load in its
> local map with the load received in CacheServerLoadMessage. This seems to be
> a bug: the locator already has a mechanism to track and store the
> gateway-receiver connection load per server, but because of a fault that
> load is never refreshed when a CacheServerLoadMessage is received. Currently,
> the receiver load is only refreshed/increased on the locator that is handling
> the ClientConnectionRequest{group=__recv_group...} and
> ClientConnectionResponse messages from a remote server that is trying to
> establish a gateway-sender connection. All other locators in the cluster
> never refresh the gateway-receiver connection load in this case. When the
> locator that was serving remote gateway-senders goes down, another locator
> takes over that job. The problem is that the new locator does not have the
> correct load (it was never refreshed), which in most situations results in
> new gateway-sender connections being established in an unbalanced way.
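
The second problem described above can be illustrated with a small toy model. This is only a sketch and not Geode code: `LocatorModel`, `simulate`, and the use of whole connection counts as "load" are assumptions made for illustration, and the servers are assumed to report a correct load (so the first problem is not modeled).

```python
from dataclasses import dataclass, field


@dataclass
class LocatorModel:
    """Hypothetical stand-in for a locator's per-server receiver-load map."""
    refresh_on_load_message: bool
    receiver_load: dict = field(default_factory=dict)

    def register(self, server: str) -> None:
        self.receiver_load.setdefault(server, 0)

    def handle_connection_request(self) -> str:
        # Pick the receiver this locator believes is least loaded and
        # increase the local estimate for it.
        server = min(self.receiver_load, key=self.receiver_load.get)
        self.receiver_load[server] += 1
        return server

    def on_load_message(self, server: str, reported_load: int) -> None:
        # Models reception of a load report; the flag toggles whether the
        # receiver load in the local map is refreshed from it.
        if self.refresh_on_load_message:
            self.receiver_load[server] = reported_load


def simulate(refresh: bool):
    serving = LocatorModel(refresh)   # handles all connection requests
    standby = LocatorModel(refresh)   # only ever sees load reports
    for locator in (serving, standby):
        locator.register("server11")
        locator.register("server12")

    actual = {"server11": 0, "server12": 0}
    for _ in range(10):
        actual[serving.handle_connection_request()] += 1

    # Every server reports its real load to every locator.
    for locator in (serving, standby):
        for server, load in actual.items():
            locator.on_load_message(server, load)
    return actual, standby.receiver_load


print(simulate(refresh=False))  # standby map never leaves its initial state
print(simulate(refresh=True))   # standby map matches the real load
```

Without the refresh, the standby locator's map still shows no receiver connections even though ten exist, which is the state a newly serving locator would start from.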
> Way to reproduce the issue:
> Start 2 clusters. Let's call site1 the sending site and site2 the receiving
> site. The receiving site should have at least 2 locators. Both sites have 2
> servers. No regions are needed.
> Cluster-1 gfsh>list members
> Member Count : 3
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
> server11  | 10.0.2.15(server11:8358)<v1>:41003
> server12  | 10.0.2.15(server12:8717)<v2>:41005
>  
> Cluster-2 gfsh>list members
> Member Count : 4
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
> locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
> server11  | 10.0.2.15(server11:8547)<v2>:41004
> server12  | 10.0.2.15(server12:8908)<v3>:41006
>  
> Create GW receiver in Site2 on both servers.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | -----------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0            |
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
> Create GW sender in Site1 on both servers. Use 10 dispatcher threads for
> easier observation.
> Cluster-1 gfsh>list gateways
> GatewaySender Section
> GatewaySender Id |               Member               | Remote Cluster Id |   Type   |        Status         | Queued Events | Receiver Location
> ---------------- | ---------------------------------- | ----------------- | -------- | --------------------- | ------------- | -----------------
> senderTo2        | 10.0.2.15(server11:8358)<v1>:41003 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
> senderTo2        | 10.0.2.15(server12:8717)<v2>:41005 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
>  
> Observe balance in GW receiver connections in Site2. It will be perfect.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>  
> 12 connections each - 10 payload + 2 ping connections.
> Now stop GW receiver in one server of site2. In Site1 do a stop/start 
> gateway-sender command - all connections will go to the only receiver in 
> site2 (as expected). Check it:
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>  
> Now 22 in just one receiver - 20 payload + 1 ping from each sender.
> Stop the GW sender on one server in Site1. The connection count in the GW
> receiver drops to half (also expected).
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>  
> Now 11 as one sender from Site1 is stopped.
> Start the GW receiver on the Site2 server where it was stopped before. It
> will not receive new connections just yet.
> Start the GW sender on the Site1 server where it was stopped before. All of
> its connections will land in the receiver that was just restarted, so the
> load is balanced.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>  
> 11 connections in each because we have a perfect mapping of server11 to
> server11 and server12 to server12 (i.e. there is just 1 ping connection in
> each receiver). As expected, balance was achieved. Stop the GW sender on the
> same server in Site1 again. Again, there are no connections in the Site2
> receiver we just restarted (expected).
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0            |
>  
> Now stop one locator in Site2 - the one that was serving GW senders - it was 
> locator10 in my case. Start GW sender in that server of Site1 again. Check 
> the balance in Site2 GW receiver:
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>               Member               | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6            | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
> As you can see in the printout above, connections aren't balanced correctly
> when the connection request is sent to the new locator.
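
The 17 / 6 split in the last listing is consistent with a simple counting model, sketched below. This is not Geode code, and the model rests on assumptions inferred from the counts in the report: `place_sender` is a hypothetical helper, the locator is assumed to send each payload connection to the receiver with the lowest load in its own map, and a sender is assumed to open one ping connection to each receiver on which it has payload connections.

```python
def place_sender(existing, locator_view, dispatcher_threads=10):
    """Add one sender's connections to `existing`, guided by the locator's view.

    existing:      real connection count per receiver before the sender starts
    locator_view:  load per receiver as the serving locator believes it to be
    """
    counts = dict(existing)
    view = dict(locator_view)
    receivers_in_use = set()

    # One payload connection per dispatcher thread, each sent to the receiver
    # the locator believes is least loaded.
    for _ in range(dispatcher_threads):
        target = min(view, key=view.get)
        view[target] += 1
        counts[target] += 1
        receivers_in_use.add(target)

    # Assumption: one ping connection per receiver this sender talks to.
    for receiver in receivers_in_use:
        counts[receiver] += 1
    return counts


# State before the last step of the reproduction: 11 connections on the
# server11 receiver, none on the server12 receiver.
existing = {"server11": 11, "server12": 0}

# New locator whose map was never refreshed (both receivers look idle).
stale = place_sender(existing, {"server11": 0, "server12": 0})

# Locator whose map reflects the real load.
fresh = place_sender(existing, existing)

print(stale)  # {'server11': 17, 'server12': 6}
print(fresh)  # {'server11': 11, 'server12': 11}
```

Under these assumptions, a locator with an up-to-date map would have produced the balanced 11 / 11 result seen earlier in the reproduction.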



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
