[ https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakov Varenina updated GEODE-10056:
-----------------------------------
    Description:

The first problem is that servers send an incorrect gateway-receiver connection load to locators in CacheServerLoadMessage. The second problem is that the locator doesn't refresh the per-server gateway-receiver load in its local map with the load received in CacheServerLoadMessage. This seems to be a bug: the locator already has a mechanism to track and store the gateway-receiver connection load per server, but by mistake that load is never refreshed on reception of CacheServerLoadMessage.

Currently, the receiver load is only refreshed/increased on the locator that is handling the ClientConnectionRequest{group=__recv_group...} and ClientConnectionResponse messages from the remote server that is trying to establish a gateway-sender connection. All other locators in the cluster never refresh the gateway-receiver connection load in this case. When the locator that was serving remote gateway-senders goes down, a new locator takes over that job. The problem is that the new locator does not have the correct load (it was never refreshed), and in most situations that results in new gateway-sender connections being established in an unbalanced way.

Way to reproduce the issue:

Start 2 clusters. Let's call site1 the sending site and site2 the receiving site. The receiving site should have at least 2 locators. Both sites have 2 servers. No regions are needed.

Cluster-1 gfsh>list members
Member Count : 3

Name      | Id
--------- | -------------------------------------------------------------
locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
server11  | 10.0.2.15(server11:8358)<v1>:41003
server12  | 10.0.2.15(server12:8717)<v2>:41005

Cluster-2 gfsh>list members
Member Count : 4

Name      | Id
--------- | -------------------------------------------------------------
locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
server11  | 10.0.2.15(server11:8547)<v2>:41004
server12  | 10.0.2.15(server12:8908)<v3>:41006

Create GW receiver in Site2 on both servers.

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0            |
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |

Create GW sender in Site1 on both servers. Use 10 dispatcher threads for easier observation.

Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                             | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
---------------- | ---------------------------------- | ----------------- | -------- | --------------------- | ------------- | -----------------
senderTo2        | 10.0.2.15(server11:8358)<v1>:41003 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
senderTo2        | 10.0.2.15(server12:8717)<v2>:41005 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457

Observe the balance in GW receiver connections in Site2. It will be perfect.

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..

12 connections each - 10 payload + 2 ping connections.

Now stop GW receiver in one server of site2. In Site1 do a stop/start gateway-sender command - all connections will go to the only receiver in site2 (as expected). Check it:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |

Now 22 in just one receiver - 20 payload + 1 ping from each sender.

Stop GW sender in one server in Site1. Connection drops in GW receiver to half the value (also expected).

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |

Now 11 as one sender from Site1 is stopped.

Start the GW receiver in server of site2 (that was stopped before). It will not receive new connections just yet.

Start GW sender in one server in Site1 (that was stopped before). All connections will land in receiver started before so the balance is there.

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..

11 connections in each because we have perfect mapping server11 to server11 and server12 to server12 (i.e. there is just 1 ping connection in each receiver). As expected - we see how balance was achieved.

Stop GW sender in same server in Site1 again. Again, no connections in receiver of Site2 we just started (expected).

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0            |

Now stop one locator in Site2 - the one that was serving GW senders - it was locator10 in my case. Start GW sender in that server of Site1 again. Check the balance in Site2 GW receiver:

Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6            | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..

As you can see in the above printout, connections aren't balanced correctly when the connection request is sent to the new locator.
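
The 17/6 split in the last printout is what one would expect from a locator that balances on a load map that was never refreshed. The following is only a minimal sketch to illustrate that reasoning, not Geode code: the class and method names are hypothetical, and it assumes that the locator hands out the server with the lowest locally tracked load and that a sender opens one ping connection per receiver it ends up using (an assumption that is consistent with the counts reported above).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class StaleLocatorLoadSketch {

    /** Hypothetical stand-in for a locator's per-server receiver load map. */
    static final class Locator {
        private final Map<String, Integer> estimatedLoad = new TreeMap<>();

        Locator(String... servers) {
            for (String server : servers) {
                estimatedLoad.put(server, 0); // a locator that never saw any load
            }
        }

        /** Picks the server with the lowest estimated load and bumps the local estimate. */
        String pickReceiver() {
            String best = null;
            for (Map.Entry<String, Integer> e : estimatedLoad.entrySet()) {
                if (best == null || e.getValue() < estimatedLoad.get(best)) {
                    best = e.getKey();
                }
            }
            estimatedLoad.merge(best, 1, Integer::sum);
            return best;
        }

        /** The step the report says is missing: take over the load reported by the servers. */
        void refresh(Map<String, Integer> reportedLoad) {
            estimatedLoad.putAll(reportedLoad);
        }
    }

    /** Connects one sender and returns the resulting connection count per receiver. */
    static Map<String, Integer> connectSender(
            Locator locator, Map<String, Integer> actual, int dispatcherThreads) {
        Map<String, Integer> result = new TreeMap<>(actual);
        Set<String> used = new TreeSet<>();
        for (int i = 0; i < dispatcherThreads; i++) {
            String target = locator.pickReceiver();
            result.merge(target, 1, Integer::sum); // one payload connection
            used.add(target);
        }
        for (String server : used) {
            result.merge(server, 1, Integer::sum); // assumed: one ping per receiver used
        }
        return result;
    }

    public static void main(String[] args) {
        // State before the last step of the reproduction: 11 connections on server11, 0 on server12.
        Map<String, Integer> actual = new TreeMap<>();
        actual.put("server11", 11);
        actual.put("server12", 0);

        Locator stale = new Locator("server11", "server12");
        System.out.println("stale locator:     " + connectSender(stale, actual, 10));

        Locator refreshed = new Locator("server11", "server12");
        refreshed.refresh(actual);
        System.out.println("refreshed locator: " + connectSender(refreshed, actual, 10));
    }
}
```

Under these assumptions the stale locator ends up with 17 connections on server11 and 6 on server12, as in the printout, while the refreshed locator ends up with 11 on each.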

> Gateway-reciver connection load mantained only on one locator
> -------------------------------------------------------------
>
>                 Key: GEODE-10056
>                 URL: https://issues.apache.org/jira/browse/GEODE-10056
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: needsTriage

--
This message was sent by Atlassian Jira
(v8.20.1#820001)