[ https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakov Varenina updated GEODE-10056:
-----------------------------------
    Summary: Gateway-reciver connection load mantained only on one locator  (was: Gateway-reciver load mantained only on one locator)

> Gateway-reciver connection load mantained only on one locator
> -------------------------------------------------------------
>
>                 Key: GEODE-10056
>                 URL: https://issues.apache.org/jira/browse/GEODE-10056
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: needsTriage
>
> When a GW sender wants to create a connection to a receiver, it asks the
> remote locator where to connect (which server) using a
> CLIENT_CONNECTION_REQUEST message. The locator should check the load
> (actually just the connection count in each GW receiver) and respond with
> the least loaded server.
> But servers do not track the load for their GW receiver acceptor! It is
> always 0. What happens then?
> It looks like each locator maintains its own map of the load, based on the
> connections it has handed out, so there are no balancing problems until
> either that locator restarts or clients get their connections from some
> other locator in the cluster. Both are quite valid scenarios in my opinion,
> and the net result is an imbalance in replication connections.
> How to test?
> Start 2 clusters. Let's call site1 the sending site and site2 the receiving
> site. The receiving site should have at least 2 locators. Both sites have
> 2 servers. No regions are needed.
> Cluster-1 gfsh>list members
> Member Count : 3
>
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
> server11  | 10.0.2.15(server11:8358)<v1>:41003
> server12  | 10.0.2.15(server12:8717)<v2>:41005
>
> Cluster-2 gfsh>list members
> Member Count : 4
>
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
> locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
> server11  | 10.0.2.15(server11:8547)<v2>:41004
> server12  | 10.0.2.15(server12:8908)<v3>:41006
>
> Create a GW receiver in Site2 on both servers.
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | -----------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0            |
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>
> Create a GW sender in Site1 on both servers. Use 10 dispatcher threads for
> easier observation.
> Cluster-1 gfsh>list gateways
> GatewaySender Section
>
> GatewaySender Id | Member                             | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
> ---------------- | ---------------------------------- | ----------------- | -------- | --------------------- | ------------- | -----------------
> senderTo2        | 10.0.2.15(server11:8358)<v1>:41003 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
> senderTo2        | 10.0.2.15(server12:8717)<v2>:41005 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
>
> Observe the balance of GW receiver connections in Site2. It will be perfect.
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>
> 12 connections each: 10 payload + 2 ping connections.
> Now stop the GW receiver in one server of Site2. In Site1 do a stop/start
> gateway-sender command. All connections will go to the only remaining
> receiver in Site2 (as expected). Check it:
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>
> Now there are 22 in just one receiver: 20 payload + 1 ping from each sender.
> Stop the GW sender in one server in Site1. The connection count in the GW
> receiver drops to half the value (also expected).
>
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>
> Now 11, as one sender in Site1 is stopped.
> Start the GW receiver in the server of Site2 that was stopped before. It
> will not receive new connections just yet.
> Start the GW sender in the server of Site1 that was stopped before. All of
> its connections will land in the receiver that was just started, so the
> balance is restored.
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>
> 11 connections in each, because we have a perfect mapping of server11 to
> server11 and server12 to server12 (i.e. there is just 1 ping connection in
> each receiver). As expected, balance was achieved. Stop the GW sender in
> the same server in Site1 again. Again, there are no connections in the
> Site2 receiver we just started (expected).
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0            |
>
> Now stop one locator in Site2, the one that was serving the GW senders (it
> was locator10 in my case). Start the GW sender in that server of Site1
> again. Check the balance in the Site2 GW receivers:
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
>
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6            | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>
> As you can see in the printout above, connections are not balanced
> correctly when the connection request is sent to a new locator.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
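
The 17 versus 6 split in the last listing is consistent with the behaviour described in the report: a locator that only counts the connections it handed out itself sees both receivers at load 0 after a switch to another locator. The toy model below is a minimal sketch of that idea, not Geode's actual code. The `Locator` class, `pick_server`, and `start_sender` are hypothetical names, and the rule "one ping connection per receiver the sender uses" is an assumption inferred from the counts in the report.

```python
class Locator:
    """Toy locator (hypothetical): knows only the load it handed out itself."""

    def __init__(self, servers):
        # Receivers never report real load, so every estimate starts at 0.
        self.estimated_load = {s: 0 for s in servers}

    def pick_server(self):
        # Least loaded by the local estimate; ties broken by server name.
        best = min(self.estimated_load,
                   key=lambda s: (self.estimated_load[s], s))
        self.estimated_load[best] += 1
        return best


def start_sender(locator, actual, dispatcher_threads=10):
    """Open payload connections, then one ping connection per receiver used."""
    used = set()
    for _ in range(dispatcher_threads):
        server = locator.pick_server()
        actual[server] += 1
        used.add(server)
    for server in used:
        actual[server] += 1  # assumed: one ping connection per receiver in use
    return actual


servers = ["server11", "server12"]

# Locator that handed out the 10 existing payload connections to server11.
informed = Locator(servers)
informed.estimated_load["server11"] = 10
balanced = start_sender(informed, {"server11": 11, "server12": 0})

# A different locator with an empty map: it believes both receivers are idle.
fresh = Locator(servers)
skewed = start_sender(fresh, {"server11": 11, "server12": 0})

print(balanced)  # {'server11': 11, 'server12': 11}
print(skewed)    # {'server11': 17, 'server12': 6}
```

Under these assumptions the informed locator sends all 10 payload connections to the idle receiver (11 and 11), while the fresh locator alternates between the two (5 payload + 1 ping each), which matches the 17 and 6 observed in the report.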