[
https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jakov Varenina updated GEODE-10056:
-----------------------------------
Description:
When a GW sender wants to create a connection to a receiver, it asks the remote
locator which server to connect to, using a CLIENT_CONNECTION_REQUEST message.
The locator should check the load (actually just the connection count in each
GW receiver) and respond with the least loaded server.
But servers do not track the load for their GW receiver acceptor! It is always
0. What happens then?
It looks like each locator maintains its own map of the load, based on the
connections it has handed out, so there is no imbalance until either the
locator restarts or clients get their connections from some other locator in
the cluster. Both are quite valid scenarios in my opinion, and the net result
is an imbalance in replication connections.
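The behaviour described above can be illustrated with a small model. This is
only a sketch, not Geode's actual locator code: the class LocatorModel and its
methods are hypothetical names, and the tie-breaking rule (first receiver wins)
is an assumption. It shows that each locator only knows about the connections
it handed out itself, so a second locator starts from zero.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model: a locator that only counts connections it handed out itself. */
class LocatorModel {
    private final Map<String, Integer> estimatedLoad = new LinkedHashMap<>();

    LocatorModel(String... receivers) {
        for (String r : receivers) {
            // Receivers always report a load of 0, so every locator starts from 0.
            estimatedLoad.put(r, 0);
        }
    }

    /** Picks the receiver with the lowest local estimate and counts the new connection. */
    String pickReceiver() {
        String best = null;
        for (Map.Entry<String, Integer> e : estimatedLoad.entrySet()) {
            // Strict comparison: on a tie the earlier receiver wins (assumption).
            if (best == null || e.getValue() < estimatedLoad.get(best)) {
                best = e.getKey();
            }
        }
        estimatedLoad.merge(best, 1, Integer::sum);
        return best;
    }

    int estimate(String receiver) {
        return estimatedLoad.get(receiver);
    }
}

class LocatorModelDemo {
    public static void main(String[] args) {
        LocatorModel locator10 = new LocatorModel("server11", "server12");
        LocatorModel locator11 = new LocatorModel("server11", "server12");

        // Only locator10 serves connection requests.
        for (int i = 0; i < 3; i++) {
            locator10.pickReceiver();
        }

        System.out.println("locator10 estimate for server11: " + locator10.estimate("server11"));
        System.out.println("locator10 estimate for server12: " + locator10.estimate("server12"));
        // locator11 never saw those requests, so it still believes both receivers are idle.
        System.out.println("locator11 estimate for server11: " + locator11.estimate("server11"));
    }
}
```

As long as one locator serves every request, its estimates stay close to
reality. The problem only appears when requests move to a locator whose map
does not reflect the existing connections.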
Start 2 clusters. Let's call site1 the sending site and site2 the receiving
site. The receiving site should have at least 2 locators. Both sites have 2
servers. No regions are needed.
Cluster-1 gfsh>list members
Member Count : 3

Name      | Id
--------- | -------------------------------------------------------------
locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
server11  | 10.0.2.15(server11:8358)<v1>:41003
server12  | 10.0.2.15(server12:8717)<v2>:41005

Cluster-2 gfsh>list members
Member Count : 4

Name      | Id
--------- | -------------------------------------------------------------
locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
server11  | 10.0.2.15(server11:8547)<v2>:41004
server12  | 10.0.2.15(server12:8908)<v3>:41006
Create GW receiver in Site2 on both servers.
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0            |
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
Create a GW sender in Site1 on both servers. Use 10 dispatcher threads for
easier observation.
Cluster-1 gfsh>list gateways
GatewaySender Section

GatewaySender Id | Member                             | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
---------------- | ---------------------------------- | ----------------- | -------- | --------------------- | ------------- | -----------------
senderTo2        | 10.0.2.15(server11:8358)<v1>:41003 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
senderTo2        | 10.0.2.15(server12:8717)<v2>:41005 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
Observe balance in GW receiver connections in Site2. It will be perfect.
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
12 connections each - 10 payload + 2 ping connections.
Now stop the GW receiver in one server of site2. In Site1, run a stop/start
gateway-sender command; all connections will go to the only remaining receiver
in site2 (as expected). Check it:
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
Now 22 in just one receiver - 20 payload + 1 ping from each sender.
Stop the GW sender in one server in Site1. The connection count in the GW
receiver drops to half the value (also expected).
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
Now 11 as one sender from Site1 is stopped.
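The counts seen so far (12, 22 and 11) follow one simple rule, sketched below.
The class and method names are made up for illustration, and the rule itself
(one ping connection per sender member connected to a receiver, on top of the
payload connections) is inferred from the numbers in this report rather than
taken from the Geode source.

```java
/** Hypothetical helper: expected "Sender Count" value for one GW receiver. */
class ReceiverConnectionCount {
    /**
     * @param payloadConnections dispatcher (payload) connections that landed on this receiver
     * @param connectedSenders   sender members with at least one connection to this receiver,
     *                           each assumed to hold exactly one ping connection
     */
    static int senderCount(int payloadConnections, int connectedSenders) {
        return payloadConnections + connectedSenders;
    }

    public static void main(String[] args) {
        // Balanced start: 20 payload connections split 10/10, both senders ping both receivers.
        System.out.println(senderCount(10, 2)); // 12
        // One receiver stopped: all 20 payload connections plus 1 ping from each sender.
        System.out.println(senderCount(20, 2)); // 22
        // One sender stopped: 10 payload connections plus 1 ping from the remaining sender.
        System.out.println(senderCount(10, 1)); // 11
    }
}
```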
Start the GW receiver in the server of site2 that was stopped before. It will
not receive new connections just yet.
Start the GW sender in the server of Site1 that was stopped before. All of its
connections will land in the receiver that was just restarted, so the balance
is restored.
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
There are 11 connections in each receiver because we have a perfect mapping of
server11 to server11 and server12 to server12 (i.e. there is just 1 ping
connection in each receiver). As expected, balance was achieved. Stop the GW
sender in the same server in Site1 again. Again, there are no connections in
the Site2 receiver we just started (expected).
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0            |
Now stop one locator in Site2, the one that was serving the GW senders (it was
locator10 in my case). Start the GW sender in that server of Site1 again. Check
the balance in the Site2 GW receivers:
Cluster-2 gfsh>list gateways
GatewayReceiver Section

Member                             | Port | Sender Count | Senders Connected
---------------------------------- | ---- | ------------ | -----------------
10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6            | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
As the printout above shows, connections are not balanced correctly (17 versus
6) when the connection request is sent to a new locator.
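The 17 / 6 split is what one would expect if the new locator starts from an
empty load map. The sketch below replays the last two sender restarts under
that assumption. It is a simplified model, not Geode code: the class name,
the tie-breaking rule (lower index wins) and the single ping connection per
reached receiver are all assumptions made for illustration.

```java
/** Hypothetical replay of a sender restart against a locator with a given load estimate. */
class FreshLocatorScenario {
    /** Hands out n connections, each to the receiver with the lowest estimate. */
    static int[] deal(int[] estimate, int n) {
        int[] dealt = new int[estimate.length];
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int r = 1; r < estimate.length; r++) {
                if (estimate[r] < estimate[best]) {
                    best = r; // strict comparison: ties go to the lower index (assumption)
                }
            }
            estimate[best]++;
            dealt[best]++;
        }
        return dealt;
    }

    /** Real connection counts after a sender with 10 dispatcher threads reconnects. */
    static int[] afterRestart(int[] actualBefore, int[] locatorEstimate) {
        int[] dealt = deal(locatorEstimate.clone(), 10);
        int[] actual = actualBefore.clone();
        for (int r = 0; r < actual.length; r++) {
            int ping = dealt[r] > 0 ? 1 : 0; // assumption: one ping per receiver the sender reaches
            actual[r] += dealt[r] + ping;
        }
        return actual;
    }

    public static void main(String[] args) {
        int[] actualBefore = {11, 0}; // server11 receiver, server12 receiver

        // Locator that handed out the existing connections: its estimate matches reality.
        int[] informed = afterRestart(actualBefore, new int[] {11, 0});
        System.out.println(informed[0] + " / " + informed[1]); // 11 / 11

        // Locator that never served a request: it believes both receivers are idle.
        int[] fresh = afterRestart(actualBefore, new int[] {0, 0});
        System.out.println(fresh[0] + " / " + fresh[1]); // 17 / 6
    }
}
```

With an informed estimate all 10 payload connections go to the idle receiver;
with a fresh estimate they are split 5 / 5 on top of the existing 11.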
> Gateway-receiver connection load maintained only on one locator
> ---------------------------------------------------------------
>
> Key: GEODE-10056
> URL: https://issues.apache.org/jira/browse/GEODE-10056
> Project: Geode
> Issue Type: Bug
> Reporter: Jakov Varenina
> Assignee: Jakov Varenina
> Priority: Major
> Labels: needsTriage
--
This message was sent by Atlassian Jira
(v8.20.1#820001)