Hi Barry,

Thank you for the reply and detailed analysis!


You are correct that I misunderstood the coordinator locator behavior; sorry about that. The client receives a list of locators in the RemoteLocatorJoinResponse message. It then walks that list from first to last until it successfully sends the ClientConnectionRequest message. So the client actually sends its connection requests to the first available locator, which doesn't have to be the coordinator.


Still, in normal conditions, all connection requests will be handled by the same locator (the first one in the list of locators).
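
To make the loop concrete, here is a minimal sketch of what I mean. It is not the actual Geode client code: the class name, the firstReachable method, the trySend callback and the locator addresses are all made up for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-in for the client-side loop; not Geode's real classes.
final class LocatorFailoverSketch {

  /**
   * Walks the locator list from first to last and returns the first locator
   * that accepts the request. trySend is an assumed callback that returns
   * true when the ClientConnectionRequest was delivered successfully.
   */
  static Optional<String> firstReachable(List<String> locators, Predicate<String> trySend) {
    for (String locator : locators) {
      if (trySend.test(locator)) {
        return Optional.of(locator);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> locators = List.of("locator1:10334", "locator2:10334", "locator3:10334");
    // All locators reachable: every request goes to the first entry.
    System.out.println(firstReachable(locators, l -> true));
    // First locator unreachable: requests move on to the second entry.
    System.out.println(firstReachable(locators, l -> !l.startsWith("locator1")));
  }
}
```

As long as the first locator in the list stays reachable, every request ends up there, whether or not it is the coordinator.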


Regarding the PR, I tried to align the gateway-receiver connection load handling with the client connection load handling on the locator. But I have encountered a race condition that I don't know how to solve, which I explained in this comment: https://github.com/apache/geode/pull/7378#issuecomment-1048513322


Thanks,

Jakov


On 12. 03. 2022. 02:33, Barry Oglesby wrote:
Jakov,

I'm not sure about the coordinator / non-coordinator behavior you're 
describing, but I see the other behavior. It doesn't seem quite right.

Here is a detailed explanation of what I see.

LocatorLoadSnapshot.getServerForConnection increments the LoadHolder's 
connections which updates its load. In the receiver's case, 
getServerForConnection is called by the remote sender to get a server to 
connect to (as you said using a ClientConnectionRequest).

Normally, the LoadHolder's load is updated when load is received in 
LocatorLoadSnapshot.updateMap (also as you said using a 
CacheServerLoadMessage). This doesn't happen in the receiver's case.

Locator behavior:

When the receiver connects, it sends its profile to the locator, which causes 
its LoadHolder to be added to the __recv__group map. It is not added to the 
null group map. That's the map for normal servers that have no group. Based on 
the current implementation, it can't be added to this map or it would be used 
for normal local (to the receiver) client connections.

LocatorLoadSnapshot.addServer location=192.168.1.5:5409; groups=[__recv__group]
LocatorLoadSnapshot.addGroups group=__recv__group; location=192.168.1.5:5409
LocatorLoadSnapshot.addGroups not adding to the null map group=__recv__group; 
location=192.168.1.5:5409

When the load is received for the receiver, it is ignored. updateMap gets the 
LoadHolder from the null group map. Since that map was not updated for the 
receiver, there is no entry for it (holder=null below).

LocatorLoadSnapshot.updateLoad about to update connectionLoadMap 
location=192.168.1.5:5409
LocatorLoadSnapshot.updateMap location=192.168.1.5:5409; load=0.0; 
loadPerConnection=0.00125
LocatorLoadSnapshot.updateMap location=192.168.1.5:5409; load=0.0; 
loadPerConnection=0.00125; holder=null
LocatorLoadSnapshot.updateLoad ignoring load location=192.168.1.5:5409
LocatorLoadSnapshot.updateLoad done update connectionLoadMap 
location=192.168.1.5:5409

So, load is not updated in this way for a receiver.
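
A simplified model of this lookup may help. It is only a sketch and not the real LocatorLoadSnapshot: the class, its fields and its methods are hypothetical, and it assumes that a server with a regular group is tracked in both its group map and the null group map.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the behavior in the logs; not the real LocatorLoadSnapshot.
final class LoadUpdateSketch {
  static final String RECEIVER_GROUP = "__recv__group";

  // Loads for the "null group" (servers usable by ordinary clients).
  private final Map<String, Double> nullGroupLoads = new HashMap<>();
  // Loads per named group: group name -> (server location -> load).
  private final Map<String, Map<String, Double>> groupLoads = new HashMap<>();

  void addServer(String location, String group) {
    if (group != null) {
      groupLoads.computeIfAbsent(group, g -> new HashMap<>()).put(location, 0.0);
    }
    // A receiver is deliberately kept out of the null group map so that
    // ordinary local clients are never directed to it.
    if (!RECEIVER_GROUP.equals(group)) {
      nullGroupLoads.put(location, 0.0);
    }
  }

  /** Returns true if the reported load was applied, false if it was ignored. */
  boolean updateLoad(String location, double load) {
    // The holder is looked up in the null group map only.
    if (!nullGroupLoads.containsKey(location)) {
      return false; // corresponds to holder=null in the logs
    }
    nullGroupLoads.put(location, load);
    for (Map<String, Double> loads : groupLoads.values()) {
      loads.computeIfPresent(location, (loc, old) -> load);
    }
    return true;
  }

  Double loadInGroup(String group, String location) {
    Map<String, Double> loads = groupLoads.get(group);
    return loads == null ? null : loads.get(location);
  }

  public static void main(String[] args) {
    LoadUpdateSketch snapshot = new LoadUpdateSketch();
    snapshot.addServer("192.168.1.5:5409", RECEIVER_GROUP);
    // The receiver's load report is dropped because it has no null group entry.
    System.out.println("applied=" + snapshot.updateLoad("192.168.1.5:5409", 0.5));
  }
}
```

In this model the receiver's entry in the __recv__group map keeps whatever load it had, no matter what the receiver reports.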

When a request for a remote receiver is received, it uses the __recv__group 
load to provide that server. It also increments the load to that server (in 
LoadHolder.incConnections). This is how load is updated for a receiver.

LocatorLoadSnapshot.getServerForConnection group=__recv__group
LocatorLoadSnapshot.getServerForConnection group=__recv__group; 
potentialServers={192.168.1.5:5409@192.168.1.5(ln-1:81083)<v1>:41002=LoadHolder[0.0,
 192.168.1.5:5409, loadPollInterval=5000, 0.00125]}
LoadHolder.incConnections location=192.168.1.5:5409; load=0.00125
LocatorLoadSnapshot.getServerForConnection group=__recv__group; 
usingServer=192.168.1.5:5409
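
The selection step can be sketched roughly like this. Again, this is an illustration under assumptions and not Geode's code: the Holder class and the method names are invented, and picking the least loaded entry is an assumption about how the server is chosen. The one thing taken from the logs is that each handed-out connection raises the load by loadPerConnection (0.00125 here).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of receiver selection on the locator; not Geode's real code.
final class ReceiverSelectionSketch {

  static final class Holder {
    final String location;
    final double loadPerConnection;
    double load;

    Holder(String location, double loadPerConnection) {
      this.location = location;
      this.loadPerConnection = loadPerConnection;
    }

    // Mirrors the effect of LoadHolder.incConnections seen in the logs.
    void incConnections() {
      load += loadPerConnection;
    }
  }

  // Stand-in for the __recv__group map: server location -> holder.
  private final Map<String, Holder> receivers = new HashMap<>();

  void addReceiver(String location, double loadPerConnection) {
    receivers.put(location, new Holder(location, loadPerConnection));
  }

  /** Picks the least loaded receiver (an assumption) and bumps its load. */
  Optional<String> getServerForConnection() {
    Optional<Holder> best =
        receivers.values().stream().min(Comparator.comparingDouble((Holder h) -> h.load));
    best.ifPresent(Holder::incConnections);
    return best.map(h -> h.location);
  }

  double loadOf(String location) {
    return receivers.get(location).load;
  }

  public static void main(String[] args) {
    ReceiverSelectionSketch locator = new ReceiverSelectionSketch();
    locator.addReceiver("192.168.1.5:5409", 0.00125);
    System.out.println(locator.getServerForConnection());
    System.out.println(locator.loadOf("192.168.1.5:5409"));
  }
}
```

Nothing in this model ever lowers the load, which matches the concern in the original question: the locator does not adjust the load when a receiver connection is closed.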

Receiver server behavior:

When a receiver gets a new connection, the ServerConnection.processHandShake 
updates the LoadMonitor. The LoadMonitor explicitly does not update the 
connection count because isClientOperations=false, so the load is never changed 
on the server. This is interesting behavior. I'm not sure if it wasn't updated 
because it was unnecessary given the behavior of the locator above.

ServerConnection.processHandShake about to update the LoadMonitor 
communicationMode=gateway
LoadMonitor.connectionOpened isClientOperations=false; 
type=GatewayReceiverStatistics
LoadMonitor.connectionOpened did not increment connectionCount=0; 
type=GatewayReceiverStatistics
ServerConnection.processHandShake done update the LoadMonitor

The LoadMonitor on the server does send the load periodically to the locator but 
only because skippedLoadUpdates>forceUpdateFrequency (which is 10 times through 
the polling loop by default):

PollingThread.run got load type=GatewayReceiverStatistics; load=Load(0.0, 
0.00125, 0.0, 1.0)
PollingThread.run forceUpdateFrequency=true
PollingThread.run about to send CacheServerLoadMessage 
type=GatewayReceiverStatistics; load=Load(0.0, 0.00125, 0.0, 1.0)

So, the load sent by the server is not accurate and it's ignored in the locator.
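
The polling decision can be modeled with a small sketch. It is not the real LoadMonitor: the class and method names are hypothetical, and the rule that a changed load is sent immediately is my assumption. Only the skippedLoadUpdates>forceUpdateFrequency condition comes from the behavior above.

```java
// Hypothetical model of the polling decision; not Geode's real LoadMonitor.
final class LoadPollSketch {
  private final int forceUpdateFrequency;
  private double lastSentLoad;
  private int skippedLoadUpdates;

  LoadPollSketch(int forceUpdateFrequency, double initialLoad) {
    this.forceUpdateFrequency = forceUpdateFrequency;
    this.lastSentLoad = initialLoad;
  }

  /** Returns true if this poll would send a CacheServerLoadMessage. */
  boolean poll(double currentLoad) {
    // Assumption: a changed load is always sent right away.
    boolean changed = Double.compare(currentLoad, lastSentLoad) != 0;
    if (changed || skippedLoadUpdates > forceUpdateFrequency) {
      lastSentLoad = currentLoad;
      skippedLoadUpdates = 0;
      return true;
    }
    skippedLoadUpdates++;
    return false;
  }

  public static void main(String[] args) {
    // The receiver's load never changes, so only forced updates are sent.
    LoadPollSketch monitor = new LoadPollSketch(10, 0.0);
    for (int i = 1; i <= 24; i++) {
      if (monitor.poll(0.0)) {
        System.out.println("forced update on poll " + i);
      }
    }
  }
}
```

Because the receiver's connection count is never incremented, the load value stays the same on every poll, so in this model the forced update is the only path that ever sends a message.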

I haven't had a chance to look at your PR yet.

Barry
________________________________
From: Jakov Varenina <jakov.varen...@est.tech>
Sent: Thursday, March 10, 2022 1:21 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Question related to gateway-receivers connection load balancing

Hi devs,

We have observed some weird behavior related to load balancing of
gateway-receiver connections in the geode cluster. Currently, the
gateway-receiver connection load is only updated on the coordinator
locator, when it provides a server location to the remote gateway-sender
in the
ClientConnectionRequest{group=__recv__group...}/ClientConnectionResponse
message exchange. Other locators never update the gateway-receiver
connection load, since they are not handling these messages.
Additionally, locators (including the coordinator) ignore
CacheServerLoadMessage messages that are carrying the receiver's
connection load. This means that locators will not adjust the load when
the connection on some receiver is shut down.

Is this expected behavior, or is this a bug?

You can find more information in this PR:

https://github.com/apache/geode/pull/7378#issuecomment-1048513322

and ticket:

https://issues.apache.org/jira/browse/GEODE-10056

Thanks,

Jakov
