[ https://issues.apache.org/jira/browse/GEODE-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233977#comment-17233977 ]
ASF subversion and git services commented on GEODE-8553: -------------------------------------------------------- Commit 72455cd0811bf793a8680b839bbc3ed957ad1830 in geode-native's branch refs/heads/develop from Mario Salazar de Torres [ https://gitbox.apache.org/repos/asf?p=geode-native.git;h=72455cd ] GEODE-8553: Fix inter-locks in ThinClientLocator (#660) - ThinClientLocatorHelper uses a mutex called m_locatorLock, which is used in the entire scope of all the public member functions for this class. - Problem with that is all of those functions perform some network communication and in the event of networking issues, this will result in having all the application threads inter-locked. - Also, the only purpose of this mutex is to avoid that the locators list is read while it's being updated. - Given all of the above, the class has been refactored, so every time the locators list has to be accessed, a copy of it is created, being that copy created only while owning the mutex. - And for the case of the update, a new locators list is composed and its only at the end of the update function where the mutex is locked and the locators list swapped by the new one. - This whole change ensures that the time the mutex is in use is minimal while the functionality remains intact. - Refactored ConnectionWrapper so is move constructible. - Locators list is now initialized using iterator range constructor. - Added getConnRetries utility function. - Changed mutex for a shared mutex. - getLocators function now returns a shuffled list each time. - Cleaned-up repeated code. As a result of this createConnection has been refactored and a new function called sendRequest has been created. - Implemented proposal for locator connections retries, treating the locators list as a circular buffer. - Added new integration test to verify that threads don't inter-lock anymore. - The test works as follow: * Set 5s connect-timeout. * Deploy 1 locator. * Add 1 non-existent locator to the pool. * Send 64 concurrent requests to get the server list. * By probability if there are no interlocks ~50%+-E% of the requests should have completed in a brief period of time. * Note that E% is the error allowance percentage, which is set to 10%. * Therefore a 10 seconds timeout is set in order to reach the request completed treshold. * Test passes if the number of completed requests before the timeout was greater or equal to the thresholds, otherwise it fails. > Reduce ThinClientLocatorHelper lock time > ---------------------------------------- > > Key: GEODE-8553 > URL: https://issues.apache.org/jira/browse/GEODE-8553 > Project: Geode > Issue Type: Improvement > Components: native client > Affects Versions: 1.12.0, 1.13.0 > Reporter: Mario Salazar de Torres > Assignee: Mario Salazar de Torres > Priority: Major > Labels: pull-request-available > > During my troublshootings, I've seen that locking m_locatorLock for the whole > scope of the class function members might cause some inter-locks. > Problem here and in many other places of the NC is that networking operations > are performed while a mutex is locked. Therefore if *thread A* takes longer > than expected in performing its network operation, it might block another one > which does not requires any resource of the first *thread A*. Hence, the > inter-lock. > This improvement is the first one of a series regarding to lock scope > reduction when it comes with code regarding networking in NC. > -- This message was sent by Atlassian Jira (v8.3.4#803005)