+1 Thanks for the details!
> On Aug 26, 2019, at 3:33 PM, Ryan McMahon <rmcma...@pivotal.io> wrote: > > Udo, > > Here are inline answers to your questions: > > *Is this an existing issue?* > > Short answer - yes, but it has never been in a release version of Geode. > The leak was introduced as part of some changes to address handling > multiple concurrent registration requests for a given client on a single > server. The issue is only seen if client registration fails which is not > particularly common, which is why we are only seeing it now after months of > testing. The commit for that was introduced here on April 30th. > https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8 > The ConcurrentModificationException issue (which ultimately causes the > registration to fail) was introduced on April 22nd here: > https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f > > > *Why is it more critical to squeeze it into an existing (almost > release) version of Apache Geode?* > > Not sure I totally understand this question, but it is critical because the > leak can cause servers to crash due to OOM. Again, because the problems > were introduced in late April and we haven't released Geode since then, so > I think it is very important to merge these fixes into 1.10.0 before we > release. > > > > *What guarantees do we have that this fix makes the application more stable > compared to adding another hidden issue, which we will discover in another > few weeks from now?* > > I added numerous tests for this scenario to ensure that the leak would > never happen regardless of the cause of a failed client registration. > There obviously is no way to 100% guarantee that there will be no issues > that arise from the fixes themselves, but our existing test coverage plus > the new tests I wrote should give us the confidence we need. > > Thanks, > Ryan > > On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote: > >> In order to better understand this request: >> >> Is this an existing issue? >> >> Why is it more critical to squeeze it into an existing (almost release) >> version of Apache Geode? >> >> What guarantees do we have that this fix makes the application more >> stable compared to adding another hidden issue, which we will discover >> in another few weeks from now? >> >> >> --Udo >> >> On 8/26/19 3:10 PM, Ryan McMahon wrote: >>> Hi all, >>> >>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the >>> 1.10.0 release branch. The two JIRAs are related to the same root >> problem, >>> which I would classify as critical. We discovered a case where a failed >>> client registration could lead to a memory leak in a server, eventually >>> causing the server to crash due to lack of memory. >>> >>> The issue is instigated by a ConcurrentModificationException due to >>> iteration of a non-thread safe collection while it is being mutated >>> (GEODE-7088). This exception occurs when the client's queue image is >> being >>> copied from one server to the next during client registration, and it >>> causes the client's registration to fail. The client would likely >> succeed >>> if it retried registration with that same server, but if it registers >> with >>> a different server, we end up leaking events to the client's registration >>> queue on the original server (GEODE-7089). >>> >>> The fix for GEODE-7088 is to use thread-safe collections for interested >>> clients in client update messages. The fix for GEODE-7089 is to always >>> drain and remove the registration queue regardless of success or failure. >>> Together, these fixes prevent the failed registrations and memory leak. >>> >>> The SHAs for the fixes and tests in develop are: >>> >>> GEODE-7088 >>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb >>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d >>> >>> GEODE-7089 >>> - 5d0153ad4adb1612a1083673f98b1982819a6589 >>> >>> This proposal is to cherry-pick these commits to 1.10.0 release branch. >>> >>> Thanks, >>> Ryan McMahon >>> >>