Udo, Here are inline answers to your questions:
*Is this an existing issue?* Short answer - yes, but it has never been in a release version of Geode. The leak was introduced as part of some changes to address handling multiple concurrent registration requests for a given client on a single server. The issue is only seen if client registration fails which is not particularly common, which is why we are only seeing it now after months of testing. The commit for that was introduced here on April 30th. https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8 The ConcurrentModificationException issue (which ultimately causes the registration to fail) was introduced on April 22nd here: https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f *Why is it more critical to squeeze it into an existing (almost release) version of Apache Geode?* Not sure I totally understand this question, but it is critical because the leak can cause servers to crash due to OOM. Again, because the problems were introduced in late April and we haven't released Geode since then, so I think it is very important to merge these fixes into 1.10.0 before we release. *What guarantees do we have that this fix makes the application more stable compared to adding another hidden issue, which we will discover in another few weeks from now?* I added numerous tests for this scenario to ensure that the leak would never happen regardless of the cause of a failed client registration. There obviously is no way to 100% guarantee that there will be no issues that arise from the fixes themselves, but our existing test coverage plus the new tests I wrote should give us the confidence we need. Thanks, Ryan On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote: > In order to better understand this request: > > Is this an existing issue? > > Why is it more critical to squeeze it into an existing (almost release) > version of Apache Geode? > > What guarantees do we have that this fix makes the application more > stable compared to adding another hidden issue, which we will discover > in another few weeks from now? > > > --Udo > > On 8/26/19 3:10 PM, Ryan McMahon wrote: > > Hi all, > > > > I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the > > 1.10.0 release branch. The two JIRAs are related to the same root > problem, > > which I would classify as critical. We discovered a case where a failed > > client registration could lead to a memory leak in a server, eventually > > causing the server to crash due to lack of memory. > > > > The issue is instigated by a ConcurrentModificationException due to > > iteration of a non-thread safe collection while it is being mutated > > (GEODE-7088). This exception occurs when the client's queue image is > being > > copied from one server to the next during client registration, and it > > causes the client's registration to fail. The client would likely > succeed > > if it retried registration with that same server, but if it registers > with > > a different server, we end up leaking events to the client's registration > > queue on the original server (GEODE-7089). > > > > The fix for GEODE-7088 is to use thread-safe collections for interested > > clients in client update messages. The fix for GEODE-7089 is to always > > drain and remove the registration queue regardless of success or failure. > > Together, these fixes prevent the failed registrations and memory leak. > > > > The SHAs for the fixes and tests in develop are: > > > > GEODE-7088 > > - 174af1d23fb7e09eb2bc2fa55479df854850fadb > > - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d > > > > GEODE-7089 > > - 5d0153ad4adb1612a1083673f98b1982819a6589 > > > > This proposal is to cherry-pick these commits to 1.10.0 release branch. > > > > Thanks, > > Ryan McMahon > > >