The cherry-pick for GEODE-7088 is clean so I didn't open a PR for that one. The cherry-pick for GEODE-7089 required manual merging due to several unrelated stats changes added to develop recently. The PR to merge that one into release/1.10.0 is here: https://github.com/apache/geode/pull/3976
The original PR for this change when it was added to develop is here: https://github.com/apache/geode/pull/3929 Thanks all for reading and considering. Ryan On Mon, Aug 26, 2019 at 4:22 PM Udo Kohlmeyer <u...@apache.com> wrote: > Thank you Ryan, > > +1 for inclusion > > On 8/26/19 3:33 PM, Ryan McMahon wrote: > > Udo, > > > > Here are inline answers to your questions: > > > > *Is this an existing issue?* > > > > Short answer - yes, but it has never been in a release version of Geode. > > The leak was introduced as part of some changes to address handling > > multiple concurrent registration requests for a given client on a single > > server. The issue is only seen if client registration fails which is not > > particularly common, which is why we are only seeing it now after months > of > > testing. The commit for that was introduced here on April 30th. > > > https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8 > > The ConcurrentModificationException issue (which ultimately causes the > > registration to fail) was introduced on April 22nd here: > > > https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f > > > > > > *Why is it more critical to squeeze it into an existing (almost > > release) version of Apache Geode?* > > > > Not sure I totally understand this question, but it is critical because > the > > leak can cause servers to crash due to OOM. Again, because the problems > > were introduced in late April and we haven't released Geode since then, > so > > I think it is very important to merge these fixes into 1.10.0 before we > > release. > > > > > > > > *What guarantees do we have that this fix makes the application more > stable > > compared to adding another hidden issue, which we will discover in > another > > few weeks from now?* > > > > I added numerous tests for this scenario to ensure that the leak would > > never happen regardless of the cause of a failed client registration. > > There obviously is no way to 100% guarantee that there will be no issues > > that arise from the fixes themselves, but our existing test coverage plus > > the new tests I wrote should give us the confidence we need. > > > > Thanks, > > Ryan > > > > On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote: > > > >> In order to better understand this request: > >> > >> Is this an existing issue? > >> > >> Why is it more critical to squeeze it into an existing (almost release) > >> version of Apache Geode? > >> > >> What guarantees do we have that this fix makes the application more > >> stable compared to adding another hidden issue, which we will discover > >> in another few weeks from now? > >> > >> > >> --Udo > >> > >> On 8/26/19 3:10 PM, Ryan McMahon wrote: > >>> Hi all, > >>> > >>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the > >>> 1.10.0 release branch. The two JIRAs are related to the same root > >> problem, > >>> which I would classify as critical. We discovered a case where a > failed > >>> client registration could lead to a memory leak in a server, eventually > >>> causing the server to crash due to lack of memory. > >>> > >>> The issue is instigated by a ConcurrentModificationException due to > >>> iteration of a non-thread safe collection while it is being mutated > >>> (GEODE-7088). This exception occurs when the client's queue image is > >> being > >>> copied from one server to the next during client registration, and it > >>> causes the client's registration to fail. The client would likely > >> succeed > >>> if it retried registration with that same server, but if it registers > >> with > >>> a different server, we end up leaking events to the client's > registration > >>> queue on the original server (GEODE-7089). > >>> > >>> The fix for GEODE-7088 is to use thread-safe collections for interested > >>> clients in client update messages. The fix for GEODE-7089 is to always > >>> drain and remove the registration queue regardless of success or > failure. > >>> Together, these fixes prevent the failed registrations and memory leak. > >>> > >>> The SHAs for the fixes and tests in develop are: > >>> > >>> GEODE-7088 > >>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb > >>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d > >>> > >>> GEODE-7089 > >>> - 5d0153ad4adb1612a1083673f98b1982819a6589 > >>> > >>> This proposal is to cherry-pick these commits to 1.10.0 release branch. > >>> > >>> Thanks, > >>> Ryan McMahon > >>> >