+1

Thanks for the details!

> On Aug 26, 2019, at 3:33 PM, Ryan McMahon <rmcma...@pivotal.io> wrote:
> 
> Udo,
> 
> Here are inline answers to your questions:
> 
> *Is this an existing issue?*
> 
> Short answer - yes, but it has never been in a release version of Geode.
> The leak was introduced as part of some changes to address handling
> multiple concurrent registration requests for a given client on a single
> server.  The issue is only seen if client registration fails which is not
> particularly common, which is why we are only seeing it now after months of
> testing.  The commit for that was introduced here on April 30th.
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> The ConcurrentModificationException issue (which ultimately causes the
> registration to fail) was introduced on April 22nd here:
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
> 
> 
> *Why is it more critical to squeeze it into an existing (almost
> release) version of Apache Geode?*
> 
> Not sure I totally understand this question, but it is critical because the
> leak can cause servers to crash due to OOM.  Again, because the problems
> were introduced in late April and we haven't released Geode since then, so
> I think it is very important to merge these fixes into 1.10.0 before we
> release.
> 
> 
> 
> *What guarantees do we have that this fix makes the application more stable
> compared to adding another hidden issue, which we will discover in another
> few weeks from now?*
> 
> I added numerous tests for this scenario to ensure that the leak would
> never happen regardless of the cause of a failed client registration.
> There obviously is no way to 100% guarantee that there will be no issues
> that arise from the fixes themselves, but our existing test coverage plus
> the new tests I wrote should give us the confidence we need.
> 
> Thanks,
> Ryan
> 
> On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <u...@apache.com> wrote:
> 
>> In order to better understand this request:
>> 
>> Is this an existing issue?
>> 
>> Why is it more critical to squeeze it into an existing (almost release)
>> version of Apache Geode?
>> 
>> What guarantees do we have that this fix makes the application more
>> stable compared to adding another hidden issue, which we will discover
>> in another few weeks from now?
>> 
>> 
>> --Udo
>> 
>> On 8/26/19 3:10 PM, Ryan McMahon wrote:
>>> Hi all,
>>> 
>>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
>>> 1.10.0 release branch.  The two JIRAs are related to the same root
>> problem,
>>> which I would classify as critical.  We discovered a case where a failed
>>> client registration could lead to a memory leak in a server, eventually
>>> causing the server to crash due to lack of memory.
>>> 
>>> The issue is instigated by a ConcurrentModificationException due to
>>> iteration of a non-thread safe collection while it is being mutated
>>> (GEODE-7088).  This exception occurs when the client's queue image is
>> being
>>> copied from one server to the next during client registration, and it
>>> causes the client's registration to fail.  The client would likely
>> succeed
>>> if it retried registration with that same server, but if it registers
>> with
>>> a different server, we end up leaking events to the client's registration
>>> queue on the original server (GEODE-7089).
>>> 
>>> The fix for GEODE-7088 is to use thread-safe collections for interested
>>> clients in client update messages.  The fix for GEODE-7089 is to always
>>> drain and remove the registration queue regardless of success or failure.
>>> Together, these fixes prevent the failed registrations and memory leak.
>>> 
>>> The SHAs for the fixes and tests in develop are:
>>> 
>>> GEODE-7088
>>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
>>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
>>> 
>>> GEODE-7089
>>> - 5d0153ad4adb1612a1083673f98b1982819a6589
>>> 
>>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
>>> 
>>> Thanks,
>>> Ryan McMahon
>>> 
>> 

Reply via email to