[ 
https://issues.apache.org/jira/browse/IGNITE-28395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-28395:
----------------------------------
    Description: 
Lease Updater fires invoke to Meta storage asynchronously every 500ms without 
waiting for the previous one to complete. This causes multiple concurrent 
invocations with the same expected lease state — only one wins the CAS, the 
rest fail with
{code:java}
Lease update invocation failed because of outdated lease data on this node{code}
As a result, roughly once per minute the lease expires before renewal.

Simply reading fresh data from storage before each invoke does not help: 
previous invocations are already in-flight and will complete after the read, 
making the freshly-read state outdated by the time the new invoke reaches 
storage.

*Fix*
Track the in-flight invoke as a future. On each tick, if the previous future is 
not complete, block with `future.get(timeout)` before reading from the lease 
tracker and firing the next invoke. This guarantees that at most one invoke is 
in flight at any time and that the lease state is read only after the previous 
update has landed. The timeout should be around leaseInterval/2: after that, 
the leases will most likely expire anyway.

Also, there may be a lag between future completion and the lease map update in 
the lease tracker, so the lease map may still be stale. To avoid this, a 
successful invoke can return the written leases itself. If the invoke fails, 
the map from the lease tracker should be used.
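
The fix above could be sketched roughly as follows. This is only an 
illustration, not the actual Ignite code: the class SerializedLeaseUpdater, 
the InvokeResult type, the Map<String, Long> lease map and the two functional 
stand-ins for the lease tracker and the Meta storage invoke are all 
hypothetical names. Skipping the tick when the wait times out is also an 
assumption; the description only says to block with a timeout.

```java
import java.util.Map;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: keeps at most one lease-update invoke in flight. */
class SerializedLeaseUpdater {
    /** Hypothetical invoke result: CAS outcome plus the leases that were written. */
    static final class InvokeResult {
        final boolean success;
        final Map<String, Long> writtenLeases;

        InvokeResult(boolean success, Map<String, Long> writtenLeases) {
            this.success = success;
            this.writtenLeases = writtenLeases;
        }
    }

    /** Stand-in for reading the current lease map from the lease tracker. */
    private final Supplier<Map<String, Long>> leaseTracker;

    /** Stand-in for the asynchronous Meta storage invoke; takes the expected leases. */
    private final Function<Map<String, Long>, CompletableFuture<InvokeResult>> invoke;

    /** How long a tick waits for the previous invoke, e.g. about leaseInterval/2. */
    private final long waitTimeoutMillis;

    /** The previous invoke; starts as an already completed future with no result. */
    private CompletableFuture<InvokeResult> inFlight = CompletableFuture.completedFuture(null);

    SerializedLeaseUpdater(
            Supplier<Map<String, Long>> leaseTracker,
            Function<Map<String, Long>, CompletableFuture<InvokeResult>> invoke,
            long waitTimeoutMillis) {
        this.leaseTracker = leaseTracker;
        this.invoke = invoke;
        this.waitTimeoutMillis = waitTimeoutMillis;
    }

    /**
     * One updater tick. Returns false if no new invoke was fired because the
     * previous one was still pending after the timeout (assumed behavior).
     */
    synchronized boolean tick() {
        InvokeResult previous = null;
        try {
            // Block until the previous invoke has landed, but not forever.
            previous = inFlight.get(waitTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // still in flight: do not stack another invoke on top
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | CancellationException e) {
            // The previous invoke failed: fall back to the lease tracker below.
        }

        // Prefer the leases returned by a successful invoke, since the lease
        // tracker map may lag behind; otherwise read the map from the tracker.
        Map<String, Long> expected = (previous != null && previous.success)
                ? previous.writtenLeases
                : leaseTracker.get();

        inFlight = invoke.apply(expected);
        return true;
    }
}
```

With a scheduler calling tick() every 500ms, a slow invoke delays the next 
one instead of racing with it, so each CAS is built on the result of the 
previous one.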

  was:
Lease Updater fires invoke to Meta storage asynchronously every 500ms without 
waiting for the previous one to complete. This causes multiple concurrent 
invocations with the same expected lease state — only one wins the CAS, the 
rest fail with
{code:java}
Lease update invocation failed because of outdated lease data on this node{code}
As a result, roughly once per minute the lease expires before renewal.

Simply reading fresh data from storage before each invoke does not help: 
previous invocations are already in-flight and will complete after the read, 
making the freshly-read state outdated by the time the new invoke reaches 
storage.

*Fix*
Track in-flight invoke as a future. On each tick, if the previous future is not 
complete — block with `future.get(timeout)` before reading from lease tracker 
and firing the next invoke. This guarantees at most one in-flight invoke at any 
time and that the lease state is read only after the previous update has 
landed. Timeout should be around leaseInterval/2 - after that, the leases most 
likely will expire anyway.


> Lease updater accumulates concurrent in-flight invocations causing constant 
> CAS failures
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-28395
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28395
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> Lease Updater fires invoke to Meta storage asynchronously every 500ms without 
> waiting for the previous one to complete. This causes multiple concurrent 
> invocations with the same expected lease state — only one wins the CAS, the 
> rest fail with
> {code:java}
> Lease update invocation failed because of outdated lease data on this node{code}
> As a result, roughly once per minute the lease expires before renewal.
> Simply reading fresh data from storage before each invoke does not help: 
> previous invocations are already in-flight and will complete after the read, 
> making the freshly-read state outdated by the time the new invoke reaches 
> storage.
> *Fix*
> Track the in-flight invoke as a future. On each tick, if the previous future 
> is not complete, block with `future.get(timeout)` before reading from the 
> lease tracker and firing the next invoke. This guarantees that at most one 
> invoke is in flight at any time and that the lease state is read only after 
> the previous update has landed. The timeout should be around 
> leaseInterval/2: after that, the leases will most likely expire anyway.
> Also, there may be a lag between future completion and the lease map update 
> in the lease tracker, so the lease map may still be stale. To avoid this, a 
> successful invoke can return the written leases itself. If the invoke fails, 
> the map from the lease tracker should be used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
