[ 
https://issues.apache.org/jira/browse/RATIS-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072990#comment-18072990
 ] 

Ao Li commented on RATIS-2485:
------------------------------

Hi [~szetszwo] sorry for the delay.

My work was detoured by another test failure when I was implementing the patch.

I saw the following failure while running the test: 
`org.apache.ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc#testReconfTwice`

[^log.txt]

It seems like a race condition, but I’m not very familiar with the codebase and 
haven’t been able to debug it. Let me know if you want me to create a separate 
issue to track this.

I’m currently working on a controlled concurrency testing framework, Fray 
([https://github.com/cmu-pasta/fray]), which helps developers debug issues like 
this. Using Fray, I was able to reliably identify the bug. I’ve also prepared a 
patch that integrates Fray with Ratis 
([https://github.com/aoli-al/ratis/commits/push-vqulnrpzutzs/]).

You can try the patch and observe the test failure locally. With the patch 
applied, run:
`./mvnw 
-Dtest=org.apache.ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc#testReconfTwice
 test`
to reproduce the issue (add -Dmaven.surefire.debug if you want to start the 
debugger). Note that deterministic replay is not fully supported yet, as Fray 
does not currently handle sources of nondeterminism such as randomness and time.

Please let me know if you need any additional information, or if you’d be 
interested in using Fray to test Ratis more broadly.

Thanks again for maintaining such a great project. I’d be excited to help make 
it even more robust.

> Improve the IllegalStateException message from LeaderStateImpl.updateCommit
> ---------------------------------------------------------------------------
>
>                 Key: RATIS-2485
>                 URL: https://issues.apache.org/jira/browse/RATIS-2485
>             Project: Ratis
>          Issue Type: Improvement
>          Components: Leader
>            Reporter: Ao Li
>            Assignee: Ao Li
>            Priority: Major
>         Attachments: log.txt
>
>
> I saw the following exception:
> {code:java}
> java.lang.IllegalStateException: n1@group-ABB3109A44C1-SegmentedRaftLog is 
> expected to be opened but it is CLOSED
>       at 
> org.apache.ratis.util.OpenCloseState.assertOpen(OpenCloseState.java:63)
>       at 
> org.apache.ratis.server.raftlog.RaftLogBase.checkLogState(RaftLogBase.java:112)
>       at 
> org.apache.ratis.server.raftlog.RaftLogBase.appendMetadataImpl(RaftLogBase.java:220)
>       at 
> org.apache.ratis.server.raftlog.RaftLogBase.lambda$appendMetadata$4(RaftLogBase.java:216)
>       at 
> org.apache.ratis.server.raftlog.RaftLogSequentialOps$Runner.runSequentially(RaftLogSequentialOps.java:69)
>       at 
> org.apache.ratis.server.raftlog.RaftLogBase.appendMetadata(RaftLogBase.java:216)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl.logMetadata(LeaderStateImpl.java:982)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl.updateCommit(LeaderStateImpl.java:960)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl.updateCommit(LeaderStateImpl.java:975)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl.lambda$updateCommit$18(LeaderStateImpl.java:902)
>       at java.base/java.util.Optional.ifPresent(Optional.java:178)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl.updateCommit(LeaderStateImpl.java:902)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl$StateUpdateEvent.execute(LeaderStateImpl.java:123)
>       at 
> org.apache.ratis.server.impl.LeaderStateImpl$EventProcessor.run(LeaderStateImpl.java:767)
> Caused by: org.apache.ratis.util.OpenCloseState$CloseTrace: Close 
> n1@group-ABB3109A44C1-SegmentedRaftLog
>       at 
> org.apache.ratis.util.OpenCloseState.lambda$close$1(OpenCloseState.java:109)
>       at 
> java.base/java.util.concurrent.atomic.AtomicReference.getAndUpdate(AtomicReference.java:184)
>       at org.apache.ratis.util.OpenCloseState.close(OpenCloseState.java:109)
>       at 
> org.apache.ratis.server.raftlog.RaftLogBase.close(RaftLogBase.java:380)
>       at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.close(SegmentedRaftLog.java:528)
>       at org.apache.ratis.server.impl.ServerState.close(ServerState.java:434)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$close$1(RaftServerImpl.java:535)
>       at 
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$7(LifeCycle.java:306)
>       at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:326)
>       at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:304)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.close(RaftServerImpl.java:512)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy$ImplMap.close(RaftServerProxy.java:136)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy$ImplMap.lambda$close$0(RaftServerProxy.java:123)
>       at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:203)
>       at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:182)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
>       at java.base/java.lang.Thread.run(Thread.java:1474)
> {code}
> with the following scenario:
> Thread 1:
> Leader shutdown -> SegmentedRaftLog.close()
> Thread 2:
> LeaderStateImpl$EventProcessor.run() -> StateUpdateEvent.execute() -> 
> updateCommit() -> logMetadata() -> RaftLogBase.appendMetadata()
> I did some searching before creating this issue, but I didn’t find any 
> mention that the exception can be thrown in the updateCommit() method.
> Sorry in advance if this is a dup or expected behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to