void-ptr974 commented on code in PR #25795:
URL: https://github.com/apache/pulsar/pull/25795#discussion_r3357838051


##########
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java:
##########
@@ -1749,8 +1767,11 @@ public void operationComplete(Void v, Stat stat) {
                     synchronized (ManagedLedgerImpl.this) {
                         try {
                             State state = 
STATE_UPDATER.get(ManagedLedgerImpl.this);
-                            if (state == State.Closed || state.isFenced()) {
-                                log.debug().log("skip ledger update after 
create complete ledger is closed or fenced");
+                            if (state == State.Closed || state == 
State.Terminated || state.isFenced()) {

Review Comment:
   Thanks for the detailed review. I agree the terminate-vs-rollover metadata 
race is important.
   
   For this PR, I would like to keep the scope focused on the immediate 
terminated-state and pending-add behavior: once termination has taken effect, 
late callbacks should not move the managed ledger back to a writable state, and 
pending add callbacks should complete consistently.
   
   The metadata race is a different path and the fix is more involved. A 
focused follow-up PR would need to handle the terminate metadata update 
together with the `metadataMutex`-serialized metadata update path, including:
   
   - stale rollover metadata success after termination, where we may need 
metadata cleanup and unused BookKeeper ledger cleanup;
   - stale rollover metadata failure after termination wins, where 
`BadVersionException` should not fence a ledger that is already 
terminating/terminated.
   
   Mixing that into this PR would make the review much harder because it 
combines state/callback behavior with metadata serialization and ledger 
cleanup. My preference is to keep this PR focused, then send a separate PR for 
the `metadataMutex` / stale rollover metadata path.
   
   
   
   
   I agree that a separate `Terminating` state would be a cleaner long-term 
model.
   
   The distinction I have in mind is:
   
   - `Terminating`: termination has started on this broker. New writes and new 
rollovers should be rejected, late callbacks should not move the ledger back to 
writable, and existing in-flight adds should either drain or fail 
deterministically.
   - `Terminated`: the terminate metadata update has succeeded and 
`terminatedPosition` is durable.
   
   With this model, `Terminating` does not need to be the durable state. If the 
broker crashes before the metadata update succeeds, recovery can treat the 
ledger as non-terminated and the caller can retry termination. If the metadata 
update succeeds, recovery observes `terminatedPosition` and restores 
`Terminated`.
   
   This would make the state machine easier to reason about, but it still needs 
a clear design for how it interacts with rollover callbacks, pending add 
entries, metadata updates, and `BadVersionException` handling. It also would 
not replace the need to fix the `metadataMutex` race; it mainly gives us a 
clearer lifecycle boundary so those cases can be handled consistently.
   
   I think this is worth doing as a follow-up design/change.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to