[ 
https://issues.apache.org/jira/browse/HBASE-29975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-29975 started by Andrew Kyle Purtell.
---------------------------------------------------
> TLA+ specification of the AssignmentManager
> -------------------------------------------
>
>                 Key: HBASE-29975
>                 URL: https://issues.apache.org/jira/browse/HBASE-29975
>             Project: HBase
>          Issue Type: Task
>          Components: master, proc-v2, Region Assignment
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>
> TLA+ is a formal specification language for designing and verifying 
> concurrent and distributed systems. The name stands for the "Temporal Logic 
> of Actions", a mathematical framework that combines first-order logic with 
> temporal operators to reason about how system state evolves over time. When 
> the model is a high-fidelity representation of the real system, proposed 
> design and architectural changes can be checked against the full space of 
> possible executions, surfacing logic bugs at design time, before any code is 
> written. This can save weeks or months of development effort that would 
> otherwise be spent discovering and debugging subtle concurrency issues in a 
> running system.
> A TLA+ specification describes a system as a state machine, an initial state 
> predicate (Init), a next-state relation (Next) that defines every legal 
> transition, and a collection of invariants, properties that must hold in 
> every reachable state. The TLC model checker then systematically explores 
> every possible execution of this state machine, checking each property at 
> every state. If a property is violated, TLC produces a minimal counterexample 
> trace showing the exact sequence of steps that led to the failure.
> The HBase AssignmentManager is a core component of the HBase master that 
> manages the lifecycle of regions across a cluster of RegionServers. It 
> coordinates region assignment, unassignment, moves, and reopens, handles 
> RegionServer crashes through the ServerCrashProcedure (SCP), and recovers its 
> own state after a master crash through a durable procedure store. The 
> correctness of these interactions is critical.
> This TLA+ specification models the AssignmentManager as a state machine 
> capturing:
> * Region lifecycle
> * Asynchronous RPC channels: Master-to-RS commands and RS-to-master 
> transition reports
> * Procedure state: State records (type, step, target  server, retry count), 
> with a durable procedure store that survives master crashes
> * Server liveness: Per-server online/crashed state, ZooKeeper ephemeral 
> nodes, and WAL fencing state
> * Crash recovery: Multi-step ServerCrashProcedure 
> * PEWorker thread pool
> The specification defines safety invariants verified at every reachable 
> state, including NoDoubleAssignment (no region writable on two servers), 
> MetaConsistency (persistent and in-memory state agree), FencingOrder (WALs 
> fenced before reassignment), NoLostRegions (no region stuck without a 
> procedure after crash recovery), and NoPEWorkerDeadlock (thread pool 
> exhaustion detection). The liveness property (MetaEventuallyAssigned) 
> verifies that hbase:meta is eventually reassigned after a crash. Action 
> constraints enforce transition validity and SCP monotonicity.
> The model is under active development and can found at:
> https://github.com/apurtell/hbase/tree/WORK-architecture/src/main/spec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to