J-HowHuang opened a new pull request, #16886:
URL: https://github.com/apache/pinot/pull/16886

   ## Description
   A tenant rebalance job could not be cancelled through the API because 
`TenantRebalancer`, unlike `TableRebalancer`, neither synced with nor checked 
ZK. Since https://github.com/apache/pinot/pull/16455, `TenantRebalancer` has 
relied on `ZkBasedTenantRebalanceObserver` to write the rebalance context to 
ZK, but it never reads back any updates made to that context in ZK. This 
introduces two problems:
   1. When `TenantRebalanceChecker` determines that a job is stuck and aborts 
it, it clears the queues in the context on ZK and spawns a new tenant 
rebalance job. If the job was in fact not stuck and was still being run by one 
of the controllers, that controller would not be aware that its current job 
had been aborted and would go on to process the next table in the queue. It 
would then write its rebalance context to ZK, overwriting the aborted job 
context and making the abort ineffective.
   2. When adding new features that modify the tenant rebalance queues, such 
as tenant rebalance job cancellation, the controller has no way to learn about 
updates made to the context on ZK; it sticks with its local context instead. 
As a result, the controller never reads any update made to a tenant rebalance 
job.
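
The failure mode in item 1 can be reproduced with a minimal single-threaded sketch. Everything here is hypothetical: `Ctx`, `StaleContextDemo`, and the `AtomicReference` standing in for the ZK node are illustration-only names, not Pinot classes. The first worker keeps a local copy of the queue and writes it back after every table; the second re-reads the shared context before each table.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical snapshot of a tenant rebalance context (not a Pinot class).
final class Ctx {
  final List<String> queue;
  final boolean aborted;

  Ctx(List<String> queue, boolean aborted) {
    this.queue = queue;
    this.aborted = aborted;
  }
}

final class StaleContextDemo {
  // Simulates the checker aborting the job: empty queue, aborted flag set.
  static void abort(AtomicReference<Ctx> store) {
    store.set(new Ctx(Collections.<String>emptyList(), true));
  }

  // Worker that reads the context once and only ever writes its local copy back.
  static List<String> runWithLocalCopy(AtomicReference<Ctx> store, int abortAfter) {
    Deque<String> local = new ArrayDeque<>(store.get().queue);
    List<String> processed = new ArrayList<>();
    while (!local.isEmpty()) {
      if (processed.size() == abortAfter) {
        abort(store); // the abort happens while the worker is still running
      }
      processed.add(local.poll());
      // Writing the stale local view back overwrites the aborted context.
      store.set(new Ctx(new ArrayList<>(local), false));
    }
    return processed;
  }

  // Worker that treats the shared store as the source of truth before each table.
  static List<String> runPollingStore(AtomicReference<Ctx> store, int abortAfter) {
    List<String> processed = new ArrayList<>();
    while (true) {
      if (processed.size() == abortAfter) {
        abort(store);
      }
      Ctx current = store.get();
      if (current.aborted || current.queue.isEmpty()) {
        break; // the abort is observed, so no further table is picked up
      }
      String table = current.queue.get(0);
      List<String> rest = new ArrayList<>(current.queue.subList(1, current.queue.size()));
      store.set(new Ctx(rest, false));
      processed.add(table);
    }
    return processed;
  }
}
```

With a three-table queue and an abort after the first table, the local-copy worker still processes all three tables and leaves the stored context marked as not aborted, while the polling worker stops after one table. The read-then-write in the sketch is not atomic; an implementation against a real store would presumably need versioned or otherwise guarded writes.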
   
   ## Change
   * The tenant rebalancer now depends on `ZkBasedTenantRebalanceObserver` to 
poll tables from the queue and to update the status when a job is done. The 
tenant rebalance job metadata on ZK is the single source of truth from which 
the controller reads the context.
   * Add `DELETE /tenants/rebalance/{jobId}` API to cancel a tenant rebalance 
job
   * Change the tenant rebalance progress status of each table:
       `UNPROCESSED -> IN_QUEUE`,
       `PROCESSING -> REBALANCING`,
       `PROCESSED -> DONE`,
       `(new) CANCELLED // cancelled by user`,
       `ABORTED // cancelled by TenantRebalanceChecker`,
       `NOT_SCHEDULED // tables IN_QUEUE will be marked as NOT_SCHEDULED once 
the rebalance job is cancelled/aborted`
   * Consolidate the duplicate code that marks a table rebalance job as 
aborted/cancelled into `TableRebalanceManager.cancelRebalance`
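
As a rough sketch of how a client might call the new `DELETE /tenants/rebalance/{jobId}` endpoint, the snippet below builds the request with the JDK's `java.net.http` API. The class name, the controller base URL, and the job ID passed in are assumptions for illustration; authentication headers and any query parameters the endpoint may accept are omitted.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of Pinot.
final class CancelTenantRebalanceRequest {
  // Builds a DELETE request for the tenant rebalance job with the given ID.
  // controllerBaseUrl is an assumed value such as "http://localhost:9000".
  static HttpRequest build(String controllerBaseUrl, String jobId) {
    return HttpRequest.newBuilder()
        .uri(URI.create(controllerBaseUrl + "/tenants/rebalance/" + jobId))
        .DELETE()
        .build();
  }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; the response format is not described here, so the sketch does not assume one.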
   
   ## Testing
   
   ### Basic usage verified via quickstart:
   Status before cancellation:
   ```
   {
     "timeElapsedSinceStartInSeconds": 61,
     "tenantRebalanceProgressStats": {
       "startTimeMs": 1758737223513,
       "totalTables": 10,
       "completionStatusMsg": null,
       "timeToFinishInSeconds": 0,
       "tableStatusMap": {
         "airlineStats_OFFLINE": "DONE",
         "testUnnest_OFFLINE": "DONE",
         "baseballStats_OFFLINE": "REBALANCING",
         "dimBaseballTeams_OFFLINE": "IN_QUEUE",
         "fineFoodReviews_OFFLINE": "IN_QUEUE",
         "clickstreamFunnel_OFFLINE": "IN_QUEUE",
         "starbucksStores_OFFLINE": "IN_QUEUE",
         "githubEvents_OFFLINE": "IN_QUEUE",
         "githubComplexTypeEvents_OFFLINE": "IN_QUEUE",
         "billing_OFFLINE": "IN_QUEUE"
       },
       "remainingTables": 8,
       "tableRebalanceJobIdMap": {
         "airlineStats_OFFLINE": "99212151-701b-40ab-a58e-8a6b2ea40097",
         "testUnnest_OFFLINE": "53df78b6-ae78-4aec-b021-cd7ccaadd916",
         "baseballStats_OFFLINE": "d17e2d0b-c2a6-479f-8e9b-c824d60a97ab"
       }
     }
   }
   ```
   After cancellation:
   ```
   {
     "timeElapsedSinceStartInSeconds": 74,
     "tenantRebalanceProgressStats": {
       "startTimeMs": 1758737223513,
       "totalTables": 10,
       "completionStatusMsg": "Tenant rebalance job has been cancelled.",
       "timeToFinishInSeconds": 74,
       "tableStatusMap": {
         "airlineStats_OFFLINE": "DONE",
         "testUnnest_OFFLINE": "DONE",
         "baseballStats_OFFLINE": "CANCELLED",
         "dimBaseballTeams_OFFLINE": "NOT_SCHEDULED",
         "fineFoodReviews_OFFLINE": "NOT_SCHEDULED",
         "clickstreamFunnel_OFFLINE": "NOT_SCHEDULED",
         "starbucksStores_OFFLINE": "NOT_SCHEDULED",
         "githubEvents_OFFLINE": "NOT_SCHEDULED",
         "githubComplexTypeEvents_OFFLINE": "NOT_SCHEDULED",
         "billing_OFFLINE": "NOT_SCHEDULED"
       },
       "remainingTables": 0,
       "tableRebalanceJobIdMap": {
         "airlineStats_OFFLINE": "99212151-701b-40ab-a58e-8a6b2ea40097",
         "testUnnest_OFFLINE": "53df78b6-ae78-4aec-b021-cd7ccaadd916",
         "baseballStats_OFFLINE": "d17e2d0b-c2a6-479f-8e9b-c824d60a97ab"
       }
     }
   }
   ```
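
The two snapshots above are consistent with the following per-table transition on user cancellation. This is only a sketch inferred from the status list and the sample output: the enum and method names are hypothetical, and the actual implementation may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Status names taken from the PR description; the enum itself is illustrative.
enum TableStatus { IN_QUEUE, REBALANCING, DONE, CANCELLED, ABORTED, NOT_SCHEDULED }

final class CancelTransition {
  // Returns the table status map as it would look after a user cancels the job.
  static Map<String, TableStatus> onUserCancel(Map<String, TableStatus> before) {
    Map<String, TableStatus> after = new LinkedHashMap<>();
    for (Map.Entry<String, TableStatus> e : before.entrySet()) {
      switch (e.getValue()) {
        case REBALANCING:
          // The table currently being rebalanced is cancelled.
          after.put(e.getKey(), TableStatus.CANCELLED);
          break;
        case IN_QUEUE:
          // Tables that never started are marked as not scheduled.
          after.put(e.getKey(), TableStatus.NOT_SCHEDULED);
          break;
        default:
          // Finished (or already terminal) tables keep their status.
          after.put(e.getKey(), e.getValue());
      }
    }
    return after;
  }
}
```

Applied to the "before" map, this leaves the two `DONE` tables unchanged, turns `baseballStats_OFFLINE` into `CANCELLED`, and marks the seven queued tables `NOT_SCHEDULED`, matching the "after" snapshot.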


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
