KKcorps opened a new pull request, #14794:
URL: https://github.com/apache/pinot/pull/14794

   # Pauseless Ingestion Failure Resolution
   
   
   Please refer to PR https://github.com/apache/pinot/pull/14741 for the happy 
path. This PR covers only the failure scenarios. Once that PR is merged, the 
diff here will show only the failure-handling changes.
   
   To view only the diff covering the failure scenarios, for the time being, refer to: 
   
   ## Summary 
   This PR aims to provide ways to resolve the failure scenarios that we can 
encounter during pauseless ingestion. The detailed list of failure scenarios 
can be found here: 
[link](https://docs.google.com/document/d/1d-xttk7sXFIOqfyZvYw5W_KeGS6Ztmi8eYevBcCrT_c/edit?tab=t.0#heading=h.hjzp2hlg4d4o)
  along with the failure handling strategies: 
[link](https://docs.google.com/document/d/1d-xttk7sXFIOqfyZvYw5W_KeGS6Ztmi8eYevBcCrT_c/edit?tab=t.0#heading=h.32w9tdojyszg)
    
   The following sequence diagrams summarize the failure scenarios and their 
resolution. 
    ![Screenshot 2025-01-03 at 2 53 46 
PM](https://github.com/user-attachments/assets/4a1155cd-fd7f-4832-91ac-16b2d4851963)
   ![Screenshot 2025-01-03 at 2 54 45 
PM](https://github.com/user-attachments/assets/a9b01529-2331-4a9d-8e73-423c23eefb2c)
   
   ## Failure Scenarios & Resolution Approaches
   
   
   Failures encountered during the commit protocol can be categorized into two 
types: recoverable and unrecoverable failures.
   
   **Recoverable failures** are those in which at least one of the servers 
retains the segment on disk.
   
   **Unrecoverable failures** occur when none of the servers have the segment 
on disk.
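The distinction above can be sketched as a small predicate. This is an illustration only: `FailureClassifier`, `FailureType`, and the server-to-boolean map are hypothetical names, not classes or structures from the Pinot codebase.

```java
import java.util.Map;

// Hypothetical helper, not part of Pinot: classifies a commit failure by
// whether any replica still holds the built segment on local disk.
final class FailureClassifier {
  enum FailureType { RECOVERABLE, UNRECOVERABLE }

  // segmentOnDiskByServer maps a server instance id to true when that
  // server still has the segment on disk.
  static FailureType classify(Map<String, Boolean> segmentOnDiskByServer) {
    boolean anyCopy = segmentOnDiskByServer.values().stream()
        .anyMatch(Boolean::booleanValue);
    return anyCopy ? FailureType.RECOVERABLE : FailureType.UNRECOVERABLE;
  }
}
```

A single surviving copy is enough for the failure to count as recoverable; an empty map (no replicas at all) is treated as unrecoverable.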
   
   ### Recoverable Failures
   
   Recoverable failures will be addressed through 
RealtimeSegmentValidationManager. This approach will handle scenarios such as 
**upload** failures and **incomplete** commit protocol executions.
   
   The controller or server can run into issues in between any of the steps of 
the commit protocol as listed below:
   
   Request Type: **COMMIT_START** 
   1. Update the Segment ZK metadata for the committing segment (seg__0__0)
       - Change status to **COMMITTING**
       - Set endOffset
   2. Create Segment ZK metadata for the new segment (seg__0__1) with status 
IN_PROGRESS 
   3. Update the Ideal State for: 
       - Committing segment (seg__0__0) to ONLINE
       - New/consuming segment (seg__0__1) to CONSUMING
   
   Request Type: **COMMIT_END_METADATA**
   4. Update Segment ZK metadata for the committing segment (seg__0__0):
       - Change status to DONE.
       - Update deepstore url.
       - Any additional metadata.
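The four steps above can be modeled in memory to make the intermediate states explicit. This is a minimal sketch, not the controller implementation: `CommitProtocolSketch`, `SegmentMeta`, and the two maps are hypothetical stand-ins for segment ZK metadata and the Ideal State.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory model only; none of these types are Pinot classes.
final class CommitProtocolSketch {
  enum Status { IN_PROGRESS, COMMITTING, DONE }

  static final class SegmentMeta {
    Status status;
    long endOffset = -1;   // -1 means "not set yet"
    String downloadUrl;    // deep store URL, set in step 4
    SegmentMeta(Status status) { this.status = status; }
  }

  // Stand-in for segment ZK metadata: segment name -> metadata.
  final Map<String, SegmentMeta> zkMetadata = new HashMap<>();
  // Stand-in for the Ideal State: segment name -> ONLINE / CONSUMING.
  final Map<String, String> idealState = new HashMap<>();

  // COMMIT_START covers steps 1 to 3.
  void commitStart(String committing, String next, long endOffset) {
    SegmentMeta meta = zkMetadata.get(committing);
    meta.status = Status.COMMITTING;                            // step 1
    meta.endOffset = endOffset;                                 // step 1
    zkMetadata.put(next, new SegmentMeta(Status.IN_PROGRESS));  // step 2
    idealState.put(committing, "ONLINE");                       // step 3
    idealState.put(next, "CONSUMING");                          // step 3
  }

  // COMMIT_END_METADATA covers step 4.
  void commitEndMetadata(String committing, String deepStoreUrl) {
    SegmentMeta meta = zkMetadata.get(committing);
    meta.status = Status.DONE;
    meta.downloadUrl = deepStoreUrl;
  }
}
```

A crash between any two of the numbered assignments leaves the model in one of the partial states that the recovery logic has to recognize.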
       
   The RealtimeSegmentValidationManager determines which step of the commit 
protocol failed and how it can be fixed. This is very similar to how commit 
protocol failures were handled before, with some minor changes.
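One way to picture this diagnosis is as a function from the observed state to the first step that still needs to run. The sketch below is an assumption about the general shape of such a check, not the actual RealtimeSegmentValidationManager logic; `CommitRepairSketch`, `Repair`, and the three inputs are hypothetical.

```java
// Hypothetical diagnosis helper, not Pinot code. It assumes the steps always
// run in the order listed above, so the first missing effect identifies where
// the protocol stopped.
final class CommitRepairSketch {
  enum Repair {
    NONE,                          // nothing to fix
    CREATE_NEXT_SEGMENT_METADATA,  // resume from step 2
    UPDATE_IDEAL_STATE,            // resume from step 3
    COMPLETE_COMMIT_END_METADATA   // steps 1 to 3 done, step 4 missing
  }

  static Repair diagnose(String committingStatus,
                         boolean nextMetadataExists,
                         boolean idealStateUpdated) {
    if (!"COMMITTING".equals(committingStatus)) {
      // DONE: the commit finished. Anything else: step 1 never ran, so the
      // segment is still consuming and no commit repair applies.
      return Repair.NONE;
    }
    if (!nextMetadataExists) {
      return Repair.CREATE_NEXT_SEGMENT_METADATA;
    }
    if (!idealStateUpdated) {
      return Repair.UPDATE_IDEAL_STATE;
    }
    return Repair.COMPLETE_COMMIT_END_METADATA;
  }
}
```

For example, `diagnose("COMMITTING", true, false)` returns `UPDATE_IDEAL_STATE` in this sketch: the metadata for both segments exists, but the Ideal State still has to be updated.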
   
   ### Unrecoverable Failures
   
   These failures require ingesting the segment again from upstream, then 
building it, uploading it, and updating its ZK metadata. 
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

