snleee opened a new issue, #13171:
URL: https://github.com/apache/pinot/issues/13171

   While cleaning up the segment lineage and deleting segments in the retention 
manager, we observe that the operation frequently fails when many segment 
uploads are in progress on the controller.
   
   The segment upload path is already serialized by a `synchronized` block 
(see `PinotHelixResourceManager.assignTableSegment()`) that uses the table 
lock.
   
   The retention manager, on the other hand, updates the idealstate without 
grabbing that lock, so its updates frequently fail.
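
To illustrate why an update made without the lock can lose to concurrent uploads, here is a minimal sketch. It assumes the idealstate write is an optimistic, version-checked write (this issue does not confirm that). `VersionedRecord` and its methods are hypothetical stand-ins, not Pinot or Helix classes.

```java
public class OptimisticWriteSketch {
  // Hypothetical stand-in for a versioned record: a write succeeds only if
  // the caller's expected version still matches the stored version.
  static final class VersionedRecord {
    private int version = 0;
    private String data = "";

    synchronized int readVersion() {
      return version;
    }

    synchronized String readData() {
      return data;
    }

    synchronized boolean writeIfVersionMatches(int expectedVersion, String newData) {
      if (expectedVersion != version) {
        return false; // another writer updated the record in between
      }
      data = newData;
      version++;
      return true;
    }
  }

  public static void main(String[] args) {
    VersionedRecord idealState = new VersionedRecord();

    // The retention task reads the record first...
    int versionSeenByRetention = idealState.readVersion();

    // ...then a segment upload writes before the retention task does.
    boolean uploadOk =
        idealState.writeIfVersionMatches(idealState.readVersion(), "segment added");

    // The retention write now carries a stale version and is rejected.
    boolean retentionOk =
        idealState.writeIfVersionMatches(versionSeenByRetention, "segment removed");

    System.out.println("upload=" + uploadOk + ", retention=" + retentionOk);
  }
}
```

Under this assumption, every upload that lands between the retention task's read and write forces another retry, so a small fixed number of attempts can be exhausted during heavy upload traffic.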
   
   Potential Improvements:
   
   1. Use a better retry policy for updating the segment lineage (see 
`DEFAULT_TABLE_IDEALSTATES_UPDATE_RETRY_POLICY`).
   2. Grab the table lock for the idealstate update in the retention 
manager (see `PinotHelixResourceManager.assignTableSegment`).
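
The two options above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Pinot implementation: `TableUpdateSketch`, `lockFor`, `retryWithBackoff`, and `updateUnderTableLock` are hypothetical names, and the simulated update is just a `BooleanSupplier` that returns `true` on success.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

public class TableUpdateSketch {
  // One lock object per table name (hypothetical stand-in for the table lock).
  private static final Map<String, Object> TABLE_LOCKS = new ConcurrentHashMap<>();

  static Object lockFor(String tableNameWithType) {
    return TABLE_LOCKS.computeIfAbsent(tableNameWithType, k -> new Object());
  }

  // Option 1: retry with randomized exponential backoff.
  // Returns true as soon as one attempt succeeds, false if all attempts fail.
  static boolean retryWithBackoff(int maxAttempts, long initialDelayMs, double scale,
      BooleanSupplier update) {
    long delayMs = Math.max(1L, initialDelayMs);
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (update.getAsBoolean()) {
        return true;
      }
      if (attempt < maxAttempts) {
        try {
          // Sleep for a random duration in [delayMs, 2 * delayMs).
          Thread.sleep(delayMs + ThreadLocalRandom.current().nextLong(delayMs));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
        delayMs = Math.max(1L, (long) (delayMs * scale));
      }
    }
    return false;
  }

  // Option 2: run the update while holding the same per-table lock as uploads,
  // so the two paths no longer race within one controller process.
  static boolean updateUnderTableLock(String tableNameWithType, BooleanSupplier update) {
    synchronized (lockFor(tableNameWithType)) {
      return update.getAsBoolean();
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated update that fails twice, then succeeds on the third attempt.
    boolean ok = retryWithBackoff(5, 10L, 2.0, () -> ++calls[0] >= 3);
    System.out.println(ok + " after " + calls[0] + " attempts");
    System.out.println(updateUnderTableLock("tableXXX_OFFLINE", () -> true));
  }
}
```

Note that a JVM-local lock only serializes work inside a single controller process. If several controllers can update the same table, backoff (or some other cross-process coordination) would presumably still be needed.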
   
   
   
   
   Here is an example where the idealstate update failed during segment 
deletion from the retention manager:
   ```
   java.lang.RuntimeException: Caught exception while updating ideal state for resource: tableXXX_OFFLINE
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:203) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:232) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.common.utils.helix.HelixHelper.removeSegmentsFromIdealState(HelixHelper.java:503) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.deleteSegments(PinotHelixResourceManager.java:1030) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.deleteSegments(PinotHelixResourceManager.java:1013) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.lambda$manageSegmentLineageCleanupForTable$0(RetentionManager.java:223) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:58) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
           at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
           at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
           at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
   Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
           at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:104) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           ... 20 more
   ```
   
   We also observe that updating the segment lineage znode sometimes fails:
   ```
   2024/05/16 01:59:39.041 ERROR [RetentionManager] [pool-20-thread-4] Failed to clean up the segment lineage. (tableName = tableXXX_OFFLINE)
   org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
           at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
           at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
           at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
           at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
   2024/05/16 01:59:39.041 ERROR [ControllerPeriodicTask] [pool-20-thread-4] Caught exception while processing table: tableXXX_OFFLINE in task: RetentionManager
   java.lang.RuntimeException: Failed to clean up the segment lineage. (tableName = tableXXX_OFFLINE)
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:237) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
           at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
           at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
           at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
   Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
           at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
           ... 13 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

