snleee opened a new issue, #13171: URL: https://github.com/apache/pinot/issues/13171
When the retention manager cleans up the segment lineage and deletes segments, the operation frequently fails while the controller is handling a large number of segment uploads. The segment upload path is already serialized with a `synchronized` block on the table lock (see `PinotHelixResourceManager.assignTableSegment()`). The retention manager, on the other hand, tries to update the ideal state without grabbing that lock, so its updates frequently fail.

Potential improvements:

1. Use a better retry policy for updating the segment lineage (see `DEFAULT_TABLE_IDEALSTATES_UPDATE_RETRY_POLICY`).
2. Grab the table lock for the ideal state update in the retention manager (see `PinotHelixResourceManager.assignTableSegment()`).

Here is an example where the ideal state update failed during a segment delete from the retention manager:

```
java.lang.RuntimeException: Caught exception while updating ideal state for resource: tableXXX_OFFLINE
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:203) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:232) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.common.utils.helix.HelixHelper.removeSegmentsFromIdealState(HelixHelper.java:503) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.deleteSegments(PinotHelixResourceManager.java:1030) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.deleteSegments(PinotHelixResourceManager.java:1013) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.lambda$manageSegmentLineageCleanupForTable$0(RetentionManager.java:223) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:58) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:104) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    ... 20 more
```

We also observe that updating the segment lineage znode sometimes fails:

```
2024/05/16 01:59:39.041 ERROR [RetentionManager] [pool-20-thread-4] Failed to clean up the segment lineage. (tableName = tableXXX_OFFLINE)
org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024/05/16 01:59:39.041 ERROR [ControllerPeriodicTask] [pool-20-thread-4] Caught exception while processing table: tableXXX_OFFLINE in task: RetentionManager
java.lang.RuntimeException: Failed to clean up the segment lineage. (tableName = tableXXX_OFFLINE)
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:237) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.processTable(RetentionManager.java:86) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTable(ControllerPeriodicTask.java:145) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.processTables(ControllerPeriodicTask.java:118) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.periodictask.ControllerPeriodicTask.runTask(ControllerPeriodicTask.java:81) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:150) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.BasePeriodicTask.run(BasePeriodicTask.java:135) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.core.periodictask.PeriodicTaskScheduler.lambda$start$0(PeriodicTaskScheduler.java:87) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    at org.apache.pinot.controller.helix.core.retention.RetentionManager.manageSegmentLineageCleanupForTable(RetentionManager.java:195) ~[startree-pinot-all-1.2.0-ST.10.1-jar-with-dependencies.jar:1.2.0-ST.10.1-8711a0aa760c824774f3ceb1a2913878a31071ec]
    ... 13 more
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org
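The contention described in the issue can be illustrated with a small, self-contained sketch. It is not Pinot or Helix code: `VersionedState`, `tableLock`, `upload`, and `updateWithRetries` are hypothetical stand-ins, and the sketch assumes the ideal state update behaves like a versioned conditional write that fails when another writer got in first. Under that assumption, a lock-free updater with a fixed budget of 5 attempts runs out of retries when an upload lands between every read and write, while an updater that holds the same per-table lock as the upload path succeeds on its first attempt.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.UnaryOperator;

public class TableLockSketch {

  /** Hypothetical stand-in for a versioned znode such as the ideal state. */
  static final class VersionedState {
    private int version;
    private Set<String> segments = new TreeSet<>();

    synchronized int version() { return version; }

    synchronized Set<String> segments() { return new TreeSet<>(segments); }

    /** Conditional write: succeeds only if nobody wrote since expectedVersion was read. */
    synchronized boolean writeIfVersion(int expectedVersion, Set<String> updated) {
      if (version != expectedVersion) {
        return false;
      }
      segments = new TreeSet<>(updated);
      version++;
      return true;
    }
  }

  private static final ConcurrentMap<String, Object> TABLE_LOCKS = new ConcurrentHashMap<>();

  /** One lock object per table name (assumed analogue of the table lock in the issue). */
  static Object tableLock(String table) {
    return TABLE_LOCKS.computeIfAbsent(table, t -> new Object());
  }

  /** Upload path: serialized on the table lock, retrying its own write until it lands. */
  static void upload(VersionedState state, String table, String segment) {
    synchronized (tableLock(table)) {
      boolean written;
      do {
        int readVersion = state.version();
        Set<String> updated = state.segments();
        updated.add(segment);
        written = state.writeIfVersion(readVersion, updated);
      } while (!written);
    }
  }

  /** Read-modify-write loop; returns the attempt that succeeded, or -1 if attempts ran out. */
  static int updateWithRetries(VersionedState state, UnaryOperator<Set<String>> updater,
      int maxAttempts, Runnable concurrentWriter) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      int readVersion = state.version();
      Set<String> updated = updater.apply(state.segments());
      concurrentWriter.run(); // simulates a write landing between our read and our write
      if (state.writeIfVersion(readVersion, updated)) {
        return attempt;
      }
    }
    return -1;
  }

  /** Lock-free delete while an upload sneaks in on every attempt. */
  public static int lockFreeDemo() {
    String table = "tableXXX_OFFLINE";
    VersionedState state = new VersionedState();
    upload(state, table, "old");
    int[] counter = {0};
    return updateWithRetries(state, s -> { s.remove("old"); return s; }, 5,
        () -> upload(state, table, "new_" + counter[0]++));
  }

  /** Delete under the table lock while a real uploader thread runs concurrently. */
  public static int lockedDemo() {
    String table = "tableXXX_OFFLINE";
    VersionedState state = new VersionedState();
    upload(state, table, "old");
    Thread uploader = new Thread(() -> {
      for (int i = 0; i < 1000; i++) {
        upload(state, table, "new_" + i);
      }
    });
    uploader.start();
    int attempts;
    synchronized (tableLock(table)) {
      attempts = updateWithRetries(state, s -> { s.remove("old"); return s; }, 5, () -> { });
    }
    try {
      uploader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println("lock-free attempts: " + lockFreeDemo());
    System.out.println("locked attempts: " + lockedDemo());
  }
}
```

In this sketch the lock only helps because every writer takes it. A more forgiving retry policy (more attempts, backoff) would presumably reduce the failure rate of the lock-free path without removing the race, which is why the two proposals in the issue are complementary rather than alternatives.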