ankitsultana opened a new issue, #11636: URL: https://github.com/apache/pinot/issues/11636
For two of our use-cases, we recently started seeing segments stuck in the ERROR state. On debugging, we found the cause: uploading offline table segments concurrently across different controllers is not safe.

I won't go into the full root cause, but here are some notes:

* An in-memory lock is taken to update the ideal state in the segment upload path, which is triggered by an upload to the POST /segments API. So concurrently uploading segments via the same controller should be fine.
* The issue is more likely to be hit as you increase concurrency or IdealState size.
* The bad segments were caused because the segment metadata was deleted but the servers had already started the OFFLINE ==> ONLINE transition.
* Recovering from a bad state is hard: we had to delete the segments and re-upload them to fix the situation.

This exception was seen in the server:

```
Caught exception in state transition from OFFLINE -> ONLINE for resource: <table-name>, partition: <segment-name>"}
java.lang.NullPointerException: null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:882)
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:401)
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:163)
```

And this was seen in the controller:

```
java.lang.RuntimeException: Caught exception while updating ideal state for resource: <table-name>
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:169)
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:193)
    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2137)
    at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:294)
    at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:82)
    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:360)
    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsJson(PinotSegmentUploadDownloadRestletResource.java:481)
    at jdk.internal.reflect.GeneratedMethodAccessor343.invoke(Unknown Source)
    ...
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 20 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
    at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:98)
```

The easiest solution to this problem is to use a single controller for concurrent uploads, or to do sequential uploads in the offline ingestion pipeline, which is what we will be doing. I'm creating this ticket in case someone is interested in doing a native fix for this.
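For anyone wanting to apply the workaround in their own ingestion pipeline, here is a minimal sketch of "sequential uploads through a single controller". It is an illustration under assumptions, not Pinot code: `upload_sequentially` and the `upload_one` callable are hypothetical names, and `upload_one` stands in for whatever your pipeline already uses to call POST /segments. The sketch also assumes a successful upload returns HTTP 200.

```python
from typing import Callable, Iterable, List

# Hypothetical uploader signature: (controller_url, segment_path) -> HTTP status.
# In a real pipeline this would wrap the existing call to POST /segments.
UploadFn = Callable[[str, str], int]


def upload_sequentially(
    controller_url: str,
    segment_paths: Iterable[str],
    upload_one: UploadFn,
    ok_status: int = 200,  # assumption: success is reported as HTTP 200
) -> List[str]:
    """Upload segments one at a time, always through the same controller.

    Each upload finishes before the next one starts, so no two IdealState
    updates from this pipeline are in flight at once. Stops at the first
    failure instead of continuing to push further updates.
    """
    uploaded: List[str] = []
    for path in segment_paths:
        status = upload_one(controller_url, path)
        if status != ok_status:
            raise RuntimeError(
                f"upload of {path} failed with HTTP {status}; "
                f"uploaded so far: {uploaded}"
            )
        uploaded.append(path)
    return uploaded
```

Pinning every call to one `controller_url` matters as much as the loop itself: the lock described in the notes is in-memory, so it only serializes uploads that go through the same controller process.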