[GitHub] [pinot] ankitsultana opened a new issue, #11636: Concurrent Offline Table Segment Uploads Can Lead to Error State

via GitHub Wed, 20 Sep 2023 13:23:51 -0700


ankitsultana opened a new issue, #11636:
URL: https://github.com/apache/pinot/issues/11636


   For two of our use-cases we recently started seeing segments stuck in the 
error state. On debugging, we found the cause: uploading offline table 
segments concurrently across different controllers is not safe.
   
   I won't go into the full root cause, but here are some notes:
   
   * The segment upload path triggered by the POST /segments API takes an 
in-memory lock before updating the ideal state, so concurrently uploading 
segments via the same controller should be fine.
   * The issue is more likely to be hit as you increase concurrency or 
IdealState size.
   * The bad segments were caused by the segment metadata being deleted even 
though the servers had already started the OFFLINE ==> ONLINE transition.
   * Recovering from the bad state is hard: we had to delete the affected 
segments and re-upload them to fix the situation.
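
To illustrate the first note, here is a minimal, single-threaded sketch of why a lock held in one controller's memory cannot serialize updates made by another controller. `VersionedStore`, `Snapshot`, and `ControllerSim` are hypothetical stand-ins, not Pinot or Helix classes, and the version-checked write is an assumption about how conflicting IdealState updates get detected. It shows one way such a race can play out, not the actual root cause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Immutable view of the shared record at a given version (hypothetical). */
final class Snapshot {
    final List<String> segments;
    final int version;

    Snapshot(List<String> segments, int version) {
        this.segments = segments;
        this.version = version;
    }
}

/** Hypothetical stand-in for a shared, versioned IdealState-like record. */
final class VersionedStore {
    private List<String> segments = new ArrayList<>();
    private int version = 0;

    synchronized Snapshot read() {
        return new Snapshot(new ArrayList<>(segments), version);
    }

    /** Writes only if nobody else has written since expectedVersion was read. */
    synchronized boolean compareAndSet(int expectedVersion, List<String> updated) {
        if (expectedVersion != version) {
            return false;
        }
        segments = new ArrayList<>(updated);
        version++;
        return true;
    }
}

/** Hypothetical controller: its lock lives in this object only. */
final class ControllerSim {
    private final ReentrantLock localLock = new ReentrantLock();
    private final VersionedStore store;

    ControllerSim(VersionedStore store) {
        this.store = store;
    }

    /** Read-modify-write under the local lock; the hook runs between read and write. */
    boolean addSegment(String name, Runnable betweenReadAndWrite) {
        localLock.lock();
        try {
            Snapshot snapshot = store.read();
            betweenReadAndWrite.run(); // lets the demo interleave another controller
            List<String> updated = new ArrayList<>(snapshot.segments);
            updated.add(name);
            return store.compareAndSet(snapshot.version, updated);
        } finally {
            localLock.unlock();
        }
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        ControllerSim a = new ControllerSim(store);
        ControllerSim b = new ControllerSim(store);
        boolean[] bSucceeded = new boolean[1];
        // B writes while A still holds A's own lock; A's write is then stale.
        boolean aSucceeded = a.addSegment("seg_a",
                () -> bSucceeded[0] = b.addSegment("seg_b", () -> { }));
        System.out.println("A succeeded: " + aSucceeded + ", B succeeded: " + bSucceeded[0]);
        System.out.println("Stored segments: " + store.read().segments);
    }
}
```

In this sketch controller A holds its own lock for the whole update, yet controller B still writes in between because B only ever takes B's lock. A's write is then rejected as stale and would have to be retried.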
   
   This exception was seen in the server:
   
   ```
   Caught exception in state transition from OFFLINE -> ONLINE for resource: <table-name>, partition: <segment-name>"}
   java.lang.NullPointerException: null
           at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:882)
           at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:401)
           at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:163)
   ```
   
   And this was seen in the controller:
   
   ```
   java.lang.RuntimeException: Caught exception while updating ideal state for resource: <table-name>
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:169)
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:193)
           at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2137)
           at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:294)
           at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:82)
           at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:360)
           at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsJson(PinotSegmentUploadDownloadRestletResource.java:481)
           at jdk.internal.reflect.GeneratedMethodAccessor343.invoke(Unknown Source)
           ...
   Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 20 attempts
           at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
           at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:98)
   ```
   
   The easiest solution to this problem is to use a single controller for 
concurrent uploads, or to do sequential uploads in the offline ingestion 
pipeline, which is what we will be doing. I'm creating this ticket in case 
someone is interested in doing a native fix for this.
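
As a rough sketch of the sequential-upload workaround, a pipeline that produces segments on many threads could funnel its uploads through a single worker thread so that at most one upload is in flight at a time. `SerializedSegmentUploader` and its `uploadOne` callback are hypothetical names. In a real pipeline the callback would issue the POST /segments request and report whether it succeeded; that part is deliberately left out here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical helper that serializes segment uploads onto one worker thread. */
final class SerializedSegmentUploader implements AutoCloseable {
    // A single-thread executor runs submitted tasks one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Assumed callback: performs one upload and returns true on success.
    private final Predicate<String> uploadOne;

    SerializedSegmentUploader(Predicate<String> uploadOne) {
        this.uploadOne = uploadOne;
    }

    /** Queues one upload; callers on any thread may submit concurrently. */
    Future<Boolean> submit(String segmentName) {
        Callable<Boolean> task = () -> uploadOne.test(segmentName);
        return worker.submit(task);
    }

    /** Stops accepting new uploads; already queued uploads still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Callers keep their own concurrency for building segments and only wait on the returned `Future` when they need the upload result. This only serializes uploads that go through one such helper inside one process, so every producer has to share it for the workaround to hold.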
   


