swaminathanmanish commented on PR #13036:
URL: https://github.com/apache/pinot/pull/13036#issuecomment-2087081060

   > We have intermittently seen issues in our clusters while creating the
   > StreamMessageDecoder. Stack trace:
   > 
   > ```
   > java.lang.RuntimeException: Caught exception while creating StreamMessageDecoder from stream config: StreamConfig{_type='kafka', _topicName='<redacted>', _consumerTypes=[LOWLEVEL], _consumerFactoryClassName='redacted>', _offsetCriteria='OffsetCriteria{_offsetType=LARGEST, _offsetString='largest'}', _connectionTimeoutMillis=30000, _fetchTimeoutMillis=5000, _idleTimeoutMillis=180000, _flushThresholdRows=80000000, _flushThresholdTimeMillis=86400000, _flushSegmentDesiredSizeBytes=209715200, _flushAutotuneInitialRows=100000, _decoderClass='redacted', _decoderProperties={}, _groupId='null', _topicConsumptionRateLimit=-1.0, _tableNameWithType='redacted'}
   > at org.apache.pinot.spi.stream.StreamDecoderProvider.create(StreamDecoderProvider.java:48)
   > at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1424)
   > at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:446)
   > at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:228)
   > at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:168)
   > at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:83)
   > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   > at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   > at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:350)
   > at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:278)
   > at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
   > at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
   > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   > at java.base/java.lang.Thread.run(Thread.java:829)
   > ```
   > 
   > This stops consumption in one of the replicas, and once the other replica
   > starts committing, the stopped replica always ends up in ERROR state. The
   > only way to fix this is to reset this replica's segment. Leaving one
   > replica not consuming is also dangerous: if the other replica's host
   > restarts or goes down for any reason, it can cause data loss.
   > 
   > Having a retry policy during StreamMessageDecoder.create() may help reduce
   > the chances of such scenarios.
   
   What kind of failures are these? Are they all transient failures that are
   likely to go away on retry?
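
For reference, the retry idea proposed in the quoted comment could look roughly like the sketch below. It is not Pinot's API and not the change in this PR: `RetrySketch`, `retryWithBackoff`, `maxAttempts`, and `initialDelayMs` are hypothetical names, and the sketch assumes the decoder-creation failures really are transient, which is exactly the open question above.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Pinot.
public class RetrySketch {

  /**
   * Runs the task up to maxAttempts times, sleeping between failed attempts
   * and doubling the delay each time. Rethrows the last failure if every
   * attempt fails.
   */
  public static <T> T retryWithBackoff(Callable<T> task, int maxAttempts, long initialDelayMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delayMs = initialDelayMs;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (InterruptedException e) {
        // Do not retry on interruption; restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        lastFailure = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delayMs);
          delayMs *= 2; // exponential backoff
        }
      }
    }
    throw lastFailure;
  }
}
```

A caller would wrap the decoder creation in the task, for example `RetrySketch.retryWithBackoff(() -> createDecoder(config), 3, 1000L)`, where `createDecoder` stands in for whatever call currently throws. If the failures turn out to be deterministic (a bad decoder class name, say), retrying only delays the ERROR state, so the answer to the question above decides whether this is worth doing.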


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

