Re: [PR] Fixes all race conditions from stream consumer [pinot]

via GitHub Wed, 29 Oct 2025 01:11:53 -0700


9aman commented on code in PR #17089:
URL: https://github.com/apache/pinot/pull/17089#discussion_r2471982638



##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -350,6 +378,18 @@ public List<SegmentContext> getSegmentContexts(List<IndexSegment> selectedSegmen
     return segmentContexts;
   }
 
+  public StreamMetadataProvider getStreamMetadataProvider(RealtimeSegmentDataManager realtimeSegmentDataManager) {
+    String tableStreamName = realtimeSegmentDataManager.getTableStreamName();
+    StreamConsumerFactory streamConsumerFactory = realtimeSegmentDataManager.getStreamConsumerFactory();
+    try {
+      return _streamMetadataProviderCache.get(tableStreamName,

Review Comment:
   Can we add a comment here noting that the stream metadata provider created 
here is synchronized and hence thread safe?
   This differs from the traditional metadata provider, so calls might block.
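   
   For illustration, a minimal sketch of why thread safety matters at this call 
site: one provider instance is cached per stream name and handed to every 
caller. `ProviderCacheSketch`, its nested `Provider`, and the 
`ConcurrentHashMap` used in place of the Guava cache are hypothetical 
stand-ins, not the PR's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for the cached StreamMetadataProvider lookup.
class ProviderCacheSketch {
  // Hypothetical provider type; the real one talks to the stream.
  static class Provider {
    final String _streamName;

    Provider(String streamName) {
      _streamName = streamName;
    }
  }

  // A ConcurrentHashMap replaces the Guava cache here (no TTL, no removal listener).
  private final Map<String, Provider> _cache = new ConcurrentHashMap<>();
  private final Function<String, Provider> _factory;

  ProviderCacheSketch(Function<String, Provider> factory) {
    _factory = factory;
  }

  // NOTE: The returned provider is shared by every caller asking for the same stream, so it must be
  // a synchronized (thread-safe) implementation. Calls may block while another caller is using it.
  Provider get(String tableStreamName) {
    return _cache.computeIfAbsent(tableStreamName, _factory);
  }
}
```

   The point of the sketch is only that two lookups with the same key return 
the same object, which is what makes the synchronized variant necessary.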



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-2.0/src/main/java/org/apache/pinot/plugin/stream/kafka20/SynchronizedKafkaStreamMetadataProvider.java:
##########
@@ -0,0 +1,39 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.stream.kafka20;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.spi.stream.StreamConfig;
+import org.apache.pinot.spi.stream.StreamPartitionMsgOffset;
+
+
+public class SynchronizedKafkaStreamMetadataProvider extends KafkaStreamMetadataProvider {

Review Comment:
   Please add Javadoc, ideally with a brief note on the rationale for adding 
this class and the scenarios in which it should be used.
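   
   As a rough sketch of the kind of Javadoc meant here, assuming the class 
simply serializes calls to its parent: `BaseProviderSketch`, 
`SynchronizedProviderSketch`, and `fetchPartitionCount()` are hypothetical 
names for illustration, and the wording would need to match what the real 
class actually overrides.

```java
// Hypothetical, non-thread-safe base class standing in for KafkaStreamMetadataProvider.
class BaseProviderSketch {
  private int _calls;

  // Unsynchronized read-modify-write: unsafe if called from several threads at once.
  int fetchPartitionCount() {
    _calls++;
    return _calls;
  }
}

/**
 * Thread-safe variant of {@link BaseProviderSketch}.
 *
 * <p>Rationale: the base provider is not safe for concurrent use. This subclass serializes every call
 * on the instance monitor so that a single instance can be shared.
 *
 * <p>Use it when one provider is shared across threads (for example, a cached instance). Callers may
 * block while another call is in flight, so prefer the base class when the instance is confined to
 * one thread.
 */
class SynchronizedProviderSketch extends BaseProviderSketch {
  @Override
  synchronized int fetchPartitionCount() {
    return super.fetchPartitionCount();
  }
}
```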



##########
pinot-server/src/main/java/org/apache/pinot/server/api/resources/DebugResource.java:
##########
@@ -209,17 +216,34 @@ private long getSegmentSize(SegmentDataManager segmentDataManager) {
         .getSegment()).getSegmentSizeBytes() : 0;
   }
 
-  private SegmentConsumerInfo getSegmentConsumerInfo(SegmentDataManager segmentDataManager, TableType tableType) {
+  private SegmentConsumerInfo getSegmentConsumerInfo(TableDataManager tableDataManager,
+      SegmentDataManager segmentDataManager, TableType tableType) {
     SegmentConsumerInfo segmentConsumerInfo = null;
     if (tableType == TableType.REALTIME) {
       RealtimeSegmentDataManager realtimeSegmentDataManager = (RealtimeSegmentDataManager) segmentDataManager;
-      Map<String, ConsumerPartitionState> partitionStateMap = realtimeSegmentDataManager.getConsumerPartitionState();
+      StreamMetadataProvider streamMetadataProvider =

Review Comment:
   @KKcorps this seems to be correctly using _streamPartitionId
   
   ```
       if (numStreams == 1) {
         // Single stream
         // NOTE: We skip partition id translation logic to handle cases where custom stream might return partition id
         // larger than 10000.
         _streamPartitionId = _partitionGroupId;
         _streamConfig = new StreamConfig(_tableNameWithType, streamConfigMaps.get(0));
       } else {
         // Multiple streams
         _streamPartitionId = IngestionConfigUtils.getStreamPartitionIdFromPinotPartitionId(_partitionGroupId);
         int index = IngestionConfigUtils.getStreamConfigIndexFromPinotPartitionId(_partitionGroupId);
         Preconditions.checkState(numStreams > index, "Cannot find stream config of index: %s for table: %s", index,
             _tableNameWithType);
         _streamConfig = new StreamConfig(_tableNameWithType, streamConfigMaps.get(index));
       }
       _streamConsumerFactory = StreamConsumerFactoryProvider.create(_streamConfig);
   ```
   
   The existing code also relies on _partitionGroupId, which is set based on 
the segment name, to fetch the latest offset.
   
   @noob-se7en please verify whether the existing code also has any concerns.
   @KKcorps I feel the new code behaves similarly to the previous code.
   
   Please correct me if I am wrong.
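   
   For reference, a rough sketch of the translation that the snippet above 
relies on. It assumes a padding of 10000 (the value named in the NOTE) and the 
encoding `pinotPartitionId = streamConfigIndex * 10000 + streamPartitionId`. 
`PartitionIdSketch` and its methods are hypothetical and are not the actual 
`IngestionConfigUtils` implementation.

```java
// Hypothetical sketch of the partition id translation; not the actual IngestionConfigUtils code.
final class PartitionIdSketch {
  // Assumed padding: pinotPartitionId = streamConfigIndex * 10000 + streamPartitionId.
  static final int PADDING = 10000;

  private PartitionIdSketch() {
  }

  static int streamPartitionId(int pinotPartitionId, int numStreams) {
    // Single stream: no translation, so ids of 10000 or more pass through unchanged.
    return numStreams == 1 ? pinotPartitionId : pinotPartitionId % PADDING;
  }

  static int streamConfigIndex(int pinotPartitionId, int numStreams) {
    // Single stream: there is only one stream config, at index 0.
    return numStreams == 1 ? 0 : pinotPartitionId / PADDING;
  }
}
```

   Under these assumptions, a partition group id of 20003 with three streams 
maps to stream partition 3 of stream config 2, while a single-stream table 
keeps an id such as 12345 untouched.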



##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java:
##########
@@ -1082,21 +1081,12 @@ public long getLastConsumedTimestamp() {
   /**
    * Returns the {@link ConsumerPartitionState} for the partition group.
    */
-  public Map<String, ConsumerPartitionState> getConsumerPartitionState() {
+  public Map<String, ConsumerPartitionState> getConsumerPartitionState(
+      @Nullable StreamPartitionMsgOffset latestMsgOffset) {
     String partitionGroupId = String.valueOf(_partitionGroupId);
-    return Collections.singletonMap(partitionGroupId, new ConsumerPartitionState(partitionGroupId, getCurrentOffset(),
-        getLastConsumedTimestamp(), fetchLatestStreamOffset(5_000), _lastRowMetadata));
-  }
-
-  /**
-   * Returns the {@link PartitionLagState} for the partition group.
-   */
-  public Map<String, PartitionLagState> getPartitionToLagState(

Review Comment:
   Oh, I see.
   So we have segregated the providers based on the access pattern, i.e. 
concurrent or non-concurrent access.
   
   We go with the normal provider in the case of RealtimeSegmentDataManager and 
the concurrent one otherwise?
   



##########
pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/SynchronizedKafkaStreamMetadataProvider.java:
##########
@@ -0,0 +1,39 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.stream.kafka30;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.pinot.spi.stream.StreamConfig;
+import org.apache.pinot.spi.stream.StreamPartitionMsgOffset;
+
+
+public class SynchronizedKafkaStreamMetadataProvider extends KafkaStreamMetadataProvider {

Review Comment:
   Same as above: please add Javadoc covering the rationale and intended usage.



##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -248,6 +259,23 @@ public boolean getAsBoolean() {
     }
   }
 
+  @VisibleForTesting
+  protected Cache<String, StreamMetadataProvider> getStreamMetadataProviderCache() {
+    return CacheBuilder.newBuilder()
+        .expireAfterAccess(STREAM_METADATA_PROVIDER_CACHE_TTL)
+        .removalListener((RemovalNotification<String, StreamMetadataProvider> notification) -> {
+          StreamMetadataProvider provider = notification.getValue();

Review Comment:
   Is there a way, similar to RealtimeSegmentDataManager, to invalidate the 
cache if we run into multiple transient errors?
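   
   One possible shape for this, as a hedged sketch only: count consecutive 
failures per key and drop the cached provider once a threshold is reached. 
`InvalidatingCacheSketch`, `Provider`, `onSuccess`, `onError`, and the 
threshold of 3 are all hypothetical, and a `ConcurrentHashMap` stands in for 
the Guava cache so the example is self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: drop a cached provider after repeated consecutive errors.
class InvalidatingCacheSketch {
  // Hypothetical provider; close() stands in for releasing the underlying client.
  static class Provider {
    volatile boolean _closed;

    void close() {
      _closed = true;
    }
  }

  // Assumed threshold; the PR does not specify one.
  static final int MAX_CONSECUTIVE_ERRORS = 3;

  private final Map<String, Provider> _cache = new ConcurrentHashMap<>();
  private final Map<String, AtomicInteger> _errors = new ConcurrentHashMap<>();

  Provider get(String key) {
    return _cache.computeIfAbsent(key, k -> new Provider());
  }

  // Call after a successful fetch to reset the error streak.
  void onSuccess(String key) {
    _errors.remove(key);
  }

  // Call after a failed fetch; invalidates and closes the entry once the streak hits the threshold.
  void onError(String key) {
    int count = _errors.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    if (count >= MAX_CONSECUTIVE_ERRORS) {
      _errors.remove(key);
      Provider stale = _cache.remove(key);
      if (stale != null) {
        stale.close();
      }
    }
  }
}
```

   With a Guava cache, the `_cache.remove(key)` step would correspond to 
`invalidate(key)`, which also fires the removal listener, so the close logic 
could stay in the listener rather than in `onError`.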



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


