stevenzwu commented on code in PR #10208: URL: https://github.com/apache/iceberg/pull/10208#discussion_r1576552272
##########
flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:
##########
@@ -201,8 +201,11 @@ private SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> createEnumer
       return new ContinuousIcebergEnumerator(
           enumContext, assigner, scanContext, splitPlanner, enumState);
     } else {
-      List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
-      assigner.onDiscoveredSplits(splits);
+      if (enumState == null) {

Review Comment:
   Thanks for fixing the bug. Maybe add a comment like `Only do scan planning if nothing is restored from checkpoint state`.

##########
flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##########
@@ -149,6 +163,15 @@ private void testBoundedIcebergSource(FailoverType failoverType) throws Exceptio
     JobClient jobClient = env.executeAsync("Bounded Iceberg Source Failover Test");
     JobID jobId = jobClient.getJobID();

+    Path jobCheckpointDir = checkpointDir.resolve(jobId.toString());
+    Awaitility.await("Wait for some checkpoints to complete")

Review Comment:
   Checking whether the directory exists might be flaky: what if some subtask has checkpointed its state and uploaded its state file to the checkpoint dir, but the whole checkpoint hasn't completed yet? I can think of a couple of potential alternatives:

   1. Manually trigger a checkpoint via `MiniCluster`:
      ```
      public CompletableFuture<String> triggerCheckpoint(JobID jobID) {
        return runDispatcherCommand(
            dispatcherGateway -> dispatcherGateway.triggerCheckpoint(jobID, rpcTimeout));
      }
      ```
   2. Check metrics, as `TestIcebergSourceContinuous#assertThatIcebergSourceMetricExists()` does.

##########
flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##########
@@ -40,24 +44,27 @@
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
+import org.apache.iceberg.TableProperties;
 import org.apache.iceberg.data.GenericAppenderHelper;
 import org.apache.iceberg.data.RandomGenericData;
 import org.apache.iceberg.data.Record;
 import org.apache.iceberg.flink.FlinkConfigOptions;
+import org.apache.iceberg.flink.FlinkReadOptions;
 import org.apache.iceberg.flink.HadoopTableResource;
 import org.apache.iceberg.flink.SimpleDataUtil;
 import org.apache.iceberg.flink.TestFixtures;
 import org.apache.iceberg.flink.sink.FlinkSink;
 import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.awaitility.Awaitility;
 import org.junit.ClassRule;
 import org.junit.Rule;
 import org.junit.Test;
 import org.junit.rules.TemporaryFolder;

 public class TestIcebergSourceFailover {
-  private static final int PARALLELISM = 4;
+  private static final int PARALLELISM = 2;

Review Comment:
   What's the reason for reducing the parallelism?
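
A minimal sketch of the change the first comment suggests, applied to the batch branch shown in the `IcebergSource.java` hunk above; the surrounding lines come from that diff context rather than the full file, so treat it as illustrative only:

```java
// Batch (bounded) branch of enumerator creation, as shown in the hunk above.
if (enumState == null) {
  // Only do scan planning if nothing is restored from checkpoint state
  List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
  assigner.onDiscoveredSplits(splits);
}
```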
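
And a rough sketch of alternative 1 from the checkpoint-flakiness comment, assuming the test can reach its `MiniClusterWithClientResource` (named `MINI_CLUSTER_RESOURCE` here purely for illustration) and runs against a Flink version whose `MiniCluster` exposes the `triggerCheckpoint(JobID)` method quoted above:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.minicluster.MiniCluster;

// Hypothetical test helper: explicitly trigger a checkpoint and block until it finishes,
// instead of polling the checkpoint directory for files.
private static String triggerCheckpointAndWait(MiniCluster miniCluster, JobID jobId)
    throws Exception {
  // MiniCluster#triggerCheckpoint returns a future that completes with the path of the
  // finished checkpoint, so a successful get() means the whole checkpoint completed,
  // not just a single subtask's state upload.
  return miniCluster.triggerCheckpoint(jobId).get(30, TimeUnit.SECONDS);
}
```

In `testBoundedIcebergSource`, the `Awaitility` wait on `jobCheckpointDir` could then be replaced with something like `triggerCheckpointAndWait(MINI_CLUSTER_RESOURCE.getMiniCluster(), jobId)` before the failover is injected.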