stevenzwu commented on code in PR #10208: URL: https://github.com/apache/iceberg/pull/10208#discussion_r1576552272
##########
flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:
##########
@@ -201,8 +201,11 @@ private SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> createEnumer
       return new ContinuousIcebergEnumerator(
           enumContext, assigner, scanContext, splitPlanner, enumState);
     } else {
-      List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
-      assigner.onDiscoveredSplits(splits);
+      if (enumState == null) {

Review Comment:
   Thanks for fixing the bug. Maybe add a comment like `Only do scan planning if nothing is restored from checkpoint state`.

##########
flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##########
@@ -149,6 +163,15 @@ private void testBoundedIcebergSource(FailoverType failoverType) throws Exceptio
     JobClient jobClient = env.executeAsync("Bounded Iceberg Source Failover Test");
     JobID jobId = jobClient.getJobID();

+    Path jobCheckpointDir = checkpointDir.resolve(jobId.toString());
+    Awaitility.await("Wait for some checkpoints to complete")

Review Comment:
   Checking whether the directory exists might be flaky: what if some subtask has checkpointed its state and uploaded its state file to the checkpoint dir, but the whole checkpoint hasn't completed yet? I can think of a couple of potential alternatives:

   1. Manually trigger a checkpoint via `MiniCluster`:
      ```
      public CompletableFuture<String> triggerCheckpoint(JobID jobID) {
        return runDispatcherCommand(
            dispatcherGateway -> dispatcherGateway.triggerCheckpoint(jobID, rpcTimeout));
      }
      ```
   2. Check metrics, as `TestIcebergSourceContinuous#assertThatIcebergSourceMetricExists()` does.

##########
flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##########
@@ -40,24 +44,27 @@
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
+import org.apache.iceberg.TableProperties;
 import org.apache.iceberg.data.GenericAppenderHelper;
 import org.apache.iceberg.data.RandomGenericData;
 import org.apache.iceberg.data.Record;
 import org.apache.iceberg.flink.FlinkConfigOptions;
+import org.apache.iceberg.flink.FlinkReadOptions;
 import org.apache.iceberg.flink.HadoopTableResource;
 import org.apache.iceberg.flink.SimpleDataUtil;
 import org.apache.iceberg.flink.TestFixtures;
 import org.apache.iceberg.flink.sink.FlinkSink;
 import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.awaitility.Awaitility;
 import org.junit.ClassRule;
 import org.junit.Rule;
 import org.junit.Test;
 import org.junit.rules.TemporaryFolder;

 public class TestIcebergSourceFailover {
-  private static final int PARALLELISM = 4;
+  private static final int PARALLELISM = 2;

Review Comment:
   What's the reason for reducing the parallelism?
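
A minimal sketch of the change the first comment suggests, applied to the batch branch shown in the `IcebergSource.java` hunk above; the surrounding lines come from that diff context rather than the full file, so treat it as illustrative only:

```java
// Batch (bounded) branch of enumerator creation, as shown in the hunk above.
if (enumState == null) {
  // Only do scan planning if nothing is restored from checkpoint state
  List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
  assigner.onDiscoveredSplits(splits);
}
```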
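
And a rough sketch of alternative 1 from the checkpoint-flakiness comment, assuming the test can reach its `MiniClusterWithClientResource` (named `MINI_CLUSTER_RESOURCE` here purely for illustration) and runs against a Flink version whose `MiniCluster` exposes the `triggerCheckpoint(JobID)` method quoted above:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.minicluster.MiniCluster;

// Hypothetical test helper: explicitly trigger a checkpoint and block until it finishes,
// instead of polling the checkpoint directory for files.
private static String triggerCheckpointAndWait(MiniCluster miniCluster, JobID jobId)
    throws Exception {
  // MiniCluster#triggerCheckpoint returns a future that completes with the path of the
  // finished checkpoint, so a successful get() means the whole checkpoint completed,
  // not just a single subtask's state upload.
  return miniCluster.triggerCheckpoint(jobId).get(30, TimeUnit.SECONDS);
}
```

In `testBoundedIcebergSource`, the `Awaitility` wait on `jobCheckpointDir` could then be replaced with something like `triggerCheckpointAndWait(MINI_CLUSTER_RESOURCE.getMiniCluster(), jobId)` before the failover is injected.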