Re: [PR] Flink: Add IcebergSinkBuilder interface allowed unification of most of operations on FlinkSink and IcebergSink Builders [iceberg]

via GitHub Tue, 15 Oct 2024 14:35:36 -0700


stevenzwu commented on PR #11305:
URL: https://github.com/apache/iceberg/pull/11305#issuecomment-2415207097


   @pvary looks like the simple count based assertion is flaky. 
   
   ```
         for (Snapshot snapshot : rangePartitionedCycles) {
           List<DataFile> addedDataFiles =
               
Lists.newArrayList(snapshot.addedDataFiles(table.io()).iterator());
           assertThat(addedDataFiles)
               
.hasSizeLessThanOrEqualTo(maxAddedDataFilesPerCheckpoint(parallelism));
         }
       }
     }
   
     /**
      * Traffic is not perfectly balanced across all buckets in the small 
sample size Range
      * distribution of the bucket id may cross subtask boundary. Hence the 
number of committed data
      * files per checkpoint maybe larger than writer parallelism or the number 
of buckets. But it
      * should not be more than the sum of those two. Without range 
distribution, the number of data
      * files per commit can be 4x of parallelism (as the number of buckets is 
4).
      */
     private int maxAddedDataFilesPerCheckpoint(int parallelism) {
       return NUM_BUCKETS + parallelism;
     }
   ```
   
   from the exception, it seems that there were 2 data files for the same 
bucket from the same subtask, which normally shouldn't happen. Not sure why it 
could happen. Note that file name is in the format of 
`<subtaskId>-<attempId>-<uuid>-<fileCounter>`, e.g. 
`00002-0-fdca1895-4698-4c56-8b40-2dbcec41352e-00018`. I think we need to change 
the assertion to more sophisticated one - check range doesn't overlap.
   ```
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=3/00002-0-fdca1895-4698-4c56-8b40-2dbcec41352e-00018.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=3}, record_count=15, file_size_in_bytes=1295, 
column_sizes=org.apache.iceberg.util.SerializableMap@1d0, 
value_counts=org.apache.iceberg.util.SerializableMap@27, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@11e0aa58, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@de87e38d, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=3/00002-0-fdca1895-4698-4c56-8b40-2dbcec41352e-00019.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=3}, record_count=2, file_size_in_bytes=1034, 
column_sizes=org.apache.iceberg.util.SerializableMap@cb, 
value_counts=org.apache.iceberg.util.SerializableMap@4, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@2ad8d1c4, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@b07e8329, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7}]
         [GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=2/00002-0-fdca1895-4698-4c56-8b40-2dbcec41352e-00017.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=2}, record_count=6, file_size_in_bytes=1119, 
column_sizes=org.apache.iceberg.util.SerializableMap@11a, 
value_counts=org.apache.iceberg.util.SerializableMap@10, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@f3c0445f, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@3fd2d32a, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
   
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=2/00001-0-c7968b11-96c1-4bbb-a63f-e56d47feeef3-00017.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=2}, record_count=10, file_size_in_bytes=1199, 
column_sizes=org.apache.iceberg.util.SerializableMap@170, 
value_counts=org.apache.iceberg.util.SerializableMap@1c, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@bf77a452, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@8c8e0666, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=2/00001-0-c7968b11-96c1-4bbb-a63f-e56d47feeef3-00019.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=2}, record_count=1, file_size_in_bytes=1021, 
column_sizes=org.apache.iceberg.util.SerializableMap@97, 
value_counts=org.apache.iceberg.util.SerializableMap@5, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@731c7f34, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@731c7f34, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=1/00001-0-c7968b11-96c1-4bbb-a63f-e56d47feeef3-00018.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=1}, record_count=8, file_size_in_bytes=1159, 
column_sizes=org.apache.iceberg.util.SerializableMap@148, 
value_counts=org.apache.iceberg.util.SerializableMap@1e, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@7ec5775a, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@1515b535, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
   
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=1/00000-0-52445ce7-8fd1-4367-9433-5e8bc792886d-00017.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=1}, record_count=17, file_size_in_bytes=1336, 
column_sizes=org.apache.iceberg.util.SerializableMap@1f9, 
value_counts=org.apache.iceberg.util.SerializableMap@35, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@6739cb65, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@19de48bf, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
           GenericDataFile{content=data, 
file_path=file:/tmp/junit5_hadoop_catalog-5157340844762714796/3664f021-0e50-4a7a-8048-93cdf3c3e236/default/t/data/ts_hour=2024-10-14-16/uuid_bucket=0/00000-0-52445ce7-8fd1-4367-9433-5e8bc792886d-00018.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{ts_hour=480256, 
uuid_bucket=0}, record_count=18, file_size_in_bytes=1353, 
column_sizes=org.apache.iceberg.util.SerializableMap@20e, 
value_counts=org.apache.iceberg.util.SerializableMap@34, 
null_value_counts=org.apache.iceberg.util.SerializableMap@6, 
nan_value_counts=org.apache.iceberg.util.SerializableMap@0, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@95e6e88b, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@80a19270, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0, 
data_sequence_number=7, file_sequence_number=7},
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Flink: Add IcebergSinkBuilder interface allowed unification of most of operations on FlinkSink and IcebergSink Builders [iceberg]

Reply via email to