bk-mz opened a new issue, #9923:
URL: https://github.com/apache/iceberg/issues/9923
### Apache Iceberg version

1.4.3 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

When calling the maintenance procedure `rewrite_position_delete_files`:

```
CALL glue.system.rewrite_position_delete_files(
  table => 'table',
  where => "data_load_ts between TIMESTAMP '2024-02-07 13:51:58.729' and TIMESTAMP '2024-03-08 12:51:58.729'",
  options => map(
    'partial-progress.enabled', 'true',
    'min-file-size-bytes', '26843545',
    'max-file-size-bytes', '134217728',
    'min-input-files', '5',
    'max-concurrent-file-group-rewrites', '50'))
```

we observe the following exception:

```
java.lang.IllegalArgumentException: Multiple entries with same key: 1000=row.struct.substruct.substruct_field and 1000=partition.data_load_ts_hour
	at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:378) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:372) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:246) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:133) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:572) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:600) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:587) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.types.IndexByName.byId(IndexByName.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.types.TypeUtil.indexNameById(TypeUtil.java:172) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.Schema.lazyIdToName(Schema.java:183) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.Schema.<init>(Schema.java:112) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.Schema.<init>(Schema.java:91) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.Schema.<init>(Schema.java:87) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.Schema.<init>(Schema.java:160) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.PositionDeletesTable.calculateSchema(PositionDeletesTable.java:129) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.PositionDeletesTable.<init>(PositionDeletesTable.java:62) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
	at org.apache.iceberg.MetadataTableUtils.createMetadataTableInstance(MetadataTableUtils.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
```
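For context on the failure mode: Guava's `ImmutableMap.Builder` rejects duplicate keys at build time, which is what the trace shows once two different column names resolve to the same field id. A minimal standalone sketch of that behavior (plain Guava, not Iceberg code; the key and names simply mirror the exception message above):

```java
import com.google.common.collect.ImmutableMap;

public class DuplicateKeyDemo {
  public static void main(String[] args) {
    ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
    // Two different field names map to the same field id 1000, mirroring the
    // inverted name -> id map that IndexByName.byId() builds.
    builder.put(1000, "row.struct.substruct.substruct_field");
    builder.put(1000, "partition.data_load_ts_hour");
    // Throws IllegalArgumentException: Multiple entries with same key: ...
    builder.buildOrThrow();
  }
}
```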
## Steps to reproduce

The problem is reproducible locally:

```
import org.apache.iceberg._
import org.apache.iceberg.aws.glue._
import org.apache.iceberg.catalog._
import org.apache.iceberg.types._

val glue = new GlueCatalog();
glue.initialize("glue", new java.util.HashMap());

val table = glue.loadTable(TableIdentifier.parse("db.table"));
val partitionType = Partitioning.partitionType(table);

val result = new Schema(
  Types.NestedField.optional(
    MetadataColumns.DELETE_FILE_ROW_FIELD_ID,
    MetadataColumns.DELETE_FILE_ROW_FIELD_NAME,
    table.schema().asStruct(),
    MetadataColumns.DELETE_FILE_ROW_DOC),
  Types.NestedField.required(
    MetadataColumns.PARTITION_COLUMN_ID,
    PositionDeletesTable.PARTITION,
    Partitioning.partitionType(table),
    "Partition that position delete row belongs to"));
```

## Triage

Triaging the bug, we see that running the `partitionType` function on the table creates a struct whose single field has id 1000:

```
scala> val partitionType = Partitioning.partitionType(table);
partitionType: org.apache.iceberg.types.Types.StructType = struct<1000: data_load_ts_hour: optional int>
```

Further down, when the schema for `PositionDeletesTable` is created, the exception is thrown while evaluating:

```
this.highestFieldId = lazyIdToName().keySet().stream().mapToInt(i -> i).max().orElse(0);
```

```
public Map<Integer, String> byId() {
  ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
  nameToId.forEach((key, value) -> builder.put(value, key)); // <-- duplicate ids make building the immutable map fail
  return builder.build();
}
```

Triaging further: `PositionDeletesTable` builds a new schema by joining the existing table schema with the partition spec rendered as a struct type. `PartitionSpec` always starts assigning field ids for that struct at the constant 1000:

```java
private int nextFieldId() {
  return lastAssignedFieldId.incrementAndGet();
}

// where
private final AtomicInteger lastAssignedFieldId =
    new AtomicInteger(unpartitionedLastAssignedId());

// and
private static int unpartitionedLastAssignedId() {
  return PARTITION_DATA_ID_START - 1;
}
```

In our case, the `hour` partitioning invokes `nextFieldId()`:

```java
PartitionField field =
    new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.hour());
```

So when a table schema that already contains more than 1,000 field ids is joined with a partition spec whose first field id is 1000, adding the duplicate id to the immutable map fails.

## Possible solutions

Change the `Partitioning.partitionType(table)` method so that it always returns structs whose field ids start after the schema's highest field id:

```java
public static StructType partitionType(Table table) {
  Collection<PartitionSpec> specs = table.specs().values();
  int highestFieldId = table.schema().highestFieldId();
  List<NestedField> sortedStructFields =
      buildPartitionNestedFields("table partition", specs, allFieldIds(specs));
  return StructType.of(
      sortedStructFields.stream()
          .map(f -> NestedField.optional(highestFieldId + f.fieldId(), f.name(), f.type()))
          .collect(Collectors.toList()));
}
```
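As a self-contained check of the diagnosis above (and of the collision the proposed change would avoid), the failure can be reproduced without a Glue table. This is a sketch with illustrative ids and names: the top-level ids 2 and 3 stand in for the metadata column ids, and the nested id 1000 assumes a table schema that has already assigned 1,000+ field ids:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.Types.NestedField;

public class PartitionIdCollisionSketch {
  public static void main(String[] args) {
    // Stand-in for a wide table schema that has already assigned field id 1000.
    Types.StructType rowStruct = Types.StructType.of(
        NestedField.optional(1000, "substruct_field", Types.StringType.get()));

    // Stand-in for Partitioning.partitionType(table): the first partition field
    // always gets id PARTITION_DATA_ID_START = 1000.
    Types.StructType partitionStruct = Types.StructType.of(
        NestedField.optional(1000, "data_load_ts_hour", Types.IntegerType.get()));

    // Indexing the combined schema hits the duplicate field id and throws
    // IllegalArgumentException: Multiple entries with same key: ...
    new Schema(
        NestedField.optional(2, "row", rowStruct, "deleted row"),
        NestedField.required(3, "partition", partitionStruct,
            "Partition that position delete row belongs to"));
  }
}
```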