amogh-jahagirdar commented on code in PR #13310:
URL: https://github.com/apache/iceberg/pull/13310#discussion_r2197992704
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -260,18 +260,52 @@ public MetadataColumn[] metadataColumns() {
     DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
     ImmutableList.Builder<SparkMetadataColumn> metadataColumns = ImmutableList.builder();
     metadataColumns.add(
-        new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),
-        new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
-        new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
-        new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false),
-        new SparkMetadataColumn(MetadataColumns.IS_DELETED.name(), DataTypes.BooleanType, false));
-
-    if (TableUtil.formatVersion(table()) >= 3) {
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.SPEC_ID.name())
+            .dataType(DataTypes.IntegerType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.PARTITION_COLUMN_NAME)
+            .dataType(sparkPartitionType)
+            .withNullability(true)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.FILE_PATH.name())
+            .dataType(DataTypes.StringType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.ROW_POSITION.name())
+            .dataType(DataTypes.LongType)
+            .withNullability(false)
+            .build(),
+        SparkMetadataColumn.builder()
+            .name(MetadataColumns.IS_DELETED.name())
+            .dataType(DataTypes.BooleanType)
+            .withNullability(false)
+            .build());
+
+    if (TableUtil.supportsRowLineage(icebergTable)) {
       metadataColumns.add(
-          new SparkMetadataColumn(MetadataColumns.ROW_ID.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.ROW_ID.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(true)
+              .preserveOnUpdate(true)
+              .preserveOnDelete(false)
+              .build());
+
       metadataColumns.add(
-          new SparkMetadataColumn(
-              MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name(), DataTypes.LongType, true));
+          SparkMetadataColumn.builder()
+              .name(MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.name())
+              .dataType(DataTypes.LongType)
+              .withNullability(true)
+              .preserveOnReinsert(false)

Review Comment:
   Yes, for compaction or any other kind of rewrite we do need to preserve both row IDs and sequence numbers, but that is a different code path that we'd need to handle separately. The options for preserving metadata column values on reinsert/update/delete were added to DSV2 specifically for DML operations like MERGE/UPDATE/DELETE.

   For compaction we'll need to project the row lineage fields when reading the files (https://github.com/amogh-jahagirdar/iceberg/blob/master/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackFileRewriteRunner.java#L44, and in any other rewriting logic) and make sure they are persisted on write. The tricky part is the write side: we'll need a way to bypass any analyzer rules that try to check alignment between the table schema (which doesn't include the lineage fields) and the data frame's write schema. I'm still looking through options here: do we need some kind of custom rule that, based on the presence of the "REWRITTEN_FILE_SCAN_TASK_SET_ID" option, hijacks the DSV2 relation with lineage columns like we do in the 3.4/3.5 DML rules, or can we somehow conditionally leverage https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCapability.java#L94?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
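As a minimal sketch of the read-side idea described in the comment (projecting lineage fields only for rewrite reads), the helper below is purely hypothetical and not part of Iceberg or Spark: the class name `LineageProjection` and the option key `"rewritten-file-scan-task-set-id"` are assumptions for illustration (the key mirrors the `REWRITTEN_FILE_SCAN_TASK_SET_ID` constant mentioned above); the column names follow Iceberg's reserved metadata column names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not Iceberg's actual API): decides which extra
// metadata columns a rewrite/compaction read should project so that
// row lineage values survive the rewrite and can be persisted on write.
final class LineageProjection {
  // Iceberg's reserved metadata column names for row lineage.
  static final String ROW_ID = "_row_id";
  static final String LAST_UPDATED_SEQ = "_last_updated_sequence_number";

  // Assumed option key, mirroring the REWRITTEN_FILE_SCAN_TASK_SET_ID
  // constant referenced in the review comment.
  static final String REWRITE_TASK_SET_OPTION = "rewritten-file-scan-task-set-id";

  // Returns the lineage columns to add to the read projection, or an empty
  // list when this is not a rewrite read or the table has no row lineage.
  static List<String> extraColumns(Map<String, String> readOptions, boolean tableSupportsRowLineage) {
    List<String> extra = new ArrayList<>();
    if (readOptions.containsKey(REWRITE_TASK_SET_OPTION) && tableSupportsRowLineage) {
      extra.add(ROW_ID);
      extra.add(LAST_UPDATED_SEQ);
    }
    return extra;
  }
}
```

Gating on the rewrite option rather than projecting lineage unconditionally keeps normal reads unchanged; the unresolved part, as noted above, is getting these extra columns past the write-side schema-alignment checks.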
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org