Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Sun, 16 Nov 2025 21:13:12 -0800


Guosmilesmile commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3540014818


   > > Thanks for the PR! I have a question about lineage: if the merging 
is only performed at the Parquet layer, will the lineage information of the v3 
table be disrupted?
   > 
   > Good question! The lineage information for v3 tables is preserved in two 
ways:
   > 
   > 1. Field IDs (Schema Lineage)
   > 
   > Field IDs are preserved because we strictly enforce identical schemas 
across all files being merged.
   > 
   > In ParquetFileMerger.java:130-136, we validate that all input files have 
exactly the same Parquet MessageType schema:
   > 
   > if (!schema.equals(currentSchema)) {
   >   throw new IllegalArgumentException(
   >       String.format(
   >           "Schema mismatch detected: file '%s' has schema %s but file '%s' has schema %s. "
   >               + "All files must have identical Parquet schemas for row-group level merging.",
   >           ...));
   > }
   > 
   > Field IDs are stored directly in the Parquet schema structure itself (via 
Type.getId()), so when we copy row groups using ParquetFileWriter.appendFile() 
with the validated schema, all field IDs are preserved.
   > 
   > 2. Row IDs (Row Lineage for v3+)
   > 
   > Row IDs are automatically assigned by Iceberg's commit framework - we 
don't need special handling in the merger.
   > 
   > Here's how it works:
   > 
   > 1. Our code creates DataFile objects with metrics (including recordCount) 
but without firstRowId - see SparkParquetFileMergeRunner.java:236-243
   > 2. During commit, SnapshotProducer creates a ManifestListWriter 
initialized with base.nextRowId() (the table's current row ID counter) - see 
SnapshotProducer.java:273
   > 3. ManifestListWriter.prepare() automatically assigns firstRowId to each 
manifest and increments the counter by the number of rows - see 
ManifestListWriter.java:136-140:
   >    // assign first-row-id and update the next to assign
   >    wrapper.wrap(manifest, nextRowId);
   >    this.nextRowId += manifest.existingRowsCount() + manifest.addedRowsCount();
   > 4. The snapshot is committed with the updated nextRowId, ensuring all row 
IDs are correctly tracked
   > 
   > This is the same mechanism used by all Iceberg write operations, so row 
lineage is fully preserved for v3 tables.
   
   If we keep rewriting data files that span three snapshots, and none of the 
data files contain row IDs, then as I understand it the rewritten output will 
be a single data file whose rows may not be arranged in commit order, for 
example snapshot-1, snapshot-3, snapshot-2. If the original row IDs are not 
written into the data file, the lineage tracking will be incorrect. 
Additionally, if rewriteDataFiles is run concurrently on different groups 
without writing the row IDs, that will also cause problems.
   
   This is my current concern about whether lineage can be preserved when 
merging only at the Parquet level.
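
   To make the concern concrete, here is a minimal sketch. It is not Iceberg 
code: it assumes that, when a file stores no row IDs, a reader derives each 
row's ID as the file's firstRowId plus the row's position, which is my 
understanding of v3 row lineage. FileStub, the two-row files, and the new 
firstRowId of 6 are hypothetical values chosen for illustration.

```java
import java.util.Arrays;

class RowLineageSketch {
  // Hypothetical stand-in for a data file that stores no row ID column.
  static final class FileStub {
    final String name;
    final long firstRowId;
    final int recordCount;

    FileStub(String name, long firstRowId, int recordCount) {
      this.name = name;
      this.firstRowId = firstRowId;
      this.recordCount = recordCount;
    }
  }

  // Assumption: with no stored row IDs, a row's ID is firstRowId + position.
  static long[] inheritedRowIds(FileStub file) {
    long[] ids = new long[file.recordCount];
    for (int pos = 0; pos < ids.length; pos++) {
      ids[pos] = file.firstRowId + pos;
    }
    return ids;
  }

  // IDs the rows had before the merge, in the physical order of the merged file.
  static long[] idsBeforeMerge(FileStub... inputsInMergeOrder) {
    return Arrays.stream(inputsInMergeOrder)
        .flatMapToLong(f -> Arrays.stream(inheritedRowIds(f)))
        .toArray();
  }

  // IDs after the merge if the output gets a fresh firstRowId and stores no row IDs.
  static long[] idsAfterMerge(long newFirstRowId, FileStub... inputsInMergeOrder) {
    int total = Arrays.stream(inputsInMergeOrder).mapToInt(f -> f.recordCount).sum();
    return inheritedRowIds(new FileStub("merged", newFirstRowId, total));
  }

  public static void main(String[] args) {
    FileStub s1 = new FileStub("snapshot-1", 0, 2);
    FileStub s2 = new FileStub("snapshot-2", 2, 2);
    FileStub s3 = new FileStub("snapshot-3", 4, 2);

    // Row groups copied in the order snapshot-1, snapshot-3, snapshot-2.
    System.out.println("before: " + Arrays.toString(idsBeforeMerge(s1, s3, s2)));
    // prints "before: [0, 1, 4, 5, 2, 3]"
    System.out.println("after:  " + Arrays.toString(idsAfterMerge(6, s1, s3, s2)));
    // prints "after:  [6, 7, 8, 9, 10, 11]"
  }
}
```

   Under these assumptions every row ends up with a different ID after the 
merge, and the old IDs cannot be recovered from position alone because the 
rows are no longer in commit order.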


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

