Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Tue, 25 Nov 2025 22:15:18 -0800


shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3579417286


     1. ParquetFileMerger.java
   
     - Renamed readAndValidateSchema() → canMerge(), which returns boolean
       instead of MessageType
     - Added row_id null validation directly to the canMerge() method
     - Added validateRowIdColumnHasNoNulls() method to check that physical
       _row_id columns have no null values
     - Widened the catch clause from IllegalArgumentException | IOException
       to RuntimeException | IOException so that ParquetCryptoRuntimeException
       is also caught
     - Updated Javadoc examples and documentation
   
     2. ParquetUtil.java 
   
     - Added public constants COLUMN_INDEX_TRUNCATE_LENGTH and
       DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH (moved from ParquetWriter)
   
     3. ParquetWriter.java 
   
     - Removed constants (moved to ParquetUtil)
     - Updated to reference ParquetUtil constants
   
     4. TestParquetFileMerger.java 
   
     - Created comprehensive unit tests for ParquetFileMerger
     - Tests for canMerge() with various scenarios (empty lists, non-Parquet
       files, schema mismatches, etc.)
   
     5. SparkParquetFileMergeRunner.java 
   
     - Renamed validateAndGetSchema() → canMerge() returning boolean
     - Removed ValidationResult dependency
     - Simplified to use canMerge() for validation instead of returning
       schema/metadata
     - Schema and metadata are now read on the executor from the input files
       (reducing serialization)
   
     6. TestSparkParquetFileMergeRunner.java 
   
     - Updated tests to use canMerge() instead of validateAndGetSchema()
     - Removed reflection-based testing code
     - Simplified test assertions to work with boolean return values
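
For reviewers skimming the list above, here is a rough sketch of the boolean-returning validation shape that items 1 and 5 describe. It is not the code in this PR: `CanMergeSketch`, `SchemaReader`, and the string-based schema comparison are hypothetical stand-ins, and returning false for an empty list is an assumption. The only part taken from the change list is the idea of a `canMerge()` that returns a boolean and catches `RuntimeException | IOException` rather than `IllegalArgumentException | IOException`.

```java
import java.io.IOException;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: class, interface, and method bodies are illustrative only.
class CanMergeSketch {

  /** Hypothetical stand-in for reading a schema from a file footer. */
  interface SchemaReader {
    String readSchema(String path) throws IOException;
  }

  /**
   * Returns true only if every file is readable and all schemas are equal.
   * Any failure is reported as "cannot merge" rather than thrown to the caller.
   */
  static boolean canMerge(List<String> paths, SchemaReader reader) {
    // Assumption: an empty or missing input list is treated as not mergeable.
    if (paths == null || paths.isEmpty()) {
      return false;
    }

    try {
      String first = null;
      for (String path : paths) {
        String schema = Objects.requireNonNull(reader.readSchema(path), "schema");
        if (first == null) {
          first = schema; // the first file defines the expected schema
        } else if (!first.equals(schema)) {
          return false; // schema mismatch
        }
      }
      return true;
    } catch (RuntimeException | IOException e) {
      // The broader RuntimeException also covers unchecked failures that are
      // not IllegalArgumentException, such as a decryption error.
      return false;
    }
  }
}
```

With this shape the caller only branches on the boolean and falls back to the regular rewrite path when it is false. The schema and metadata are then read where the merge runs, as noted in item 5.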
   
   


