shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3579417286
1. ParquetFileMerger.java
- Renamed readAndValidateSchema() → canMerge(); it now returns a boolean
instead of a MessageType
- Added row_id null validation directly to canMerge()
- Added validateRowIdColumnHasNoNulls() to check that physical _row_id
columns contain no null values
- Widened the catch clause from IllegalArgumentException | IOException
to RuntimeException | IOException so that ParquetCryptoRuntimeException
is also caught
- Updated Javadoc examples and documentation
2. ParquetUtil.java
- Added public constants COLUMN_INDEX_TRUNCATE_LENGTH and
DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH (moved from ParquetWriter)
3. ParquetWriter.java
- Removed constants (moved to ParquetUtil)
- Updated to reference ParquetUtil constants
4. TestParquetFileMerger.java
- Created comprehensive unit tests for ParquetFileMerger
- Added tests for canMerge() covering various scenarios (empty lists,
non-Parquet files, schema mismatches, etc.)
5. SparkParquetFileMergeRunner.java
- Renamed validateAndGetSchema() → canMerge() returning boolean
- Removed ValidationResult dependency
- Simplified to use canMerge() for validation instead of returning
schema/metadata
- Schema and metadata are now read on the executor from the input files
(reducing serialization)
6. TestSparkParquetFileMergeRunner.java
- Updated tests to use canMerge() instead of validateAndGetSchema()
- Removed reflection-based testing code
- Simplified test assertions to work with boolean return values
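
For illustration, the boolean canMerge() contract and the widened catch clause from item 1 could look roughly like the sketch below. This is not the PR's code: CanMergeSketch, FooterInfo, and FooterReader are hypothetical stand-ins for the Parquet footer-reading logic, the schema is modeled as a plain String rather than a MessageType, and the null check is an assumed simplification of what validateRowIdColumnHasNoNulls() does.

```java
import java.io.IOException;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch; none of these types come from the PR itself. */
final class CanMergeSketch {

  /** Stand-in for the footer information read from one Parquet file. */
  static final class FooterInfo {
    final String schema;          // stand-in for a Parquet MessageType
    final boolean hasRowIdColumn; // whether a physical _row_id column exists
    final long rowIdNullCount;    // null count reported for that column

    FooterInfo(String schema, boolean hasRowIdColumn, long rowIdNullCount) {
      this.schema = Objects.requireNonNull(schema, "schema");
      this.hasRowIdColumn = hasRowIdColumn;
      this.rowIdNullCount = rowIdNullCount;
    }
  }

  /** Stand-in for whatever opens a file and reads its footer. */
  interface FooterReader {
    FooterInfo read(String path) throws IOException;
  }

  /** Returns true only if every file is readable, schemas match, and _row_id has no nulls. */
  static boolean canMerge(List<String> paths, FooterReader reader) {
    if (paths == null || paths.isEmpty()) {
      return false; // nothing to merge
    }
    try {
      String firstSchema = null;
      for (String path : paths) {
        FooterInfo info = reader.read(path);
        if (firstSchema == null) {
          firstSchema = info.schema;
        } else if (!firstSchema.equals(info.schema)) {
          return false; // schema mismatch between input files
        }
        if (info.hasRowIdColumn && info.rowIdNullCount > 0) {
          return false; // a physical _row_id column contains nulls
        }
      }
      return true;
    } catch (RuntimeException | IOException e) {
      // Catching RuntimeException instead of IllegalArgumentException also covers
      // unchecked failures such as ParquetCryptoRuntimeException.
      return false;
    }
  }
}
```

With this shape, callers only branch on the boolean result and never receive a schema object from the validation step, which matches the simplification described for SparkParquetFileMergeRunner in item 5.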