pvary commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2570851771


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFileMerger.java:
##########
@@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.parquet;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+
+/**
+ * Utility class for performing strict schema validation and merging of Parquet files at the
+ * row-group level.
+ *
+ * <p>This class ensures that all input files have identical Parquet schemas before merging. The
+ * merge operation is performed by copying row groups directly without
+ * serialization/deserialization, providing significant performance benefits over traditional
+ * read-rewrite approaches.
+ *
+ * <p>TODO: Encrypted tables are not supported
+ *
+ * <p>Key features:
+ *
+ * <ul>
+ *   <li>Zero-copy row group merging using {@link ParquetFileWriter#appendFile}
+ *   <li>Strict schema validation - all files must have identical {@link MessageType}
+ *   <li>Metadata merging for Iceberg-specific footer data
+ * </ul>
+ *
+ * <p>Typical usage:
+ *
+ * <pre>
+ * Configuration conf = new Configuration();
+ * List&lt;Path&gt; inputFiles = Arrays.asList(file1, file2, file3);
+ * Path outputFile = new Path("/path/to/output.parquet");
+ * ParquetFileMerger.mergeFiles(inputFiles, outputFile, conf);
+ * </pre>
+ */
+public class ParquetFileMerger {
+
+  private ParquetFileMerger() {
+    // Utility class - prevent instantiation
+  }
+
+  /**
+   * Merges multiple Parquet files into a single output file at the row-group level.
+   *
+   * <p>All input files must have identical Parquet schemas ({@link MessageType}), otherwise an
+   * exception is thrown. The merge is performed by copying row groups directly without
+   * serialization/deserialization.
+   *
+   * @param inputFiles List of input Parquet file paths to merge
+   * @param outputFile Output file path for the merged result
+   * @param conf Hadoop configuration to use for file operations
+   * @throws IOException if I/O error occurs during merge operation
+   * @throws IllegalArgumentException if no input files provided or schemas don't match
+   */
+  public static void mergeFiles(List<Path> inputFiles, Path outputFile, Configuration conf)
+      throws IOException {
+    mergeFiles(inputFiles, outputFile, null, conf);
+  }
+
+  /**
+   * Merges multiple Parquet files into a single output file at the row-group level with custom
+   * metadata.
+   *
+   * <p>All input files must have identical Parquet schemas ({@link MessageType}), otherwise an
+   * exception is thrown. The merge is performed by copying row groups directly without
+   * serialization/deserialization.
+   *
+   * @param inputFiles List of input Parquet file paths to merge
+   * @param outputFile Output file path for the merged result
+   * @param extraMetadata Additional metadata to include in the output file footer (can be null)
+   * @param conf Hadoop configuration to use for file operations
+   * @throws IOException if I/O error occurs during merge operation
+   * @throws IllegalArgumentException if no input files provided or schemas don't match
+   */
+  public static void mergeFiles(
+      List<Path> inputFiles, Path outputFile, Map<String, String> extraMetadata, Configuration conf)
+      throws IOException {
+    // Validate and get the common schema
+    MessageType schema = validateAndGetSchema(inputFiles, conf);
+
+    // Create the output Parquet file writer
+    try (ParquetFileWriter writer =
+        new ParquetFileWriter(conf, schema, outputFile, ParquetFileWriter.Mode.CREATE)) {
+
+      writer.start();
+
+      // Append each input file's row groups to the output
+      for (Path inputFile : inputFiles) {
+        writer.appendFile(HadoopInputFile.fromPath(inputFile, conf));

Review Comment:
   Thanks for fixing
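
   For anyone skimming the thread, a minimal self-contained sketch of the appendFile-based,
   row-group level merge that the Javadoc above describes (class and helper names here are
   illustrative assumptions, not the PR's exact code):

   import java.io.IOException;
   import java.util.Collections;
   import java.util.List;
   import java.util.Map;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.ParquetFileWriter;
   import org.apache.parquet.hadoop.util.HadoopInputFile;
   import org.apache.parquet.schema.MessageType;

   public class AppendFileMergeSketch {

     public static void merge(
         List<Path> inputs, Path output, Map<String, String> extraMetadata, Configuration conf)
         throws IOException {
       // Require every input to carry exactly the same Parquet schema as the first file.
       MessageType schema = readSchema(inputs.get(0), conf);
       for (Path input : inputs) {
         if (!schema.equals(readSchema(input, conf))) {
           throw new IllegalArgumentException("Schema mismatch for " + input);
         }
       }

       // appendFile copies row groups byte-for-byte, so records are never deserialized.
       ParquetFileWriter writer =
           new ParquetFileWriter(conf, schema, output, ParquetFileWriter.Mode.CREATE);
       writer.start();
       for (Path input : inputs) {
         writer.appendFile(HadoopInputFile.fromPath(input, conf));
       }
       // end() writes the footer, including any extra key/value metadata.
       writer.end(extraMetadata == null ? Collections.<String, String>emptyMap() : extraMetadata);
     }

     private static MessageType readSchema(Path path, Configuration conf) throws IOException {
       try (ParquetFileReader reader =
           ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
         return reader.getFileMetaData().getSchema();
       }
     }
   }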



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

