pvary commented on code in PR #12774: URL: https://github.com/apache/iceberg/pull/12774#discussion_r2100237439
########## core/src/main/java/org/apache/iceberg/io/WriteBuilder.java:
##########
```
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import org.apache.iceberg.FileContent;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.Schema;
+
+/**
+ * Builder interface for creating file writers across supported data file formats. Each {@link
+ * FileAccessFactory} implementation provides appropriate {@link WriteBuilder} instances based on:
+ *
+ * <ul>
+ *   <li>target file format (Parquet, Avro, ORC)
+ *   <li>engine-specific object representation (spark, flink, generic, etc.)
+ *   <li>content type ({@link FileContent#DATA}, {@link FileContent#EQUALITY_DELETES}, {@link
+ *       FileContent#POSITION_DELETES})
+ * </ul>
+ *
+ * The {@link WriteBuilder} follows the builder pattern to configure and create {@link FileAppender}
+ * instances that write data to the target output files.
+ *
+ * @param <B> the concrete builder type for method chaining
+ * @param <E> engine-specific schema type for the input data records
+ */
+public interface WriteBuilder<B extends WriteBuilder<B, E>, E> {
+  /** Set the file schema. */
+  B schema(Schema newSchema);
+
+  /**
+   * Set a writer configuration property which affects the writer behavior.
+   *
+   * @param property a writer config property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B set(String property, String value);
+
+  default B set(Map<String, String> properties) {
+    properties.forEach(this::set);
+    return (B) this;
+  }
+
+  /**
+   * Set a file metadata property in the created file.
+   *
+   * @param property a file metadata property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B meta(String property, String value);
+
+  /** Sets the metrics configuration used for collecting column metrics for the created file. */
+  B metricsConfig(MetricsConfig newMetricsConfig);
+
+  /** Overwrite the file if it already exists. By default, overwrite is disabled. */
+  B overwrite();
+
+  /**
+   * Overwrite the file if it already exists. The default value is <code>false</code>.
+   *
+   * @deprecated Since 1.10.0, will be removed in 1.11.0. Only provided for backward compatibility.
+   *     Use {@link #overwrite()} instead.
+   */
+  @Deprecated
+  B overwrite(boolean enabled);
+
+  /**
+   * Sets the encryption key used for writing the file. If the reader does not support encryption,
+   * then an exception should be thrown.
+   */
+  default B fileEncryptionKey(ByteBuffer encryptionKey) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the additional authentication data (aad) prefix used for writing the file. If the reader
+   * does not support encryption, then an exception should be thrown.
+   */
+  default B aadPrefix(ByteBuffer aadPrefix) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the engine-specific schema for the input data records.
+   *
+   * <p>This method is necessary when the mapping between engine types and Iceberg types is not
+   * one-to-one. For example, when multiple engine types could map to the same Iceberg type, or when
+   * schema metadata beyond the structure is needed to properly interpret the data.
+   *
+   * <p>While the Iceberg schema defines the expected output structure, the engine schema provides
+   * the exact input format details needed for proper type conversion.
+   *
+   * @param newEngineSchema the native schema representation from the engine (Spark, Flink, etc.)
+   * @return this builder for method chaining
+   */
+  B dataSchema(E newEngineSchema);
+
+  /** Finalizes the configuration and builds the {@link FileAppender}. */
+  <D> FileAppender<D> build() throws IOException;
```

Review Comment:
The `WriteBuilder`'s appender will change its input type based on the content type parameter passed to the constructor. If `D` is the type of the object model records, then for `DATA` and `EQUALITY_DELETES` the appender will expect `D`, but for `POSITION_DELETES` it will expect `PositionDelete<D>`.

Also, this would change the signatures of the existing ORC/Parquet/Avro `<D> CloseableIterable<D> ReadBuilder.build()` and `<D> FileAppender<D> WriteBuilder.build()` methods. This makes creating a backward compatible change even more convoluted, as we would need to create a new class and map all the calls in the old one. We could explore this direction as well, but it deviates even further from the current solution, and we would need strong community approval for it.
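To make the typing concern concrete, here is a minimal, self-contained sketch. The `WriteBuilderSketch` class and its nested `FileContent`, `FileAppender`, `PositionDelete`, `Builder`, and `Row` types are simplified stand-ins for illustration, not the real Iceberg classes. With a single `<D> FileAppender<D> build()`, nothing ties the caller's choice of `D` to the content type fixed at construction time, so choosing `PositionDelete<D>` for position deletes is purely a caller convention that the compiler cannot enforce:

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBuilderSketch {
  enum FileContent { DATA, EQUALITY_DELETES, POSITION_DELETES }

  // Stand-in for Iceberg's PositionDelete<D>: a deleted position plus the optional row.
  record PositionDelete<D>(String path, long pos, D row) {}

  // Stand-in appender that just collects the records it is given.
  static class FileAppender<D> {
    final List<D> written = new ArrayList<>();

    void add(D record) {
      written.add(record);
    }
  }

  // The produced appender's record type depends on the content chosen at
  // construction time, which the generic build() method cannot express.
  static class Builder {
    private final FileContent content;

    Builder(FileContent content) {
      this.content = content;
    }

    <D> FileAppender<D> build() {
      // For POSITION_DELETES the caller must remember to pick
      // D = PositionDelete<Row>; the compiler cannot check this.
      return new FileAppender<>();
    }
  }

  record Row(int id) {}

  public static void main(String[] args) {
    // For DATA (and EQUALITY_DELETES) the appender consumes plain rows.
    FileAppender<Row> data = new Builder(FileContent.DATA).build();
    data.add(new Row(1));

    // For POSITION_DELETES the element type silently changes to PositionDelete<Row>.
    FileAppender<PositionDelete<Row>> posDeletes =
        new Builder(FileContent.POSITION_DELETES).build();
    posDeletes.add(new PositionDelete<>("file.parquet", 0L, new Row(1)));

    System.out.println(data.written.size() + "," + posDeletes.written.size()); // prints 1,1
  }
}
```

A dedicated `positionDeleteWriteBuilder` entry point, as proposed below, would move this convention into the type system by fixing the record type to `PositionDelete<D>` at the factory method.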
One possibility I can see is to extend the `FileAccessFactory` API:
```
public interface FileAccessFactory<E, D> {
  FileFormat format();

  String objectModeName();

  <B extends ReadBuilder<B, D>> B readBuilder(InputFile inputFile);

  // For writing D (data and equality deletes)
  <B extends WriteBuilder<B, E, D>> B writeBuilder(OutputFile outputFile, FileContent content);

  // For writing PositionDelete<D>
  <B extends WriteBuilder<B, E, PositionDelete<D>>> B positionDeleteWriteBuilder(OutputFile outputFile, FileContent content);
}
```
I agree that the API would be nicer/cleaner. My question is: is it worth it? Will I have reviewers who check changes of this magnitude?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org