stevenzwu commented on code in PR #12774:
URL: https://github.com/apache/iceberg/pull/12774#discussion_r2093626096


##########
core/src/main/java/org/apache/iceberg/io/ObjectModel.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import org.apache.iceberg.FileFormat;
+
+/**
+ * Direct conversion is used between file formats and engine internal formats 
for performance
+ * reasons. Object models encapsulate these conversions.
+ *
+ * <p>{@link ReadBuilder} is provided for reading data files stored in a given 
{@link FileFormat}
+ * into the engine specific object model.
+ *
+ * <p>{@link AppenderBuilder} is provided for writing engine specific object 
model to data/delete
+ * files stored in a given {@link FileFormat}.
+ *
+ * <p>Iceberg supports the following object models natively:
+ *
+ * <ul>
+ *   <li>generic - reads and writes Iceberg {@link 
org.apache.iceberg.data.Record}s
+ *   <li>spark - reads and writes Spark InternalRow records
+ *   <li>spark-vectorized - vectorized reads for Spark columnar batches. Not 
supported for {@link
+ *       FileFormat#AVRO}
+ *   <li>flink - reads and writes Flink RowData records
+ *   <li>arrow - vectorized reads into the Arrow columnar format. Only 
supported for {@link
+ *       FileFormat#PARQUET}
+ * </ul>
+ *
+ * <p>Engines could implement their own object models to leverage Iceberg data 
file reading and
+ * writing capabilities.
+ *
+ * @param <E> the engine specific schema of the input data for the appender
+ */
+public interface ObjectModel<E> {

Review Comment:
   `ObjectModel` seems too generic and doesn't really capture the responsibility of this factory class. Some alternatives to consider: `IOFactory` or `DataIOFactory`.
   
   `FileIO` is already taken for another context. Otherwise, `FileIOFactory` may be a good fit.



##########
core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.Schema;
+
+/**
+ * Interface which is implemented by the data file format implementations. The 
{@link ObjectModel}
+ * provides the {@link AppenderBuilder} for the given parameters:
+ *
+ * <ul>
+ *   <li>file format
+ *   <li>engine specific object model
+ *   <li>{@link ObjectModel.WriteMode}
+ * </ul>
+ *
+ * The {@link AppenderBuilder} is used to write data to the target files.
+ *
+ * @param <B> type returned by builder API to allow chained calls
+ * @param <E> the engine specific schema of the input data
+ */
+public interface AppenderBuilder<B extends AppenderBuilder<B, E>, E> {

Review Comment:
   I see. It is odd that the symmetric counterpart of `ReadBuilder` is not `WriteBuilder`.
   
   Our current class naming is a bit inconsistent in terms of `Writer` vs `Appender`. E.g. we have `ParquetWriter implements FileAppender`. In my mind, `FileAppender` should be named `FileWriter`, since it writes data in a specific file format, and the current `FileWriter` should be called `ContentFileWriter`, because a `ContentFile` is a data, position delete, or equality delete file. But I guess we might be too late for that.
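Tangentially, the `B extends AppenderBuilder<B, E>` bound in the hunk above ("type returned by builder API to allow chained calls") is the standard self-type trick for fluent builder hierarchies. A minimal, self-contained illustration; all names here are hypothetical stand-ins, not the PR's classes:

```java
// Minimal illustration of a self-referential generic bound like
// AppenderBuilder<B extends AppenderBuilder<B, E>, E>: each subtype binds
// B to itself, so fluent methods inherited from the base type return the
// concrete subtype and chained calls stay fully typed.
abstract class BaseBuilder<B extends BaseBuilder<B>> {
    @SuppressWarnings("unchecked")
    B self() {
        return (B) this; // safe by convention: subtypes bind B to themselves
    }

    B set(String key, String value) {
        // a real builder would record the property here
        return self();
    }
}

final class ParquetLikeBuilder extends BaseBuilder<ParquetLikeBuilder> {
    ParquetLikeBuilder compression(String codec) {
        return self();
    }
}

public class SelfTypeSketch {
    static String demo() {
        // Without the self type, set(...) would return BaseBuilder and
        // compression(...) would not be callable after it in the chain.
        ParquetLikeBuilder builder =
            new ParquetLikeBuilder().set("k", "v").compression("zstd");
        return builder.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ParquetLikeBuilder"
    }
}
```

The unchecked `(B) this` cast is the same convention the PR's default `set(Map)` method relies on.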
   



##########
core/src/main/java/org/apache/iceberg/io/ObjectModel.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import org.apache.iceberg.FileFormat;
+
+/**
+ * Direct conversion is used between file formats and engine internal formats 
for performance
+ * reasons. Object models encapsulate these conversions.
+ *
+ * <p>{@link ReadBuilder} is provided for reading data files stored in a given 
{@link FileFormat}
+ * into the engine specific object model.
+ *
+ * <p>{@link AppenderBuilder} is provided for writing engine specific object 
model to data/delete
+ * files stored in a given {@link FileFormat}.
+ *
+ * <p>Iceberg supports the following object models natively:
+ *
+ * <ul>
+ *   <li>generic - reads and writes Iceberg {@link 
org.apache.iceberg.data.Record}s
+ *   <li>spark - reads and writes Spark InternalRow records
+ *   <li>spark-vectorized - vectorized reads for Spark columnar batches. Not 
supported for {@link
+ *       FileFormat#AVRO}
+ *   <li>flink - reads and writes Flink RowData records
+ *   <li>arrow - vectorized reads into the Arrow columnar format. Only 
supported for {@link
+ *       FileFormat#PARQUET}
+ * </ul>
+ *
+ * <p>Engines could implement their own object models to leverage Iceberg data 
file reading and
+ * writing capabilities.
+ *
+ * @param <E> the engine specific schema of the input data for the appender
+ */
+public interface ObjectModel<E> {
+  /** The file format which is read/written by the object model. */
+  FileFormat format();
+
+  /**
+   * The name of the object model. Allows users to specify the object model to 
map the data file for
+   * reading and writing.
+   */
+  String name();
+
+  /**
+   * The appender builder for the output file which writes the data in the 
specified file format and
+   * accepts the records defined by this object model. The 'mode' parameter 
defines the input type
+   * for the specific writer use-cases. The appender should handle the 
following input in the
+   * specific modes:
+   *
+   * <ul>
+   *   <li>The appender's engine specific input type
+   *       <ul>
+   *         <li>{@link WriteMode#DATA_WRITER}
+   *         <li>{@link WriteMode#EQUALITY_DELETE_WRITER}
+   *       </ul>
+   *   <li>{@link org.apache.iceberg.deletes.PositionDelete} where the type of 
the row is the
+   *       appender's engine specific input type when the 'mode' is {@link
+   *       WriteMode#POSITION_DELETE_WRITER}
+   * </ul>
+   *
+   * @param outputFile to write to
+   * @param mode for the appender
+   * @return the appender builder
+   * @param <B> The type of the appender builder
+   */
+  <B extends AppenderBuilder<B, E>> B appenderBuilder(OutputFile outputFile, 
WriteMode mode);

Review Comment:
   Wondering if we need the `WriteMode` enum. Somewhere downstream, there is probably a switch case.
   
   Instead, we could have 3 different methods like `dataWriteBuilder`, `positionDeleteWriteBuilder`, and `equalityDeleteWriteBuilder`.
   
   If we want to stick with an enum arg, we can probably just use the existing `FileContent` enum.
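The suggested alternative could look like the following sketch; all type and method names are illustrative stand-ins, not the PR's actual API:

```java
// Hypothetical sketch of replacing a WriteMode enum argument with three
// dedicated factory methods, so the downstream switch statement disappears
// and each writer use case has its own entry point.
interface SketchWriteBuilder {
    SketchWriteBuilder set(String property, String value);

    String describe();
}

interface MethodStyleModel {
    SketchWriteBuilder dataWriteBuilder();

    SketchWriteBuilder positionDeleteWriteBuilder();

    SketchWriteBuilder equalityDeleteWriteBuilder();
}

public class WriteBuilderSketch {
    static SketchWriteBuilder named(String name) {
        return new SketchWriteBuilder() {
            @Override
            public SketchWriteBuilder set(String property, String value) {
                return this; // a real builder would record the property
            }

            @Override
            public String describe() {
                return name;
            }
        };
    }

    static MethodStyleModel sampleModel() {
        return new MethodStyleModel() {
            @Override
            public SketchWriteBuilder dataWriteBuilder() {
                return named("data");
            }

            @Override
            public SketchWriteBuilder positionDeleteWriteBuilder() {
                return named("position-delete");
            }

            @Override
            public SketchWriteBuilder equalityDeleteWriteBuilder() {
                return named("equality-delete");
            }
        };
    }

    public static void main(String[] args) {
        // No switch over an enum at the call site; the chosen method fixes the mode.
        System.out.println(sampleModel().positionDeleteWriteBuilder().describe());
    }
}
```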



##########
data/src/main/java/org/apache/iceberg/data/FileWriteBuilderBase.java:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.deletes.EqualityDeleteWriter;
+import org.apache.iceberg.deletes.PositionDeleteWriter;
+import org.apache.iceberg.encryption.EncryptionKeyMetadata;
+import org.apache.iceberg.io.DataWriter;
+
+/**
+ * Builder for generating one of the following:
+ *
+ * <ul>
+ *   <li>{@link DataWriter}
+ *   <li>{@link EqualityDeleteWriter}
+ *   <li>{@link PositionDeleteWriter}
+ * </ul>
+ *
+ * @param <B> type of the builder
+ * @param <E> engine specific schema of the input records used for appender 
initialization
+ */
+interface FileWriteBuilderBase<B extends FileWriteBuilderBase<B, E>, E>

Review Comment:
   `ContentFileWriteBuilder`? Also, can we merge `WriteBuildBase` into this interface too?



##########
core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.Schema;
+
+/**
+ * Interface which is implemented by the data file format implementations. The 
{@link ObjectModel}
+ * provides the {@link AppenderBuilder} for the given parameters:
+ *
+ * <ul>
+ *   <li>file format
+ *   <li>engine specific object model
+ *   <li>{@link ObjectModel.WriteMode}
+ * </ul>
+ *
+ * The {@link AppenderBuilder} is used to write data to the target files.
+ *
+ * @param <B> type returned by builder API to allow chained calls
+ * @param <E> the engine specific schema of the input data
+ */
+public interface AppenderBuilder<B extends AppenderBuilder<B, E>, E> {
+  /** Set the file schema. */
+  B schema(Schema newSchema);
+
+  /**
+   * Set a writer configuration property which affects the writer behavior.
+   *
+   * @param property a writer config property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B set(String property, String value);
+
+  default B set(Map<String, String> properties) {
+    properties.forEach(this::set);
+    return (B) this;
+  }
+
+  /**
+   * Set a file metadata property in the created file.
+   *
+   * @param property a file metadata property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B meta(String property, String value);
+
+  /** Sets the metrics configuration used for collecting column metrics for 
the created file. */
+  B metricsConfig(MetricsConfig newMetricsConfig);
+
+  /** Overwrite the file if it already exists. By default, overwrite is 
disabled. */
+  B overwrite();
+
+  /**
+   * Overwrite the file if it already exists. The default value is 
<code>false</code>.
+   *
+   * @deprecated Since 1.10.0, will be removed in 1.11.0. Only provided for 
backward compatibility.
+   *     Use {@link #overwrite()} instead.
+   */
+  @Deprecated
+  B overwrite(boolean enabled);
+
+  /**
+   * Sets the encryption key used for writing the file. If encryption is not 
supported by the writer
+   * then an exception should be thrown.
+   */
+  default B fileEncryptionKey(ByteBuffer encryptionKey) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the additional authentication data prefix used for writing the file. 
If encryption is not

Review Comment:
   nit: spell out the acronym in the comment. When I first saw `aadPrefix`, I thought it was a typo of `addPrefix` :)
   ```
   additional authentication data (aad)
   ```



##########
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java:
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data;
+
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.common.DynMethods;
+import org.apache.iceberg.encryption.EncryptedOutputFile;
+import org.apache.iceberg.io.AppenderBuilder;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.ObjectModel;
+import org.apache.iceberg.io.ReadBuilder;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Objects;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Registry which provides the available {@link ReadBuilder}s, {@link 
AppenderBuilder}s and writer
+ * builders ({@link DataWriteBuilder}, {@link EqualityDeleteWriteBuilder}, 
{@link
+ * PositionDeleteWriteBuilder}). Based on the `file format` and the requested 
`object model name`
+ * the registry returns the correct reader and writer builders. These builders 
could be used to
+ * generate the readers and writers.
+ *
+ * <p>The available {@link ObjectModel}s are registered by the {@link
+ * #registerObjectModel(ObjectModel)} method. These {@link ObjectModel}s will 
be used to create the
+ * {@link ReadBuilder}s and the {@link AppenderBuilder}s. The former ones are 
returned directly, the
+ * latter ones are either returned directly or wrapped in the appropriate writer 
builder implementations.
+ */
+public final class ObjectModelRegistry {

Review Comment:
   shouldn't this be in the same package as `ObjectModel`, in the iceberg-core module?
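The lookup that the `ObjectModelRegistry` javadoc describes (a `(file format, object model name)` pair resolving to the right builder factory) can be sketched as follows; all names and types are simplified, hypothetical stand-ins for the PR's classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a registry keyed by (file format, object model
// name): object models register themselves, and callers ask the registry
// for the matching reader/writer builder factory.
public class RegistrySketch {
    // Composite key for the lookup table; record gives equals/hashCode.
    record Key(String format, String objectModelName) {}

    private static final Map<Key, String> MODELS = new HashMap<>();

    static void register(String format, String name, String factory) {
        MODELS.put(new Key(format, name), factory);
    }

    static String lookup(String format, String name) {
        String factory = MODELS.get(new Key(format, name));
        if (factory == null) {
            throw new IllegalArgumentException(
                "No object model '" + name + "' registered for format " + format);
        }
        return factory;
    }

    static String demo() {
        register("parquet", "generic", "GenericParquetReaderFactory");
        register("parquet", "spark", "SparkParquetReaderFactory");
        return lookup("parquet", "generic");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "GenericParquetReaderFactory"
    }
}
```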



##########
core/src/main/java/org/apache/iceberg/io/ReadBuilder.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import java.nio.ByteBuffer;
+import java.util.Map;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.mapping.NameMapping;
+
+/**
+ * File formats should implement this interface to provide a builder for 
reading data files. {@link
+ * ReadBuilder} reads the data files with the specified parameters. The 
returned objects are defined
+ * by the {@link ObjectModel} which is used to read the data.
+ *
+ * <p>This interface is directly exposed for the users to parameterize readers.
+ *
+ * @param <B> type returned by builder API to allow chained calls
+ */
+public interface ReadBuilder<B extends ReadBuilder<B>> {
+  /** The configuration key for the batch size in case of vectorized reads. */
+  String RECORDS_PER_BATCH_KEY = "iceberg.records-per-batch";
+
+  /**
+   * Restricts the read to the given range: [start, start + length).
+   *
+   * @param newStart the start position for this read
+   * @param newLength the length of the range this read should scan
+   */
+  B split(long newStart, long newLength);
+
+  /** Read only the given columns. */
+  B project(Schema newSchema);
+
+  /**
+   * Pushes down the {@link Expression} filter for the reader to prevent 
reading unnecessary
+   * records. Some readers might not be able to filter some part of the 
expression. In this case the
+   * reader might return unfiltered or partially filtered rows. It is the 
caller's responsibility to
+   * apply the filter again.
+   *
+   * @param newFilter the filter to set
+   * @param filterCaseSensitive whether the filtering is case-sensitive or not
+   */
+  default B filter(Expression newFilter, boolean filterCaseSensitive) {
+    // Skip filtering if not available
+    return (B) this;
+  }
+
+  /**
+   * Pushes down the {@link Expression} filter for the reader to prevent 
reading unnecessary
+   * records. Some readers might not be able to filter some part of the 
expression. In this case the
+   * reader might return unfiltered or partially filtered rows. It is the 
caller's responsibility to
+   * apply the filter again. The default implementation sets the filter to be 
case-sensitive.
+   *
+   * @param newFilter the filter to set
+   */
+  default B filter(Expression newFilter) {
+    return filter(newFilter, true);
+  }
+
+  /**
+   * Sets configuration key/value pairs for the reader. Reader builders should 
ignore configuration
+   * keys not known for them.
+   */
+  default B set(String key, String value) {
+    // Skip configuration if not applicable
+    return (B) this;
+  }
+
+  /**
+   * Enables reusing the containers returned by the reader. Decreases pressure 
on GC. Readers could
+   * decide to ignore the user provided setting if is not supported by them.
+   */
+  default B reuseContainers() {
+    // Skip container reuse configuration if not applicable
+    return (B) this;
+  }
+
+  /**
+   * Accessors for constant field values. Used for calculating values in the 
result which are coming
+   * from metadata, and not coming from the data files themselves. The keys of 
the map are the
+   * column ids, the values are the accessors generating the values.
+   */
+  B constantFieldAccessors(Map<Integer, ?> constantFieldAccessors);
+
+  /** Sets a mapping from external schema names to Iceberg type IDs. */
+  B withNameMapping(NameMapping newNameMapping);
+
+  /**
+   * Sets the file encryption key used for reading the file. If encryption is 
not supported by the
+   * reader then an exception should be thrown.
+   */
+  default B withFileEncryptionKey(ByteBuffer encryptionKey) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the additional authentication data prefix for encryption. If 
encryption is not supported
+   * by the reader then an exception should be thrown.
+   */
+  default B withAADPrefix(ByteBuffer aadPrefix) {

Review Comment:
   Inconsistent naming: some methods use the `with` prefix, while others don't.
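Separately, the pushdown contract in the `ReadBuilder#filter` javadoc above (the reader "might return unfiltered or partially filtered rows") is worth illustrating. A minimal sketch, where the "reader" is a simplified stand-in and not the PR's API:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the ReadBuilder pushdown contract: the reader may apply the
// pushed-down filter only partially (or not at all), so the caller must
// re-apply the filter to the returned rows.
public class FilterContractSketch {
    // A best-effort "reader" that, as the contract allows, ignores the
    // pushed-down filter entirely and returns every row.
    static List<Integer> readWithPushdown(List<Integer> rows, Predicate<Integer> pushed) {
        return rows;
    }

    static List<Integer> demo() {
        List<Integer> file = List.of(1, 5, 10, 20);
        Predicate<Integer> filter = v -> v >= 10;

        // The caller's responsibility: filter again on top of the
        // possibly unfiltered result from the reader.
        return readWithPushdown(file, filter).stream()
            .filter(filter)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[10, 20]"
    }
}
```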



##########
core/src/main/java/org/apache/iceberg/io/ObjectModel.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import org.apache.iceberg.FileFormat;
+
+/**
+ * Direct conversion is used between file formats and engine internal formats 
for performance
+ * reasons. Object models encapsulate these conversions.
+ *
+ * <p>{@link ReadBuilder} is provided for reading data files stored in a given 
{@link FileFormat}
+ * into the engine specific object model.
+ *
+ * <p>{@link AppenderBuilder} is provided for writing engine specific object 
model to data/delete
+ * files stored in a given {@link FileFormat}.
+ *
+ * <p>Iceberg supports the following object models natively:
+ *
+ * <ul>
+ *   <li>generic - reads and writes Iceberg {@link 
org.apache.iceberg.data.Record}s
+ *   <li>spark - reads and writes Spark InternalRow records
+ *   <li>spark-vectorized - vectorized reads for Spark columnar batches. Not 
supported for {@link
+ *       FileFormat#AVRO}
+ *   <li>flink - reads and writes Flink RowData records
+ *   <li>arrow - vectorized reads into the Arrow columnar format. Only 
supported for {@link
+ *       FileFormat#PARQUET}
+ * </ul>
+ *
+ * <p>Engines could implement their own object models to leverage Iceberg data 
file reading and
+ * writing capabilities.
+ *
+ * @param <E> the engine specific schema of the input data for the appender
+ */
+public interface ObjectModel<E> {
+  /** The file format which is read/written by the object model. */
+  FileFormat format();
+
+  /**
+   * The name of the object model. Allows users to specify the object model to 
map the data file for
+   * reading and writing.
+   */
+  String name();
+
+  /**
+   * The appender builder for the output file which writes the data in the 
specified file format and
+   * accepts the records defined by this object model. The 'mode' parameter 
defines the input type
+   * for the specific writer use-cases. The appender should handle the 
following input in the
+   * specific modes:
+   *
+   * <ul>
+   *   <li>The appender's engine specific input type
+   *       <ul>
+   *         <li>{@link WriteMode#DATA_WRITER}
+   *         <li>{@link WriteMode#EQUALITY_DELETE_WRITER}
+   *       </ul>
+   *   <li>{@link org.apache.iceberg.deletes.PositionDelete} where the type of 
the row is the
+   *       appender's engine specific input type when the 'mode' is {@link
+   *       WriteMode#POSITION_DELETE_WRITER}
+   * </ul>
+   *
+   * @param outputFile to write to
+   * @param mode for the appender
+   * @return the appender builder
+   * @param <B> The type of the appender builder
+   */
+  <B extends AppenderBuilder<B, E>> B appenderBuilder(OutputFile outputFile, 
WriteMode mode);
+
+  /**
+   * The reader builder for the input file which reads the data from the 
specified file format and
+   * returns the records in this object model.
+   *
+   * @param inputFile to read from
+   * @return the reader builder
+   * @param <B> The type of the reader builder
+   */
+  <B extends ReadBuilder<B>> B readBuilder(InputFile inputFile);

Review Comment:
   nit: inconsistent naming: it should be either append/read, or appender/reader.



##########
core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Map;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.Schema;
+
+/**
+ * Interface which is implemented by the data file format implementations. The 
{@link ObjectModel}
+ * provides the {@link AppenderBuilder} for the given parameters:
+ *
+ * <ul>
+ *   <li>file format
+ *   <li>engine specific object model
+ *   <li>{@link ObjectModel.WriteMode}
+ * </ul>
+ *
+ * The {@link AppenderBuilder} is used to write data to the target files.
+ *
+ * @param <B> type returned by builder API to allow chained calls
+ * @param <E> the engine specific schema of the input data
+ */
+public interface AppenderBuilder<B extends AppenderBuilder<B, E>, E> {
+  /** Set the file schema. */
+  B schema(Schema newSchema);
+
+  /**
+   * Set a writer configuration property which affects the writer behavior.
+   *
+   * @param property a writer config property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B set(String property, String value);
+
+  default B set(Map<String, String> properties) {
+    properties.forEach(this::set);
+    return (B) this;
+  }
+
+  /**
+   * Set a file metadata property in the created file.
+   *
+   * @param property a file metadata property name
+   * @param value config value
+   * @return this for method chaining
+   */
+  B meta(String property, String value);
+
+  /** Sets the metrics configuration used for collecting column metrics for 
the created file. */
+  B metricsConfig(MetricsConfig newMetricsConfig);
+
+  /** Overwrite the file if it already exists. By default, overwrite is 
disabled. */
+  B overwrite();
+
+  /**
+   * Overwrite the file if it already exists. The default value is 
<code>false</code>.
+   *
+   * @deprecated Since 1.10.0, will be removed in 1.11.0. Only provided for 
backward compatibility.
+   *     Use {@link #overwrite()} instead.
+   */
+  @Deprecated
+  B overwrite(boolean enabled);
+
+  /**
+   * Sets the encryption key used for writing the file. If encryption is not 
supported by the writer
+   * then an exception should be thrown.
+   */
+  default B fileEncryptionKey(ByteBuffer encryptionKey) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the additional authentication data prefix used for writing the file. 
If encryption is not
+   * supported by the writer then an exception should be thrown.
+   */
+  default B aadPrefix(ByteBuffer aadPrefix) {
+    throw new UnsupportedOperationException("Not supported");
+  }
+
+  /**
+   * Sets the engine native schema for the input. Defines the input type when 
there is an N to 1
+   * mapping between the engine type and the Iceberg type, and providing the 
Iceberg schema is not
+   * enough for the conversion.
+   */
+  B engineSchema(E newEngineSchema);

Review Comment:
   maybe `dataSchema` is better than `engineSchema`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

