zhztheplayer opened a new pull request, #13433: URL: https://github.com/apache/iceberg/pull/13433
A patch to make the API `SparkBatch.createReaderFactory` customizable.

### Reason

Users might need to customize the Spark partition reader deeply without going through Iceberg's built-in `BaseReader` routine. For example, in the Apache Gluten project we are translating and sending a whole Iceberg `SparkInputPartition` to the native layer for Velox to process. The remaining code in `BaseReader` / `BaseBatchReader` won't help much with that.

It turned out that `SparkBatch.createReaderFactory` is a good cut-in point for this customization, because the returned object is a Spark `PartitionReaderFactory`, which is a stable developer API.

### The Change

(Only Spark 4.0 code is affected in this PR.)

A new Spark option is added:

```
spark.sql.iceberg.partition-reader-factory.provider
```

whose default value is:

```
org.apache.iceberg.spark.source.BaseSparkPartitionReaderFactoryProvider
```

The previous partition-reader creation logic is moved from `SparkBatch` into the default provider implementation, `BaseSparkPartitionReaderFactoryProvider`, so users can supply their own implementation to replace it. `BaseSparkPartitionReaderFactoryProvider` itself can be instantiated programmatically on the user side for fallback purposes; i.e., if the user implementation cannot handle the input Spark partition, it can delegate further processing back to the default provider (see the sketch below).

The `SparkPartitionReaderFactoryProvider` API looks like:

```java
public interface SparkPartitionReaderFactoryProvider {
  PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf);
}
```

To maximize forward compatibility, a single `SparkPartitionReaderFactoryConf` parameter is used rather than multiple individual ones. `SparkPartitionReaderFactoryConf` is currently defined as:

```java
@Value.Immutable
public interface SparkPartitionReaderFactoryConf {
  SparkReadConf readConf();

  Schema expectedSchema();

  List<? extends ScanTaskGroup<?>> taskGroups();
}
```

As development proceeds and new parameters need to be added, users' implementations of `SparkPartitionReaderFactoryProvider` won't have to change, because only new methods are added to the conf class.
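To make the fallback pattern concrete, below is a minimal sketch of a custom provider. It assumes the `SparkPartitionReaderFactoryProvider` and `SparkPartitionReaderFactoryConf` types from this PR are accessible to user code; `canHandleNatively` and the nested `NativeReaderFactory` stub are hypothetical illustrations, not part of the PR.

```java
import java.util.List;
import org.apache.iceberg.ScanTaskGroup;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

// Sketch of a user-side provider that handles some partitions itself and
// falls back to Iceberg's default provider for everything else.
public class MyPartitionReaderFactoryProvider implements SparkPartitionReaderFactoryProvider {

  // The default provider, created programmatically for fallback purposes,
  // as described in this PR.
  private final SparkPartitionReaderFactoryProvider fallback =
      new BaseSparkPartitionReaderFactoryProvider();

  @Override
  public PartitionReaderFactory createReaderFactory(SparkPartitionReaderFactoryConf conf) {
    if (canHandleNatively(conf.taskGroups())) {
      // Hypothetical custom factory, e.g. one that ships the whole
      // SparkInputPartition to a native engine (the Gluten / Velox case).
      return new NativeReaderFactory();
    }
    // Delegate further processing back to Iceberg's built-in logic.
    return fallback.createReaderFactory(conf);
  }

  private boolean canHandleNatively(List<? extends ScanTaskGroup<?>> taskGroups) {
    // Placeholder decision; a real implementation would inspect the task
    // groups (file formats, delete files, etc.) before claiming them.
    return false;
  }

  // Stub standing in for a real native reader factory.
  private static class NativeReaderFactory implements PartitionReaderFactory {
    @Override
    public PartitionReader<InternalRow> createReader(InputPartition partition) {
      throw new UnsupportedOperationException("native path omitted in this sketch");
    }
  }
}
```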
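Enabling the custom provider would then be a matter of setting the new option to the class name, along these lines. This is a usage sketch only: `com.example.MyPartitionReaderFactoryProvider` is the hypothetical class above, and exactly where the option may be set (session conf vs. per-read option) depends on how the PR wires it through `SparkReadConf`.

```java
import org.apache.spark.sql.SparkSession;

public class EnableCustomProvider {
  public static void main(String[] args) {
    // Point the new option at the custom provider so that
    // SparkBatch.createReaderFactory resolves it instead of the default.
    SparkSession spark = SparkSession.builder()
        .appName("custom-partition-reader")
        .config(
            "spark.sql.iceberg.partition-reader-factory.provider",
            "com.example.MyPartitionReaderFactoryProvider")
        .getOrCreate();

    // ... run Iceberg reads as usual; qualifying scans would now go
    // through the custom PartitionReaderFactory.
    spark.stop();
  }
}
```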