huaxingao opened a new pull request, #9721: URL: https://github.com/apache/iceberg/pull/9721
This PR introduces a dynamic plugin mechanism to support Spark native execution engines, e.g. [Comet](https://github.com/apache/arrow-datafusion-comet).

Currently in Iceberg, when vectorization is enabled, Iceberg uses the `VectorizedReaderBuilder` to generate `VectorizedArrowReader` and `ColumnarBatchReader`, which are then used for batch reading. I propose introducing a customized `VectorizedReaderBuilder` and a customized `ColumnarBatchReader`. At runtime, if the customized `VectorizedReaderBuilder` and `ColumnarBatchReader` are accessible, the system uses the native vectorized execution engine. If these customized components are not available, Iceberg's standard `VectorizedReaderBuilder` and `ColumnarBatchReader` are used for batch reading.

A new `SparkSQLProperties.CUSTOMIZED_VECTORIZATION_IMPL` is added to specify the customized vectorization implementation. If `CUSTOMIZED_VECTORIZATION_IMPL` is not set, the default Iceberg `SparkVectorizedReaderBuilder` and `ColumnarBatchReader` are used for batch reading. If `CUSTOMIZED_VECTORIZATION_IMPL` is set, the customized `SparkVectorizedReaderBuilder` and `ColumnarBatchReader` are used for batch reading.

In addition, a new `SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX` is added to specify the prefix of the customized vectorization property keys.

Using Apache Comet as an example:

```
SparkSession spark =
    SparkSession.builder()
        .master("local[2]")
        .config(SparkSQLProperties.CUSTOMIZED_VECTORIZATION_IMPL, "org.apache.comet.Comet")
        // CometConf keys start with 'spark.comet'. For example,
        //   CometConf.COMET_USE_DECIMAL_128.key is 'spark.comet.use.decimal128'
        //   CometConf.COMET_USE_LAZY_MATERIALIZATION.key is 'spark.comet.use.lazyMaterialization'
        // so we set SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX to 'spark.comet'.
        .config(SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX, "spark.comet")
        .config(CometConf.COMET_USE_DECIMAL_128().key(), "true")
        .config(CometConf.COMET_USE_LAZY_MATERIALIZATION().key(), "true")
        .enableHiveSupport()
        .getOrCreate();
```

A `VectorizedUtil` class is added to dynamically load `SparkVectorizedReaderBuilder` and `BaseColumnarBatchReader`. The customized `VectorizedReaderBuilder` and the customized `ColumnarBatchReader` need to be implemented in the native engine (e.g. Comet).
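To make the fallback behaviour described in this PR concrete, here is a minimal, self-contained sketch of loading an implementation by class name and falling back to a default when it is unset or not accessible, plus selecting the property keys that share a configured prefix. It is not the code in this PR: `VectorizationPluginSketch`, `ReaderBuilder`, `DefaultReaderBuilder`, `CustomReaderBuilder`, `load`, and `propertiesWithPrefix` are all hypothetical names, and the actual `VectorizedUtil` may resolve classes and handle errors differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorizationPluginSketch {

  /** Hypothetical stand-in for a reader builder; not an Iceberg interface. */
  public interface ReaderBuilder {
    String name();
  }

  /** Hypothetical default, used when no customized implementation can be loaded. */
  public static class DefaultReaderBuilder implements ReaderBuilder {
    @Override
    public String name() {
      return "default";
    }
  }

  /** Hypothetical customized implementation, standing in for one shipped by a native engine. */
  public static class CustomReaderBuilder implements ReaderBuilder {
    @Override
    public String name() {
      return "custom";
    }
  }

  /**
   * Loads the class named by implClassName through reflection. Assumption: any failure to
   * find, instantiate, or cast the class means "not accessible", so the default is returned.
   */
  public static ReaderBuilder load(String implClassName) {
    if (implClassName == null || implClassName.isEmpty()) {
      return new DefaultReaderBuilder(); // property not set
    }
    try {
      Class<?> clazz = Class.forName(implClassName);
      return (ReaderBuilder) clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
      return new DefaultReaderBuilder(); // class missing or unusable
    }
  }

  /** Returns the entries of conf whose key starts with prefix, in iteration order. */
  public static Map<String, String> propertiesWithPrefix(Map<String, String> conf, String prefix) {
    Map<String, String> selected = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        selected.put(entry.getKey(), entry.getValue());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // The class below does not exist on the classpath, so the default is used.
    System.out.println(load("org.example.DoesNotExist").name());
    // A class that is present and implements the interface is instantiated.
    System.out.println(load(CustomReaderBuilder.class.getName()).name());
  }
}
```

Catching the reflective failure and returning the default keeps jobs runnable when the native engine's jar is absent, at the cost of silently hiding a misspelled class name; a real implementation might log the fallback.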