[PR] Dynamically support Spark native engine in Iceberg [iceberg]

via GitHub Tue, 13 Feb 2024 14:45:25 -0800


huaxingao opened a new pull request, #9721:
URL: https://github.com/apache/iceberg/pull/9721


   This PR introduces a dynamic plugin mechanism to support Spark native 
execution engines, e.g. 
[Comet](https://github.com/apache/arrow-datafusion-comet).
   
   Currently, when vectorization is enabled, Iceberg uses 
`VectorizedReaderBuilder` to create `VectorizedArrowReader` and 
`ColumnarBatchReader`, which are then used for batch reading. I propose 
introducing a customized `VectorizedReaderBuilder` and a customized 
`ColumnarBatchReader`. At runtime, if the customized `VectorizedReaderBuilder` 
and `ColumnarBatchReader` are available, Iceberg uses the native vectorized 
execution engine. If these customized components are not available, Iceberg's 
standard `VectorizedReaderBuilder` and `ColumnarBatchReader` are used for 
batch reading. 
   
   A new `SparkSQLProperties.CUSTOMIZED_VECTORIZATION_IMPL` property is added to 
specify the customized vectorization implementation. If 
`CUSTOMIZED_VECTORIZATION_IMPL` is not set, the default Iceberg 
`SparkVectorizedReaderBuilder` and `ColumnarBatchReader` are used for batch 
reading. If `CUSTOMIZED_VECTORIZATION_IMPL` is set, the customized 
`SparkVectorizedReaderBuilder` and `ColumnarBatchReader` are used instead. 
In addition, a new 
`SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX` is added to 
specify the prefix of the customized vectorization property keys. Using Apache 
Comet as an example:
   ```
    SparkSession spark =
      SparkSession.builder()
          .master("local[2]")
          .config(
              SparkSQLProperties.CUSTOMIZED_VECTORIZATION_IMPL,
              "org.apache.comet.Comet")
          // CometConf keys start with 'spark.comet'. For example,
          // CometConf.COMET_USE_DECIMAL_128.key is 'spark.comet.use.decimal128' and
          // CometConf.COMET_USE_LAZY_MATERIALIZATION.key is
          // 'spark.comet.use.lazyMaterialization', so we set
          // SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX to
          // 'spark.comet'.
          .config(SparkSQLProperties.CUSTOMIZED_VECTORIZATION_PROPERTY_PREFIX, "spark.comet")
          .config(CometConf.COMET_USE_DECIMAL_128().key(), "true")
          .config(CometConf.COMET_USE_LAZY_MATERIALIZATION().key(), "true")
          .enableHiveSupport()
          .getOrCreate();
   ```
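
To illustrate what a property prefix makes possible, here is a minimal sketch of selecting only the engine-specific keys from a flat configuration map. The class `PrefixedProperties` and the method `withPrefix` are hypothetical names used for illustration; they are not part of this PR or of Iceberg, and the PR may handle the prefix differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixedProperties {

  // Hypothetical helper (not from the PR): returns the entries of `conf`
  // whose keys start with `prefix`, preserving insertion order.
  public static Map<String, String> withPrefix(Map<String, String> conf, String prefix) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        result.put(entry.getKey(), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("spark.comet.use.decimal128", "true");
    conf.put("spark.comet.use.lazyMaterialization", "true");
    conf.put("spark.sql.shuffle.partitions", "4");

    // Only the two 'spark.comet' keys are selected.
    System.out.println(withPrefix(conf, "spark.comet").keySet());
  }
}
```

A plain `startsWith` check would also match a key such as `spark.cometx.foo`; a prefix ending in a dot (`spark.comet.`) avoids that if it matters.
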
   A `VectorizedUtil` class is added to dynamically load 
`SparkVectorizedReaderBuilder` and `BaseColumnarBatchReader`.
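
The general technique of loading an implementation by class name and falling back to a default can be sketched as follows. This is not the code of `VectorizedUtil`. The interface `ReaderBuilder`, the class `DefaultReaderBuilder`, and the method `load` are hypothetical stand-ins, and the silent fallback on a load failure is an assumption based on the description above.

```java
public class DynamicImplLoader {

  /** Hypothetical stand-in for a reader builder; not an Iceberg interface. */
  public interface ReaderBuilder {
    String name();
  }

  /** Hypothetical stand-in for the default (built-in) implementation. */
  public static class DefaultReaderBuilder implements ReaderBuilder {
    @Override
    public String name() {
      return "default";
    }
  }

  // Loads the class named by implClassName via reflection. Falls back to the
  // default when no name is configured, the class cannot be found or
  // instantiated, or it does not implement ReaderBuilder.
  public static ReaderBuilder load(String implClassName) {
    if (implClassName == null || implClassName.isEmpty()) {
      return new DefaultReaderBuilder();
    }
    try {
      Class<?> clazz = Class.forName(implClassName);
      return clazz.asSubclass(ReaderBuilder.class).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      return new DefaultReaderBuilder();
    }
  }

  public static void main(String[] args) {
    // No implementation configured: the default is used.
    System.out.println(load(null).name());
    // Configured class is not on the classpath: the default is used.
    System.out.println(load("com.example.MissingBuilder").name());
  }
}
```

A real implementation might log the failure or fail fast instead of falling back, so that a misconfigured class name does not go unnoticed.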
   
   The customized `VectorizedReaderBuilder` and `ColumnarBatchReader` must be 
implemented in the native engine (e.g. Comet).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

