alamb commented on code in PR #16791:
URL: https://github.com/apache/datafusion/pull/16791#discussion_r2210136013
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
Review Comment:
I think we should also link to a ticket that describes the plan and
backstory for moving away from `SchemaAdapter` to rewriting the expressions.
This both provides an avenue for feedback and additional information for
people upgrading.
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -106,6 +106,8 @@ pub(crate) struct DatafusionArrowPredicate {
rows_matched: metrics::Count,
/// how long was spent evaluating this predicate
time: metrics::Time,
+ /// used to perform type coercion while filtering rows
Review Comment:
It is a bit unclear how the schema mapper and expression rewriter work
together. My understanding is that the schema is mapped first, and the
simplified physical expression is then evaluated against the mapped schema
rather than the file schema.
Maybe we can add a comment explaining how this works.
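As a purely illustrative, std-only sketch of the ordering I have in mind (all types here are toy stand-ins, not the real `SchemaMapper` / `PhysicalExpr` APIs):

```rust
// Toy model of the two-step evaluation: (1) map the file-schema batch to
// the table schema, (2) run the (already simplified) predicate against
// the mapped batch. All names are hypothetical stand-ins.

#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Null,
}

// stand-in for SchemaMapper::map_batch: the file schema has one column,
// the table schema adds a second column that is filled with nulls
fn map_batch(file_batch: Vec<Value>) -> Vec<(Value, Value)> {
    file_batch.into_iter().map(|v| (v, Value::Null)).collect()
}

// stand-in for the simplified physical expression: `col0 > 5`,
// evaluated against the *mapped* (table-schema) batch
fn evaluate(mapped: &[(Value, Value)]) -> Vec<bool> {
    mapped
        .iter()
        .map(|(c0, _)| matches!(c0, Value::Int(x) if *x > 5))
        .collect()
}

fn main() {
    let file_batch = vec![Value::Int(3), Value::Int(7)];
    let mapped = map_batch(file_batch); // step 1: schema mapping
    let mask = evaluate(&mapped);       // step 2: predicate on mapped schema
    assert_eq!(mask, vec![false, true]);
    println!("{mask:?}");
}
```

Something along these lines (in prose, as a doc comment) would make the intended ordering explicit for readers of `row_filter.rs`.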
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
+The first place this change has taken place is in predicate pushdown for Parquet.
+By default if you do not use a custom `SchemaAdapterFactory` we will use expression conversion instead.
+If you do set a custom `SchemaAdapterFactory` we will continue to use it but emit a warning about that code path being deprecated.
+
+To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`.
+See the [default values](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) example for how to do this.
+Opting into the new APIs will set you up for future changes since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion.
Review Comment:
A link to some description of the future plan would be super helpful here
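To make the doc's point concrete, a toy contrast of the two approaches might look like this (std-only, all names hypothetical, not the real DataFusion APIs):

```rust
// Toy contrast: instead of converting every batch from the file schema
// to the table schema (the SchemaAdapter style), rewrite the expression
// once so it evaluates directly on file-schema data.

// the file stores a column as i32; the table schema expects f64
fn adapt_data(file_col: &[i32]) -> Vec<f64> {
    file_col.iter().map(|&v| v as f64).collect() // per-batch copy
}

// predicate `col > 20.0`, written against the table schema
fn predicate_on_table(col: &[f64]) -> Vec<bool> {
    col.iter().map(|&v| v > 20.0).collect()
}

// expression-conversion route: rewrite `col > 20.0` into `col > 20`
// once, then evaluate against the raw file column with no copies
fn rewritten_predicate_on_file(col: &[i32]) -> Vec<bool> {
    col.iter().map(|&v| v > 20).collect()
}

fn main() {
    let file_col = vec![18, 25];
    let via_data = predicate_on_table(&adapt_data(&file_col));
    let via_expr = rewritten_predicate_on_file(&file_col);
    assert_eq!(via_data, via_expr); // same answer, one data copy fewer
    println!("{via_expr:?}");
}
```

The "more efficient and flexible" claim in the doc is essentially this: the rewrite happens once per file, while data conversion happens once per batch.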
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1095,4 +1124,167 @@ mod test {
assert_eq!(num_batches, 0);
assert_eq!(num_rows, 0);
}
+
+ fn get_value(metrics: &MetricsSet, metric_name: &str) -> usize {
+ match metrics.sum_by_name(metric_name) {
+ Some(v) => v.as_usize(),
+ _ => {
+ panic!(
"Expected metric not found. Looking for '{metric_name}' in\n\n{metrics:#?}"
+ );
+ }
+ }
+ }
+
+ #[tokio::test]
+ async fn test_custom_schema_adapter_no_rewriter() {
Review Comment:
If possible, I think this test should be of the "end to end" variety (in
`core_integration`) so that it shows how these APIs interact to rewrite
predicates and schemas correctly. It is not super clear to me how these
low-level APIs would be used by users, so I am not sure this test covers
the cases correctly.
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -140,6 +143,8 @@ impl ArrowPredicate for DatafusionArrowPredicate {
}
fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+ let batch = self.schema_mapper.map_batch(batch)?;
Review Comment:
Applying the schema mapper first means the predicate is applied to batches
that have already been mapped to the table schema, but wasn't the predicate
rewritten in terms of the file schema?
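To illustrate the mismatch I am asking about, here is a std-only toy (hypothetical names, nothing from the real codebase) where a predicate rewritten to file-schema column order is evaluated on a batch already mapped to table-schema order:

```rust
// file schema column order: [b, a]; table schema order: [a, b]
fn map_to_table_order(file_row: (i64, i64)) -> (i64, i64) {
    (file_row.1, file_row.0)
}

// predicate `a > 0`, rewritten against the file schema where `a` is
// column index 1
fn file_schema_predicate(row: (i64, i64)) -> bool {
    row.1 > 0
}

fn main() {
    let file_row = (-5, 10); // b = -5, a = 10
    // evaluating the file-schema predicate on the mapped (table-order)
    // row reads the wrong column: it sees b instead of a
    let mapped = map_to_table_order(file_row);
    assert!(file_schema_predicate(file_row)); // correct: a = 10 > 0
    assert!(!file_schema_predicate(mapped));  // wrong column: b = -5
}
```

If the predicate really is rewritten to the file schema, then mapping the batch first would produce this kind of wrong-column read, which is why the ordering deserves a comment either way.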
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
Review Comment:
Is the plan to avoid using Schema Adapters for all schema conversion, or
just predicate pushdown?
```suggestion
### Deprecating `SchemaAdapterFactory` and `SchemaAdapter`
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]