alamb commented on code in PR #16791:
URL: https://github.com/apache/datafusion/pull/16791#discussion_r2210136013
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
Review Comment:
I think we should also link to a ticket that describes the plan and
backstory for moving away from `SchemaAdapter` to rewriting the expressions.
This both provides an avenue for feedback and additional information for
people upgrading.
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -106,6 +106,8 @@ pub(crate) struct DatafusionArrowPredicate {
rows_matched: metrics::Count,
/// how long was spent evaluating this predicate
time: metrics::Time,
+ /// used to perform type coercion while filtering rows
Review Comment:
It is a bit unclear how the schema mapper and expression rewriter work
together. My understanding is that the schema is mapped first, and the
simplified physical expression is then evaluated against the mapped schema
rather than the file schema.
Maybe we can add a comment explaining how this works.
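As a purely illustrative, std-only sketch of the ordering I have in mind (all types here are toy stand-ins, not the real `SchemaMapper` / `PhysicalExpr` APIs):

```rust
// Toy model of the two-step evaluation: (1) map the file-schema batch to
// the table schema, (2) run the (already simplified) predicate against
// the mapped batch. All names are hypothetical stand-ins.

#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Null,
}

// stand-in for SchemaMapper::map_batch: the file schema has one column,
// the table schema adds a second column that is filled with nulls
fn map_batch(file_batch: Vec<Value>) -> Vec<(Value, Value)> {
    file_batch.into_iter().map(|v| (v, Value::Null)).collect()
}

// stand-in for the simplified physical expression: `col0 > 5`,
// evaluated against the *mapped* (table-schema) batch
fn evaluate(mapped: &[(Value, Value)]) -> Vec<bool> {
    mapped
        .iter()
        .map(|(c0, _)| matches!(c0, Value::Int(x) if *x > 5))
        .collect()
}

fn main() {
    let file_batch = vec![Value::Int(3), Value::Int(7)];
    let mapped = map_batch(file_batch); // step 1: schema mapping
    let mask = evaluate(&mapped);       // step 2: predicate on mapped schema
    assert_eq!(mask, vec![false, true]);
    println!("{mask:?}");
}
```

Something along these lines (in prose, as a doc comment) would make the intended ordering explicit for readers of `row_filter.rs`.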
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
+
+We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
+The first place this change has taken place is in predicate pushdown for Parquet.
+By default if you do not use a custom `SchemaAdapterFactory` we will use expression conversion instead.
+If you do set a custom `SchemaAdapterFactory` we will continue to use it but emit a warning about that code path being deprecated.
+
+To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`.
+See the [default values](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) example for how to do this.
+Opting into the new APIs will set you up for future changes since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion.
Review Comment:
A link to some description of the future plan would be super helpful here
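To make the doc's point concrete, a toy contrast of the two approaches might look like this (std-only, all names hypothetical, not the real DataFusion APIs):

```rust
// Toy contrast: instead of converting every batch from the file schema
// to the table schema (the SchemaAdapter style), rewrite the expression
// once so it evaluates directly on file-schema data.

// the file stores a column as i32; the table schema expects f64
fn adapt_data(file_col: &[i32]) -> Vec<f64> {
    file_col.iter().map(|&v| v as f64).collect() // per-batch copy
}

// predicate `col > 20.0`, written against the table schema
fn predicate_on_table(col: &[f64]) -> Vec<bool> {
    col.iter().map(|&v| v > 20.0).collect()
}

// expression-conversion route: rewrite `col > 20.0` into `col > 20`
// once, then evaluate against the raw file column with no copies
fn rewritten_predicate_on_file(col: &[i32]) -> Vec<bool> {
    col.iter().map(|&v| v > 20).collect()
}

fn main() {
    let file_col = vec![18, 25];
    let via_data = predicate_on_table(&adapt_data(&file_col));
    let via_expr = rewritten_predicate_on_file(&file_col);
    assert_eq!(via_data, via_expr); // same answer, one data copy fewer
    println!("{via_expr:?}");
}
```

The "more efficient and flexible" claim in the doc is essentially this: the rewrite happens once per file, while data conversion happens once per batch.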
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1095,4 +1124,167 @@ mod test {
assert_eq!(num_batches, 0);
assert_eq!(num_rows, 0);
}
+
+ fn get_value(metrics: &MetricsSet, metric_name: &str) -> usize {
+ match metrics.sum_by_name(metric_name) {
+ Some(v) => v.as_usize(),
+ _ => {
+ panic!(
"Expected metric not found. Looking for '{metric_name}' in\n\n{metrics:#?}"
+ );
+ }
+ }
+ }
+
+ #[tokio::test]
+ async fn test_custom_schema_adapter_no_rewriter() {
Review Comment:
If possible, I think this test should be of the "end to end" variety (in
`core_integration`) so that it shows how these APIs interact to rewrite
predicates and schemas correctly. It is not super clear to me how these
low-level APIs would be used by users, so I am not sure this test covers
the cases correctly.
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -140,6 +143,8 @@ impl ArrowPredicate for DatafusionArrowPredicate {
}
fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
+ let batch = self.schema_mapper.map_batch(batch)?;
Review Comment:
Applying the schema mapper first means the predicate is applied to batches
that have already been mapped to the table schema, but wasn't the predicate
rewritten in terms of the file schema?
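To illustrate the mismatch I am asking about, here is a std-only toy (hypothetical names, nothing from the real codebase) where a predicate rewritten to file-schema column order is evaluated on a batch already mapped to table-schema order:

```rust
// file schema column order: [b, a]; table schema order: [a, b]
fn map_to_table_order(file_row: (i64, i64)) -> (i64, i64) {
    (file_row.1, file_row.0)
}

// predicate `a > 0`, rewritten against the file schema where `a` is
// column index 1
fn file_schema_predicate(row: (i64, i64)) -> bool {
    row.1 > 0
}

fn main() {
    let file_row = (-5, 10); // b = -5, a = 10
    // evaluating the file-schema predicate on the mapped (table-order)
    // row reads the wrong column: it sees b instead of a
    let mapped = map_to_table_order(file_row);
    assert!(file_schema_predicate(file_row)); // correct: a = 10 > 0
    assert!(!file_schema_predicate(mapped));  // wrong column: b = -5
}
```

If the predicate really is rewritten to the file schema, then mapping the batch first would produce this kind of wrong-column read, which is why the ordering deserves a comment either way.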
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd';
For more details about this configuration option, including performance
trade-offs between different compression codecs, see the [Configuration
Settings](../user-guide/configs.md) documentation.
+### Custom `SchemaAdapterFactory` will no longer be used for predicate pushdown
Review Comment:
Is the plan to avoid using Schema Adapters for all schema conversion, or
just predicate pushdown?
```suggestion
### Deprecating `SchemaAdapterFactory` and `SchemaAdapter`
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]