Re: [PR] feat: Add Spark-compatible decimal division [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on code in PR #19628: URL: https://github.com/apache/datafusion/pull/19628#discussion_r2659502642 ## datafusion/spark/src/function/math/decimal_div.rs: ## @@ -0,0 +1,434 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

[PR] Null-aware LeftAnti Join [datafusion]

2026-01-04 Thread via GitHub
viirya opened a new pull request, #19635: URL: https://github.com/apache/datafusion/pull/19635 ## Which issue does this PR close? - Closes #10583. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tes

Re: [I] DataFusion HashJoin LeftAnti doesn't support null aware anti join [datafusion]

2026-01-04 Thread via GitHub
viirya commented on issue #10583: URL: https://github.com/apache/datafusion/issues/10583#issuecomment-3707882918 @comphead I opened #19635 to fix this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

[PR] Snowflake: add Snowflake multi table insert support & add support for sample in subquery [datafusion-sqlparser-rs]

2026-01-04 Thread via GitHub
finchxxia opened a new pull request, #2148: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/2148 1. I found that datafusion-sqlparser-rs cannot support [multi-table insert](https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table) currently. ``` -- Unconditional

Re: [PR] fix: format decimal to string when casting to short [datafusion-comet]

2026-01-04 Thread via GitHub
wForget commented on PR #2916: URL: https://github.com/apache/datafusion-comet/pull/2916#issuecomment-3707850222 @manuzhang The `cast_decimal_to_int32_up` function also has a similar issue. Could you fix it as well? Reproduce test case: ``` castTest( generateD

Re: [PR] Feat : added truncate table support [datafusion]

2026-01-04 Thread via GitHub
Nachiket-Roy commented on PR #19633: URL: https://github.com/apache/datafusion/pull/19633#issuecomment-3707853176 @ethan-tyler, please review this PR. Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
Dandandan commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707934127 run benchmarks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707934201 🤖 `./gh_compare_branch.sh` [gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh) Running Linux aal-dev 6.14.0-1018-gc

Re: [PR] perf: optimize `HashTableLookupExpr::evaluate` [datafusion]

2026-01-04 Thread via GitHub
UBarney commented on code in PR #19602: URL: https://github.com/apache/datafusion/pull/19602#discussion_r2659556702 ## datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs: ## @@ -327,12 +329,24 @@ impl PhysicalExpr for HashTableLookupExpr { Ok(false)

Re: [PR] perf: optimize `HashTableLookupExpr::evaluate` [datafusion]

2026-01-04 Thread via GitHub
UBarney commented on code in PR #19602: URL: https://github.com/apache/datafusion/pull/19602#discussion_r2659557626 ## datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs: ## @@ -327,12 +329,24 @@ impl PhysicalExpr for HashTableLookupExpr { Ok(false)

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3708094406 @nuno-faria Please Review this... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[PR] chore: update ballista version to 51.0.0 (from 50.0.0) [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm opened a new pull request, #1363: URL: https://github.com/apache/datafusion-ballista/pull/1363 # Which issue does this PR close? as part of datafusion upgrade we missed upgrading ballista versions as well Closes #. # Rationale for this change # What ch

Re: [PR] Row group limit pruning [datafusion]

2026-01-04 Thread via GitHub
xudong963 commented on PR #18868: URL: https://github.com/apache/datafusion/pull/18868#issuecomment-3708738865 Hey @alamb @adriangb, do you have time to review the PR? It would be sweet to have it in 52.0.0 -- This is an automated message from the Apache Git Service. To respond to the mes

Re: [PR] Validate parquet writer version [datafusion]

2026-01-04 Thread via GitHub
Jefffrey merged PR #19515: URL: https://github.com/apache/datafusion/pull/19515 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] Validate parquet writer version [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on PR #19515: URL: https://github.com/apache/datafusion/pull/19515#issuecomment-3708575538 Thanks @AlyAbdelmoneim, @xudong963 & @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above t

Re: [I] `array_union` and `array_intersect` cannot handle NULL columnar data [datafusion]

2026-01-04 Thread via GitHub
Jefffrey closed issue #9706: `array_union` and `array_intersect` cannot handle NULL columnar data URL: https://github.com/apache/datafusion/issues/9706 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

Re: [PR] fix: NULL handling in arrow_intersect and arrow_union [datafusion]

2026-01-04 Thread via GitHub
Jefffrey merged PR #19415: URL: https://github.com/apache/datafusion/pull/19415 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] fix: NULL handling in arrow_intersect and arrow_union [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on PR #19415: URL: https://github.com/apache/datafusion/pull/19415#issuecomment-3708582155 Thanks for debugging and fixing this @feniljain -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-04 Thread via GitHub
mattcuento commented on PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#issuecomment-3708756589 > there is one issue with docker build and looks like issue with disk space failing other (not sure how to fix) Thanks, looks like `substrait` doesn't run a high enough

Re: [PR] chore: Improve microbenchmark for string expressions [datafusion-comet]

2026-01-04 Thread via GitHub
coderfender commented on PR #2964: URL: https://github.com/apache/datafusion-comet/pull/2964#issuecomment-3708612474 ``` String expressions

Re: [I] Split built in functions into "packages" [datafusion]

2026-01-04 Thread via GitHub
Jefffrey closed issue #7110: Split built in functions into "packages" URL: https://github.com/apache/datafusion/issues/7110 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [I] Split built in functions into "packages" [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on issue #7110: URL: https://github.com/apache/datafusion/issues/7110#issuecomment-3708645005 Functions are now split into separate crates: - https://github.com/apache/datafusion/tree/main/datafusion/functions - https://github.com/apache/datafusion/tree/main/dataf

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3708692079 there are 2 failing and 30 successful checks, let me solve this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] fix: Maintain `SUM` precision during two-phase aggregation [datafusion]

2026-01-04 Thread via GitHub
github-actions[bot] commented on PR #17815: URL: https://github.com/apache/datafusion/pull/17815#issuecomment-3708711176 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707953409 🤖: Benchmark completed Details ``` Comparing HEAD and speedup_accumulate2 Benchmark clickbench_extended.json

[I] Optimize NullState for non-null data [datafusion]

2026-01-04 Thread via GitHub
Dandandan opened a new issue, #19636: URL: https://github.com/apache/datafusion/issues/19636 ### Is your feature request related to a problem or challenge? Currently, NullState allocates a boolean buffer for (group) accumulators that potentially have null values. ### Describ

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707957595 🤖 `./gh_compare_branch.sh` [gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh) Running Linux aal-dev 6.14.0-1018-gc

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
Dandandan commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707957474 run benchmark tpch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comm

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
Dandandan commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707955700 Query 1 is consistently 15%-20% faster with this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

[I] Convert AVG(col) to SUM(x) / COUNT(*) [datafusion]

2026-01-04 Thread via GitHub
Dandandan opened a new issue, #19637: URL: https://github.com/apache/datafusion/issues/19637 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered

Re: [I] Convert AVG(col) to SUM(x) / COUNT(*) [datafusion]

2026-01-04 Thread via GitHub
Dandandan commented on issue #19637: URL: https://github.com/apache/datafusion/issues/19637#issuecomment-3707964081 Here is an AI-assisted PoC: https://github.com/apache/datafusion/pull/19624 (need to iron out a type bug) -- This is an automated message from the Apache Git Servic

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19625: URL: https://github.com/apache/datafusion/pull/19625#issuecomment-3707967583 🤖: Benchmark completed Details ``` Comparing HEAD and speedup_accumulate2 Benchmark tpch_sf1.json ┏━━

Re: [I] Date + interval returns a type inconsistent with other databases [datafusion]

2026-01-04 Thread via GitHub
kumarUjjawal commented on issue #19527: URL: https://github.com/apache/datafusion/issues/19527#issuecomment-3707844141 I was looking into this issue and attempted to make `Date + Interval` return `Timestamp` instead of `Date`. What I did: 1. **Type Coercion** (`expr-common/src

Re: [I] [EPIC] Optimize performance for slow expressions [datafusion-comet]

2026-01-04 Thread via GitHub
coderfender commented on issue #2986: URL: https://github.com/apache/datafusion-comet/issues/2986#issuecomment-3707862395 @raushanprabhakar1, you can run a local benchmark using a command like the one below: ``` SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.C

Re: [PR] feat: Add array concatenation support to concat function [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on code in PR #18137: URL: https://github.com/apache/datafusion/pull/18137#discussion_r265967 ## datafusion/functions/src/string/concat.rs: ## @@ -501,4 +645,120 @@ mod tests { } Ok(()) } + +#[test] +fn test_concat_with_integ

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2659679232 ## ballista/scheduler/src/scheduler_server/grpc.rs: ## @@ -873,4 +873,77 @@ mod test { assert!(active_executors.is_empty()); Ok(())

Re: [PR] chore: update ballista version to 51.0.0 (from 50.0.0) [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm commented on PR #1363: URL: https://github.com/apache/datafusion-ballista/pull/1363#issuecomment-3708105715 Perhaps @martin-g or @danielhumanmod could help with review; we missed incrementing the ballista version as part of #1345 -- This is an automated message from the

[PR] feat: Support basic Delta scans [datafusion-comet]

2026-01-04 Thread via GitHub
Kimahriman opened a new pull request, #3035: URL: https://github.com/apache/datafusion-comet/pull/3035 ## Which issue does this PR close? Related to #174, not full support so probably should keep that open (or open new tickets specifically for column mapping and deletion vecto

Re: [PR] feat: Support basic Delta scans [datafusion-comet]

2026-01-04 Thread via GitHub
Kimahriman commented on code in PR #3035: URL: https://github.com/apache/datafusion-comet/pull/3035#discussion_r2659688631 ## native/core/Cargo.toml: ## @@ -76,7 +76,7 @@ parking_lot = "0.12.5" datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-fe

Re: [PR] feat: Support basic Delta scans [datafusion-comet]

2026-01-04 Thread via GitHub
Kimahriman commented on code in PR #3035: URL: https://github.com/apache/datafusion-comet/pull/3035#discussion_r2659689056 ## spark/pom.xml: ## @@ -112,6 +112,12 @@ under the License. + + com.google.guava + failureaccess + 1.0.3 +

[I] Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream [datafusion]

2026-01-04 Thread via GitHub
haohuaijin opened a new issue, #19638: URL: https://github.com/apache/datafusion/issues/19638 ### Is your feature request related to a problem or challenge? Currently, `GroupedTopKAggregateStream` supports two types of query ```sql select id, max(time) from t group by id order by

Re: [I] Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream [datafusion]

2026-01-04 Thread via GitHub
haohuaijin commented on issue #19638: URL: https://github.com/apache/datafusion/issues/19638#issuecomment-3708117313 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [I] Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on issue #19638: URL: https://github.com/apache/datafusion/issues/19638#issuecomment-3708907086 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3708922989 Hey @adriangb, can I work on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3708924515 This is probably not a good issue to pick up. This is a draft PR for an unproven idea. -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3708932402 Hi @nuno-faria & @adriangb, All checks have passed successfully on this PR. I’m really excited to see it get merged 🚀 Kindly request you to review and approve when

[I] DynamicFilterPhysicalExpr violates Hash/Eq contract [datafusion]

2026-01-04 Thread via GitHub
adriangb opened a new issue, #19641: URL: https://github.com/apache/datafusion/issues/19641 ### Describe the bug Because Hash and Eq take out separate locks, it's possible the underlying expression changes in between calls. Thus you can get the same hash but no match on equality. I th

[PR] feat: add Time type support to date_trunc function [datafusion]

2026-01-04 Thread via GitHub
kumarUjjawal opened a new pull request, #19640: URL: https://github.com/apache/datafusion/pull/19640 ## Which issue does this PR close? - Part of #19025. ## Rationale for this change ## What changes are included in this PR? - Added Time64/Time32 sig

Re: [I] Support DISTINCT ORDER BY LIMIT query use GroupedTopKAggregateStream [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on issue #19638: URL: https://github.com/apache/datafusion/issues/19638#issuecomment-3708936000 Hey @nuno-faria, can I work on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3708960220 🤖 `./gh_compare_branch.sh` [gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh) Running Linux aal-dev 6.14.0-1018-gc

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3708960071 run benchmark tpch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-04 Thread via GitHub
alamb-ghbot commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3708977753 🤖: Benchmark completed Details ``` Comparing HEAD and filter-pushdown-dynamic Benchmark tpch_sf1.json

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3708978142 OK, I'll work on that... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-04 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3708986247 Hey @adriangb, could you please review it? I have updated it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and u

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-04 Thread via GitHub
coderfender commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3709013471 ``` | Type | Before (main) | After (feature) | Improvement | |--|---|-|-| | i8 | 26.5 µs | 19.8 µs |

Re: [PR] fix: format decimal to string when casting to short [datafusion-comet]

2026-01-04 Thread via GitHub
wForget merged PR #2916: URL: https://github.com/apache/datafusion-comet/pull/2916 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [PR] fix: format decimal to string when casting to short [datafusion-comet]

2026-01-04 Thread via GitHub
wForget commented on PR #2916: URL: https://github.com/apache/datafusion-comet/pull/2916#issuecomment-3709003717 Thanks @manuzhang, merged to main -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [I] `cast_decimal_to_int16_down` formats decimal value incorrectly [datafusion-comet]

2026-01-04 Thread via GitHub
wForget closed issue #2914: `cast_decimal_to_int16_down` formats decimal value incorrectly URL: https://github.com/apache/datafusion-comet/issues/2914 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] fix(functions): Make translate function postgres compatible [datafusion]

2026-01-04 Thread via GitHub
devanshu0987 commented on PR #19630: URL: https://github.com/apache/datafusion/pull/19630#issuecomment-3709017147 Hi @Jefffrey, is there anything more I have to do here? What is the process to merge it into the main? -- This is an automated message from the Apache Git Service. To respond

Re: [PR] fix(functions): Make translate function postgres compatible [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on PR #19630: URL: https://github.com/apache/datafusion/pull/19630#issuecomment-3709027867 > Hi @Jefffrey, is there anything more I have to do here? What is the process to merge it into the main? We generally like to leave PRs up for a while after approval in case a

Re: [PR] feat: Implement Spark function `space` [datafusion]

2026-01-04 Thread via GitHub
kazantsev-maksim commented on code in PR #19610: URL: https://github.com/apache/datafusion/pull/19610#discussion_r2659596647 ## datafusion/spark/src/function/string/space.rs: ## @@ -0,0 +1,245 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-04 Thread via GitHub
Brijesh-Thakkar commented on PR #19587: URL: https://github.com/apache/datafusion/pull/19587#issuecomment-3708070847 @2010YOUY01 @Jefffrey Sorry for wasting your time and effort (I am new to this repo and open source). This won't be repeated; I will raise a new PR -- This is an a

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-04 Thread via GitHub
Brijesh-Thakkar commented on PR #19587: URL: https://github.com/apache/datafusion/pull/19587#issuecomment-3708070138 @2010YOUY01 I wanted to ask, can I close this PR and raise a new one? Until you assign the issue to me, I will work on it and raise a new PR, because I think I have messed up in this

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-04 Thread via GitHub
Brijesh-Thakkar closed pull request #19587: Fix NULL handling in ScalarValue::partial_cmp (closes #19579) URL: https://github.com/apache/datafusion/pull/19587 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on PR #19587: URL: https://github.com/apache/datafusion/pull/19587#issuecomment-3708080953 If you do intend to continue work on this, it would be preferable to keep this PR open (even if just in draft mode) so we don't lose discussion context. It's a commendable eff

Re: [I] [EPIC] Optimize performance for slow expressions [datafusion-comet]

2026-01-04 Thread via GitHub
Brijesh-Thakkar commented on issue #2986: URL: https://github.com/apache/datafusion-comet/issues/2986#issuecomment-3708080954 @coderfender How do I run benchmarks if I am working on a PR in the datafusion repo? Will this work there as well? -- This is an automated message from the Apache Git Se

Re: [PR] perf: optimize octet_length for string arrays [datafusion]

2026-01-04 Thread via GitHub
Brijesh-Thakkar commented on PR #19581: URL: https://github.com/apache/datafusion/pull/19581#issuecomment-3708081881 @Jefffrey How can I run benchmarks locally? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-04 Thread via GitHub
Brijesh-Thakkar commented on PR #19587: URL: https://github.com/apache/datafusion/pull/19587#issuecomment-3708083626 @Jefffrey OK, I have reopened this PR and will work on this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] doc: Fix plan translation example to use correct aggregation and column [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm merged PR #1362: URL: https://github.com/apache/datafusion-ballista/pull/1362 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubsc

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708209559 Just throwing ideas at the wall just in case it helps. I feel like the fundamental problem (and I may be wrong about this) is that filter pushdown has a rather large I/O an

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708272050 > The new parquet pushdown sort of does this IIUC, but at the physical execution level - i.e. after the IO strategy is somewhat baked in AFAIK the only thing along these li

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708264117 > Have there been any attempts to keep track of filter selectivity and use that to our advantage? For example we could track filter selectivity for each filter and use that to:

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm commented on PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#issuecomment-3708276082 Also, could we gate substrait with config option, which could be on by default? Users not needing it could disable it at compile time. -- This is an automated messa

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708278842 I thought that changed how the row selection was represented / evaluated but did not actually move the filters out of the filter pushdown phase into the apply after scan w/ proje

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708276942 > After each filter for each RecordBatch is evaluated we re-order them and possibly toss the ones with poor selectivity back into the scan phase. I believe this is what htt

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708285486 It depends what you mean by IO 😅, if you mean fetching data from disk / network, you are correct predicate pushdown being discussed here (late materialization) does not influence

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-04 Thread via GitHub
feniljain commented on code in PR #19622: URL: https://github.com/apache/datafusion/pull/19622#discussion_r2659823598 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -564,6 +566,7 @@ fn test_pushdown_through_aggregates_on_grouping_columns() { // 2.

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708296137 > DF enabling filter pushdown will not influence the IO pattern to disk, and therefore this cannot be responsible for the regression in performance Ah maybe this is where m

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708304747 Oh right, yes it will do that sorry, been years since I wrote that code (and it looks like there's some new PushDecoder anyway that might change all of this). So yes it will beha

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708307317 Makes sense there's probably more than one issue to tackle. I just imagine that for the S3/GCS/etc. use case the extra I/O fetches would dominate, and might even be the same for

Re: [I] Convert AVG(col) to SUM(x) / COUNT(*) [datafusion]

2026-01-04 Thread via GitHub
HrithikSampson commented on issue #19637: URL: https://github.com/apache/datafusion/issues/19637#issuecomment-3708307932 take

Re: [I] Convert AVG(col) to SUM(x) / COUNT(*) [datafusion]

2026-01-04 Thread via GitHub
HrithikSampson commented on issue #19637: URL: https://github.com/apache/datafusion/issues/19637#issuecomment-3708308350 take

Re: [PR] fix: Disallows dropping duplicate keys when using full outer join [datafusion-python]

2026-01-04 Thread via GitHub
renato2099 commented on PR #1320: URL: https://github.com/apache/datafusion-python/pull/1320#issuecomment-3708310578 Hi @kosiew, I have added some documentation notes. Let me know if this is sufficient; otherwise I can add more explanations.

Re: [PR] Null-aware LeftAnti Join [datafusion]

2026-01-04 Thread via GitHub
viirya commented on code in PR #19635: URL: https://github.com/apache/datafusion/pull/19635#discussion_r2659844657 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -3516,7 +3516,6 @@ AS VALUES query IT SELECT t1_id, t1_name FROM join_test_left WHERE t1_id NOT IN (SELECT

Re: [I] Automated way to run benchmarks on a dedicated machine from PRs [datafusion]

2026-01-04 Thread via GitHub
Omega359 commented on issue #18115: URL: https://github.com/apache/datafusion/issues/18115#issuecomment-3708188116 > I currently run this on a k8s cluster at my home, but this could run in the cloud if someone wanted to pay for that. I plan on restricting access to committers only. I am con

Re: [PR] perf: optimize octet_length for string arrays [datafusion]

2026-01-04 Thread via GitHub
Jefffrey commented on PR #19581: URL: https://github.com/apache/datafusion/pull/19581#issuecomment-3708153870 > @Jefffrey How can I run benchmarks locally?? See some examples of microbenchmarks here: https://github.com/apache/datafusion/pull/19551 They should be able to be run

Re: [PR] perf: optimize `HashTableLookupExpr::evaluate` [datafusion]

2026-01-04 Thread via GitHub
Dandandan commented on code in PR #19602: URL: https://github.com/apache/datafusion/pull/19602#discussion_r2659721105 ## datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs: ## @@ -327,12 +329,24 @@ impl PhysicalExpr for HashTableLookupExpr { Ok(false)

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-04 Thread via GitHub
milenkovicm commented on PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#issuecomment-3708132381 Maybe as a follow-up we should put a bit more documentation around this and example(s)

Re: [PR] feat: Support basic Delta scans [datafusion-comet]

2026-01-04 Thread via GitHub
codecov-commenter commented on PR #3035: URL: https://github.com/apache/datafusion-comet/pull/3035#issuecomment-3708134680 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/3035?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] fix: format decimal to string when casting to short [datafusion-comet]

2026-01-04 Thread via GitHub
manuzhang commented on PR #2916: URL: https://github.com/apache/datafusion-comet/pull/2916#issuecomment-3708157211 @wForget Thanks for the good suggestion. I was struggling with test case so I left the changes to `cast_decimal_to_int32_up` out. I've added tests for `cast DecimalType(38,18)

Re: [PR] perf: optimize `HashTableLookupExpr::evaluate` [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on PR #19602: URL: https://github.com/apache/datafusion/pull/19602#issuecomment-3708179452 It might be interesting to re-run https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters and see if the numbers are even better now!

Re: [PR] feat: Add Spark-compatible `xxhash64` and `murmur3` hash functions [datafusion]

2026-01-04 Thread via GitHub
andygrove commented on code in PR #19627: URL: https://github.com/apache/datafusion/pull/19627#discussion_r2659849359 ## datafusion/spark/src/function/hash/murmur3_hash.rs: ## @@ -0,0 +1,474 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributo

Re: [PR] fix: Disallows dropping duplicate keys when using full outer join [datafusion-python]

2026-01-04 Thread via GitHub
renato2099 commented on PR #1320: URL: https://github.com/apache/datafusion-python/pull/1320#issuecomment-3708330200 I am thinking that we could have a follow up on this path to be more ergonomic though + a more future-proof API (non-breaking path). Basically, we could introduce an enum-li

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708332419 Yeah, it's a good point that whilst caching reduces the additional decode costs for pushing down predicates, it doesn't eliminate the IO costs. That being said in general you onl

Re: [PR] feat: Add null-aware anti join support [datafusion]

2026-01-04 Thread via GitHub
viirya commented on code in PR #19635: URL: https://github.com/apache/datafusion/pull/19635#discussion_r2659844657 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -3516,7 +3516,6 @@ AS VALUES query IT SELECT t1_id, t1_name FROM join_test_left WHERE t1_id NOT IN (SELECT

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708389645 Okay yes I agree maybe I was being pessimistic 😆. In any case using what we can from stats / metadata to set up the initial state / plan and then refining it once we have runtime

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-04 Thread via GitHub
adriangb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708390932 > > arrow-rs at least exposed the selectivity of filters after each file is read > > It is possible to provide an implementation of ArrowPredicate that tracks this. IIRC t
