Re: [PR] fix: audit array_insert expression for correctness and test coverage [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on PR #3890: URL: https://github.com/apache/datafusion-comet/pull/3890#issuecomment-4180969473 @kazuyukitanimura @martin-g This PR was created using the audit skill that you reviewed -- This is an automated message from the Apache Git Service. To respond to the messag

Re: [I] We do not respect ignoreNulls in first_value / last_value aggregates [datafusion-comet]

2026-04-02 Thread via GitHub
comphead commented on issue #1630: URL: https://github.com/apache/datafusion-comet/issues/1630#issuecomment-4181025858 Comet sets the value from proto for FIRST/LAST ``` AggregateExprBuilder::new(Arc::new(func), vec![child]) .schema(schema)
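For readers unfamiliar with the flag under discussion, here is a minimal sketch of what `ignoreNulls` means for FIRST/LAST. It is not Comet's or DataFusion's implementation; the function names and the use of `Option<i32>` to model NULL are assumptions made for illustration only.

```rust
/// Hypothetical model: `None` stands in for SQL NULL.
/// Returns the first value; when `ignore_nulls` is true, NULLs are skipped.
fn first_value(values: &[Option<i32>], ignore_nulls: bool) -> Option<i32> {
    if ignore_nulls {
        // First non-NULL element, if any.
        values.iter().copied().flatten().next()
    } else {
        // First element as-is, even if it is NULL.
        values.first().copied().flatten()
    }
}

/// Same idea for the last value, scanning from the end.
fn last_value(values: &[Option<i32>], ignore_nulls: bool) -> Option<i32> {
    if ignore_nulls {
        values.iter().rev().copied().flatten().next()
    } else {
        values.last().copied().flatten()
    }
}
```

With input `[NULL, 2, 3, NULL]`, the two settings give different answers, which is why dropping the flag changes results.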

Re: [PR] deps: upgrade to DataFusion 53.0, Arrow to 58.1 [datafusion-comet]

2026-04-02 Thread via GitHub
mbutrovich commented on PR #3629: URL: https://github.com/apache/datafusion-comet/pull/3629#issuecomment-4181039136 > Most of tests fail on, checking it: > > ``` > Comet native panic: panicked at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/datafusion-physical-ex

Re: [I] CI: Add spark expression coverage to build process [datafusion-comet]

2026-04-02 Thread via GitHub
comphead closed issue #281: CI: Add spark expression coverage to build process URL: https://github.com/apache/datafusion-comet/issues/281

[PR] chore: add SQL tests for FIRST/LAST aggregates [datafusion-comet]

2026-04-02 Thread via GitHub
comphead opened a new pull request, #3891: URL: https://github.com/apache/datafusion-comet/pull/3891 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/1630. ## Rationale for this change ## Summary

Re: [PR] fix: disable atan2 instead of tan [datafusion-comet]

2026-04-02 Thread via GitHub
comphead commented on code in PR #3849: URL: https://github.com/apache/datafusion-comet/pull/3849#discussion_r3030960460 ## spark/src/main/scala/org/apache/comet/serde/math.scala: ## @@ -19,13 +19,19 @@ package org.apache.comet.serde -import org.apache.spark.sql.catalyst.ex

[I] [Feature] Support external Remote Shuffle Service (e.g., Apache Celeborn / Apache Uniffle) [datafusion-ballista]

2026-04-02 Thread via GitHub
jja725 opened a new issue, #1539: URL: https://github.com/apache/datafusion-ballista/issues/1539 ## Is your feature request related to a problem or challenge? Ballista currently stores shuffle data on local executor disks and serves it via Arrow Flight between executors. This creates

Re: [PR] fix: audit array_insert expression for correctness and test coverage [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3890: URL: https://github.com/apache/datafusion-comet/pull/3890#discussion_r3031003372 ## docs/source/contributor-guide/expression-audit-log.md: ## @@ -0,0 +1,32 @@ + + +# Expression Audit Log Review Comment: I think it is important to keep

[I] `NoSuchMethodException` when reflecting Iceberg TableOperations.current() [datafusion-comet]

2026-04-02 Thread via GitHub
karuppayya opened a new issue, #3894: URL: https://github.com/apache/datafusion-comet/issues/3894 ### Describe the bug Comet's Iceberg reflection path calls `table.operations().current()`. The current implementation uses `getDeclaredMethod("current")` on the concrete `operations` r

Re: [PR] fix: handle ambiguous and non-existent local times [datafusion-comet]

2026-04-02 Thread via GitHub
parthchandra commented on code in PR #3865: URL: https://github.com/apache/datafusion-comet/pull/3865#discussion_r3031017576 ## native/spark-expr/src/utils.rs: ## @@ -174,6 +174,19 @@ fn datetime_cast_err(value: i64) -> ArrowError { )) } +fn resolve_local_datetime(tz: &T

[PR] Fix Iceberg reflection for current() on TableOperations hierarchy [datafusion-comet]

2026-04-02 Thread via GitHub
karuppayya opened a new pull request, #3895: URL: https://github.com/apache/datafusion-comet/pull/3895 ## Which issue does this PR close? Closes #3894. ## Rationale for this change Fix `NoSuchMethodException` from Iceberg Reflection ## What changes are included i

Re: [PR] fix: disable atan2 instead of tan [datafusion-comet]

2026-04-02 Thread via GitHub
kazuyukitanimura commented on code in PR #3849: URL: https://github.com/apache/datafusion-comet/pull/3849#discussion_r3031032898 ## spark/src/main/scala/org/apache/comet/serde/math.scala: ## @@ -19,13 +19,19 @@ package org.apache.comet.serde -import org.apache.spark.sql.cat

Re: [I] Current shuffle format has too much overhead with default batch size [datafusion-comet]

2026-04-02 Thread via GitHub
mbutrovich commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4181293267 I mentioned this to @andygrove the other day, but applying compression (lz4, snappy, etc.) at the batch granularity is likely too small to get all their benefits. I’d be

Re: [PR] perf: Optimize `split_part` for scalar args [datafusion]

2026-04-02 Thread via GitHub
neilconway commented on code in PR #21238: URL: https://github.com/apache/datafusion/pull/21238#discussion_r3031035588 ## datafusion/functions/src/string/split_part.rs: ## @@ -220,6 +231,190 @@ fn rsplit_nth<'a>(string: &'a str, delimiter: &str, n: usize) -> Option<&'a str>

Re: [PR] fix: disable atan2 instead of tan [datafusion-comet]

2026-04-02 Thread via GitHub
kazuyukitanimura commented on code in PR #3849: URL: https://github.com/apache/datafusion-comet/pull/3849#discussion_r3031050891 ## spark/src/main/scala/org/apache/comet/serde/math.scala: ## @@ -19,13 +19,19 @@ package org.apache.comet.serde -import org.apache.spark.sql.cat

Re: [I] Current shuffle format has too much overhead with default batch size [datafusion-comet]

2026-04-02 Thread via GitHub
karuppayya commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4181214277 @andygrove thanks for creating this issue Adding some more details | Records | Comet Shuffle Write | Standard Shuffle Write | Bytes/Record (Comet) | Byt

[PR] fix sqlite type range mismatch [datafusion-testing]

2026-04-02 Thread via GitHub
xiedeyantu opened a new pull request, #18: URL: https://github.com/apache/datafusion-testing/pull/18 (no comment)

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4181485812 Strange - I tested locally (release build, --partitions 12 and --partitions 16) and found: 1. **Plans are identical** between main and PR for all 4 queries (SPM → DataSour

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangb commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4181490730 So I guess we need to update the benchmarks? We can also open a PR with no real changes to run the benchmarks.

Re: [PR] Introduce Morselizer API [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21327: URL: https://github.com/apache/datafusion/pull/21327#issuecomment-4181880820 I like how small the PR is!

Re: [PR] feat: comet native scan improvements - Dynamic Partition Pruning [datafusion-comet]

2026-04-02 Thread via GitHub
Shekharrajak commented on PR #3546: URL: https://github.com/apache/datafusion-comet/pull/3546#issuecomment-4181847420 Please trigger the CI checks

Re: [PR] Defer task spawning in SortPreservingMergeExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21328: URL: https://github.com/apache/datafusion/pull/21328#issuecomment-4181822009 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21328#issuecomment-4181752523) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] fix: audit array_insert expression for correctness and test coverage [datafusion-comet]

2026-04-02 Thread via GitHub
martin-g commented on PR #3890: URL: https://github.com/apache/datafusion-comet/pull/3890#issuecomment-4181988026 > @kazuyukitanimura @martin-g This PR was created using the audit skill that you reviewed Really cool!

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182166368 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] feat: support LEAD and LAG window functions with IGNORE NULLS [datafusion-comet]

2026-04-02 Thread via GitHub
viirya commented on PR #3876: URL: https://github.com/apache/datafusion-comet/pull/3876#issuecomment-4182168951 > The PR looks good to me, thanks @viirya may I ask you to add sql tests like in #3891 Thanks for review. I will try to add sql tests.

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182101292 Update: found the root cause of Q1/Q3 regression and a fix. **Root cause**: `SortPreservingMergeExec` uses `spawn_buffered(stream, 1)` - only 1 batch prefetched per partiti

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364 run benchmarks

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276 run benchmarks

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182141591 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182140849 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

[PR] feat: Add Spark-compatible `encode` function to datafusion-spark [datafusion]

2026-04-02 Thread via GitHub
JeelRajodiya opened a new pull request, #21331: URL: https://github.com/apache/datafusion/pull/21331 **Rationale** The `datafusion-spark` crate is missing the `encode` function. Spark's [`encode(expr, charset)`](https://spark.apache.org/docs/latest/api/sql/index.html#encode) convert

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182189680 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182191418 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182108390 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182124608 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182117219) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182108145 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182108274 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182158396 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182158896 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21330#issuecomment-4182100364) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182157972 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182117219) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

[I] Defer task spawning in SortPreservingMergeExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan opened a new issue, #21329: URL: https://github.com/apache/datafusion/issues/21329 ### Is your feature request related to a problem or challenge? SortPreservingMergeExec::execute() eagerly calls execute() on all input partitions and spawns buffered tasks immediately, befor
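The general idea of deferring setup to the first poll can be sketched with a plain synchronous iterator. This is only an illustration under simplifying assumptions: `Deferred` is a hypothetical type, it uses `std::iter::Iterator` rather than an async stream, and it is not how `SortPreservingMergeExec` is written.

```rust
/// Hypothetical wrapper: runs `setup` only when `next` is first called,
/// instead of at construction time.
struct Deferred<I, F>
where
    I: Iterator,
    F: FnOnce() -> I,
{
    setup: Option<F>,
    inner: Option<I>,
}

impl<I: Iterator, F: FnOnce() -> I> Deferred<I, F> {
    fn new(setup: F) -> Self {
        Deferred { setup: Some(setup), inner: None }
    }

    /// True once the setup closure has run.
    fn is_started(&self) -> bool {
        self.inner.is_some()
    }
}

impl<I: Iterator, F: FnOnce() -> I> Iterator for Deferred<I, F> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // First call: take the closure and build the inner iterator.
        if let Some(setup) = self.setup.take() {
            self.inner = Some(setup());
        }
        self.inner.as_mut().and_then(|it| it.next())
    }
}
```

Constructing a `Deferred` does no work; the cost of `setup` is paid only if a consumer actually asks for the first item.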

[PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan opened a new pull request, #21330: URL: https://github.com/apache/datafusion/pull/21330 ## Which issue does this PR close? ## Rationale for this change Currently, `RepartitionExec::execute()` eagerly calls `ensure_input_streams_initialized()` which opens all i

Re: [PR] Defer task spawning in SortPreservingMergeExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21328: URL: https://github.com/apache/datafusion/pull/21328#issuecomment-4182041172 cc @neilconway

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182039448 run benchmarks

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182047049 Benchmark for [this request](https://github.com/apache/datafusion/pull/21330#issuecomment-4182039448) failed. Last 20 lines of output: Click to expand ``` * [

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182047804 Benchmark for [this request](https://github.com/apache/datafusion/pull/21330#issuecomment-4182039448) failed. Last 20 lines of output: Click to expand ``` * [

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182047170 Benchmark for [this request](https://github.com/apache/datafusion/pull/21330#issuecomment-4182039448) failed. Last 20 lines of output: Click to expand ``` * [

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182117219 run benchmark sort_pushdown_sorted

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182144813 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182144805 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4182138276) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182198021 Ok - this is currently slower for plans with limited concurrency (tpcds), perhaps slightly better for `clickbench_partitioned` I think we can wait until morsel-splitting and see if

Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

2026-04-02 Thread via GitHub
Dandandan closed pull request #21330: Defer task spawning in RepartitionExec to first poll URL: https://github.com/apache/datafusion/pull/21330

Re: [PR] fix: disable atan2 instead of tan [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3849: URL: https://github.com/apache/datafusion-comet/pull/3849#discussion_r3028427152 ## spark/src/main/scala/org/apache/comet/serde/math.scala: ## @@ -213,24 +219,6 @@ object CometAbs extends CometExpressionSerde[Abs] with MathExprBase {

Re: [PR] feat: Support Spark expression: percentile_cont [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on PR #3757: URL: https://github.com/apache/datafusion-comet/pull/3757#issuecomment-4178421165 @YutaLin could you run `cargo fmt --all` to fix lint failures

Re: [PR] Skip probe-side consumption when hash join build side is empty [datafusion]

2026-04-02 Thread via GitHub
adriangb commented on PR #21068: URL: https://github.com/apache/datafusion/pull/21068#issuecomment-4178283016 Thanks @kosiew ! Feel free to merge.

Re: [PR] Feat: to_json Infinity/-Infinity Nan values support [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3875: URL: https://github.com/apache/datafusion-comet/pull/3875#discussion_r3028384790 ## spark/src/main/scala/org/apache/comet/serde/structs.scala: ## @@ -105,53 +105,37 @@ object CometGetArrayStructFields extends CometExpressionSerde[GetArra

Re: [PR] Feat: to_json Infinity/-Infinity Nan values support [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3875: URL: https://github.com/apache/datafusion-comet/pull/3875#discussion_r3028390361 ## spark/src/main/scala/org/apache/comet/serde/structs.scala: ## @@ -105,53 +105,37 @@ object CometGetArrayStructFields extends CometExpressionSerde[GetArra

Re: [PR] fix: disable atan2 instead of tan [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3849: URL: https://github.com/apache/datafusion-comet/pull/3849#discussion_r3028420858 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1346,7 +1345,7 @@ class CometExpressionSuite extends CometTestBase with Adaptive

Re: [PR] test: ceil and floor works correctly for Decimal128 [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3848: URL: https://github.com/apache/datafusion-comet/pull/3848#discussion_r3028473095 ## spark/src/test/resources/sql-tests/expressions/math/ceil.sql: ## @@ -15,7 +15,6 @@ -- specific language governing permissions and limitations -- under t

Re: [PR] feat: Support Spark expression: percentile_cont [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on PR #3757: URL: https://github.com/apache/datafusion-comet/pull/3757#issuecomment-4178458553 Some test suggestions for edge cases that could reveal incompatibilities. The main risk is that Comet casts all inputs to f64 before accumulation, while Spark stores original
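One reason a cast to f64 before accumulation can matter: an f64 carries a 53-bit significand, so not every 64-bit integer is representable. The check below is a small sketch of that general fact; the helper name is hypothetical and it says nothing about which specific inputs the PR mishandles.

```rust
/// Hypothetical helper: true if `v` is unchanged after a round trip through f64.
/// The comparison is done in i128 so that the float-to-int cast cannot
/// saturate and hide a difference near i64::MAX.
fn survives_f64_round_trip(v: i64) -> bool {
    (v as f64) as i128 == v as i128
}
```

Values up to 2^53 in magnitude round-trip exactly; beyond that, neighbouring integers can collapse onto the same f64.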

Re: [PR] feat: move shuffle writer disk I/O off tokio worker threads [datafusion-ballista]

2026-04-02 Thread via GitHub
hcrosse commented on code in PR #1537: URL: https://github.com/apache/datafusion-ballista/pull/1537#discussion_r3028594421 ## ballista/core/src/utils.rs: ## @@ -159,42 +159,74 @@ pub fn default_config_producer() -> SessionConfig { SessionConfig::new_with_ballista() } -/

Re: [I] How to properly measure off-heap memory usage for Comet? [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove closed issue #1894: How to properly measure off-heap memory usage for Comet? URL: https://github.com/apache/datafusion-comet/issues/1894

Re: [I] How to properly measure off-heap memory usage for Comet? [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on issue #1894: URL: https://github.com/apache/datafusion-comet/issues/1894#issuecomment-4178475112 Here are a few approaches depending on what you're trying to measure: ### Measuring Comet's off-heap memory usage **Option 1: Tracing with jemalloc (recommend

Re: [PR] feat: move shuffle writer disk I/O off tokio worker threads [datafusion-ballista]

2026-04-02 Thread via GitHub
hcrosse commented on code in PR #1537: URL: https://github.com/apache/datafusion-ballista/pull/1537#discussion_r3028568797 ## ballista/core/src/execution_plans/shuffle_writer.rs: ## @@ -255,96 +252,114 @@ impl ShuffleWriterExec { } Some(Parti

Re: [PR] feat: move shuffle writer disk I/O off tokio worker threads [datafusion-ballista]

2026-04-02 Thread via GitHub
hcrosse commented on code in PR #1537: URL: https://github.com/apache/datafusion-ballista/pull/1537#discussion_r3028649892 ## benchmarks/src/bin/shuffle_bench.rs: ## @@ -240,19 +275,53 @@ async fn benchmark_sort_shuffle( output_partitions, ), conf

Re: [PR] feat: move shuffle writer disk I/O off tokio worker threads [datafusion-ballista]

2026-04-02 Thread via GitHub
hcrosse commented on code in PR #1537: URL: https://github.com/apache/datafusion-ballista/pull/1537#discussion_r3028654374 ## ballista/core/src/execution_plans/shuffle_writer.rs: ## @@ -214,14 +214,12 @@ impl ShuffleWriterExec { match output_partitioning {

Re: [PR] Add configurable UNION DISTINCT to FILTER rewrite optimization [datafusion]

2026-04-02 Thread via GitHub
comphead commented on PR #21075: URL: https://github.com/apache/datafusion/pull/21075#issuecomment-4178577955 Thanks @xiedeyantu I'll take a look this week, would be super useful for users and also for regression to have internal microbenchmarks, similar to `datafusion/core/benches/push_dow

Re: [PR] chore: add `.claude/settings.local.json` to `.gitignore` [datafusion]

2026-04-02 Thread via GitHub
jonahgao commented on PR #21312: URL: https://github.com/apache/datafusion/pull/21312#issuecomment-4178579399 Thank you @2010YOUY01 for the review.

Re: [PR] chore: add `.claude/settings.local.json` to `.gitignore` [datafusion]

2026-04-02 Thread via GitHub
jonahgao merged PR #21312: URL: https://github.com/apache/datafusion/pull/21312

[PR] feat: add cast_to_type UDF for type-based casting [datafusion]

2026-04-02 Thread via GitHub
adriangb opened a new pull request, #21322: URL: https://github.com/apache/datafusion/pull/21322 ## Which issue does this PR close? N/A - new feature ## Rationale for this change DuckDB provides a [`cast_to_type(expression, reference)`](https://duckdb.org/docs/current/sq

Re: [PR] [docs] add sql example to timestamp/datetime docs for time zone [datafusion]

2026-04-02 Thread via GitHub
buraksenn commented on PR #21082: URL: https://github.com/apache/datafusion/pull/21082#issuecomment-4178598612 > lgtm, thanks @buraksenn > > Might be worth double-checking a couple of things: > > 1. `current_time()`: Europe/London in December should be UTC+0, so the +1h off

Re: [PR] feat: generate reversed-name data for sort pushdown benchmark [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on PR #21266: URL: https://github.com/apache/datafusion/pull/21266#issuecomment-4175208553 @adriangb Updated! Much simpler now - just `tpchgen --parts=3` + `mv` to rename files. No datafusion-cli needed. The reversed naming produces clear benchmark differences (r

Re: [PR] Allow Spark partial / Comet final for compatible aggregates [datafusion-comet]

2026-04-02 Thread via GitHub
Shekharrajak commented on PR #2994: URL: https://github.com/apache/datafusion-comet/pull/2994#issuecomment-4175310971 Found issue in CI checks: https://github.com/apache/datafusion-comet/issues/3881

[I] Add ExistenceJoin support to Comet native execution [datafusion-comet]

2026-04-02 Thread via GitHub
Shekharrajak opened a new issue, #3881: URL: https://github.com/apache/datafusion-comet/issues/3881 ### What is the problem the feature request solves? Comet does not support ExistenceJoin, causing incorrect results for correlated IN subqueries combined with OR on Spark 4.0. Adding na
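As background, an existence join keeps every left row and attaches a boolean saying whether a match exists on the right, which lets a later filter combine that flag with other predicates through OR. The sketch below is a simplified model with a hypothetical function name and integer keys; it ignores NULL semantics and is not Comet or Spark code.

```rust
use std::collections::HashSet;

/// Hypothetical model of an existence join on a single integer key:
/// every left key is kept and paired with an `exists` flag.
fn existence_join(left: &[i32], right: &[i32]) -> Vec<(i32, bool)> {
    // Build the lookup set from the right side once.
    let right_keys: HashSet<i32> = right.iter().copied().collect();
    left.iter()
        .map(|&k| (k, right_keys.contains(&k)))
        .collect()
}
```

Unlike a semi join, no left row is dropped here, so a predicate such as `exists OR other_condition` can still keep rows without a match.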

Re: [PR] perf: Implement physical execution of uncorrelated scalar subqueries [datafusion]

2026-04-02 Thread via GitHub
Dandandan commented on PR #21240: URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4175309833 > > In principle self.right.execute only builds the stream - it shouldn't do any "actual" work, only the setup > > Is that true in practice? e.g., > > * CoalescePartit

Re: [PR] feat: Additional Canonical Extension Types [datafusion]

2026-04-02 Thread via GitHub
tobixdev commented on code in PR #21291: URL: https://github.com/apache/datafusion/pull/21291#discussion_r3026627995 ## datafusion/common/src/types/canonical_extensions/bool8.rs: ## @@ -0,0 +1,133 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more cont

[I] Duplicate `GROUPING SETS` rows are incorrectly collapsed during execution [datafusion]

2026-04-02 Thread via GitHub
xiedeyantu opened a new issue, #21316: URL: https://github.com/apache/datafusion/issues/21316 ### Describe the bug When `GROUPING SETS` contains duplicate grouping lists, DataFusion incorrectly collapses them during execution. The internal `grouping_id` only encodes the semantic null
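The collapse described above can be illustrated with a toy id scheme. This sketch assumes, for illustration only, that the id is a bitmask with one bit per grouping column, set when the column is absent from the grouping set; the function names are hypothetical and DataFusion's actual encoding may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical id: bit `i` is set when column `i` is NOT in the grouping set.
fn grouping_id(num_columns: usize, set: &[usize]) -> u32 {
    let mut id = 0u32;
    for col in 0..num_columns {
        if !set.contains(&col) {
            id |= 1u32 << col;
        }
    }
    id
}

/// Number of distinct ids produced by a list of grouping sets.
fn distinct_ids(num_columns: usize, sets: &[Vec<usize>]) -> usize {
    sets.iter()
        .map(|s| grouping_id(num_columns, s))
        .collect::<BTreeSet<u32>>()
        .len()
}
```

Under this scheme, `GROUPING SETS ((a, b), (a), (a))` yields three sets but only two distinct ids, so anything keyed on the id alone cannot tell the duplicate `(a)` sets apart.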

Re: [PR] chore(deps): bump jni from 0.21.1 to 0.22.4 in /native [datafusion-comet]

2026-04-02 Thread via GitHub
manuzhang commented on code in PR #3753: URL: https://github.com/apache/datafusion-comet/pull/3753#discussion_r3027533477 ## native/core/src/execution/jni_api.rs: ## @@ -778,33 +778,31 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_writeSortedFileNative comp

Re: [I] Duplicate `GROUPING SETS` rows are incorrectly collapsed during execution [datafusion]

2026-04-02 Thread via GitHub
xiedeyantu commented on issue #21316: URL: https://github.com/apache/datafusion/issues/21316#issuecomment-4176949882 take

Re: [I] Duplicate `GROUPING SETS` rows are incorrectly collapsed during execution [datafusion]

2026-04-02 Thread via GitHub
xiedeyantu commented on issue #21316: URL: https://github.com/apache/datafusion/issues/21316#issuecomment-4177000259 Hi @alamb @neilconway, I have logged an issue here to describe this bug. If you have time, please review it. Thanks!

Re: [PR] DataFusion 53 Release Blog [datafusion-site]

2026-04-02 Thread via GitHub
xudong963 commented on code in PR #162: URL: https://github.com/apache/datafusion-site/pull/162#discussion_r3026588135 ## content/blog/2026-03-25-datafusion-53.0.0.md: ## @@ -0,0 +1,403 @@ +--- +layout: post +title: Apache DataFusion 53.0.0 Released +date: 2026-03-25 Review Com

Re: [PR] chore(deps): bump runs-on/action from 2.0.3 to 2.1.0 [datafusion]

2026-04-02 Thread via GitHub
blaginin merged PR #21134: URL: https://github.com/apache/datafusion/pull/21134

Re: [PR] Adds INList and Between expr to skip outer join [datafusion]

2026-04-02 Thread via GitHub
SubhamSinghal commented on code in PR #21303: URL: https://github.com/apache/datafusion/pull/21303#discussion_r3027721541 ## datafusion/optimizer/src/eliminate_outer_join.rs: ## @@ -436,6 +454,221 @@ mod tests { ") } +#[test] +fn eliminate_left_with_in_li

[I] Current shuffle format has too much overhead with default batch size [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove opened a new issue, #3882: URL: https://github.com/apache/datafusion-comet/issues/3882 ### Describe the bug The current shuffle format writes each batch using the Arrow IPC Stream format, writing a single batch per stream instance, which means that the schema is encoded for

[I] Sort pushdown: reorder row groups by statistics within each file [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas opened a new issue, #21317: URL: https://github.com/apache/datafusion/issues/21317 **Is your feature request related to a problem or challenge?** Currently sort pushdown reorders **files** by min/max statistics to achieve sort elimination. But within each file, row groups

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
zhuqi-lucas commented on code in PR #21182: URL: https://github.com/apache/datafusion/pull/21182#discussion_r3027914806 ## datafusion/datasource-parquet/src/source.rs: ## @@ -811,11 +819,6 @@ impl FileSource for ParquetSource { Ok(SortOrderPushdownResult::Inexact {

Re: [PR] perf: Optimize `split_part` for scalar args [datafusion]

2026-04-02 Thread via GitHub
neilconway commented on PR #21238: URL: https://github.com/apache/datafusion/pull/21238#issuecomment-4177731241 @martin-g Any interest in reviewing this PR? It's a follow-on to the initial `split_work` work that was done in #21119

Re: [I] CaseWhen does not work with custom implemented column expression [datafusion]

2026-04-02 Thread via GitHub
alamb commented on issue #21231: URL: https://github.com/apache/datafusion/issues/21231#issuecomment-4177839220 From what I can tell, the core issue is that CaseWhen currently assumes input column references can be discovered by finding built-in `Column` physical exprs. That is not true for

Re: [PR] feat: enable native_datafusion scan in auto mode [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on code in PR #3781: URL: https://github.com/apache/datafusion-comet/pull/3781#discussion_r3028168645 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -981,9 +981,12 @@ abstract class ParquetReadSuite extends CometTestBase {

Re: [PR] ensure dynamic filters are correctly pushed down through aggregations [datafusion]

2026-04-02 Thread via GitHub
adriangb merged PR #21059: URL: https://github.com/apache/datafusion/pull/21059

Re: [I] Dynamic filters sometimes do not get pushed down through aggregations [datafusion]

2026-04-02 Thread via GitHub
adriangb closed issue #21065: Dynamic filters sometimes do not get pushed down through aggregations URL: https://github.com/apache/datafusion/issues/21065

[I] PropagateEmptyRelation does not eliminate outer joins when one side is empty [datafusion]

2026-04-02 Thread via GitHub
SubhamSinghal opened a new issue, #21320: URL: https://github.com/apache/datafusion/issues/21320 ### Is your feature request related to a problem or challenge? The `PropagateEmptyRelation` optimizer rule correctly handles inner joins, semi joins, and anti joins when one or both side

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangb commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178037387 run benchmark sort_pushdown_sorted

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178061675 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4178037387) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-

Re: [PR] feat: sort file groups by statistics during sort pushdown (Sort pushdown phase 2) [datafusion]

2026-04-02 Thread via GitHub
adriangbot commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178115627 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21182#issuecomment-4178037387) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU

Re: [PR] doc: GetArrayItem is now supported [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove merged PR #3880: URL: https://github.com/apache/datafusion-comet/pull/3880

Re: [PR] feat: enable native_datafusion scan in auto mode [datafusion-comet]

2026-04-02 Thread via GitHub
andygrove commented on PR #3781: URL: https://github.com/apache/datafusion-comet/pull/3781#issuecomment-4178630280 @parthchandra @comphead @mbutrovich Thanks for the feedback so far. I simplified this PR so that `auto` mode now chooses `native_datafusion` **instead of** `native_iceberg_com

Re: [PR] fix: preserve duplicate GROUPING SETS rows [datafusion]

2026-04-02 Thread via GitHub
xiedeyantu commented on PR #21058: URL: https://github.com/apache/datafusion/pull/21058#issuecomment-4178647113 > Sorry for the delayed response, @xiedeyantu ! > > Thanks for revising this. I'm a bit concerned by the overhead here; we are adding a `UInt32` column to _every_ query with
