[I] Partition values are not URL-decoded when extracted from Hive-style paths [datafusion]

2026-01-05 Thread via GitHub
greedAuguria opened a new issue, #19650: URL: https://github.com/apache/datafusion/issues/19650 ### Describe the bug When using Hive-style partitioned tables where partition values contain URL-encoded characters (like `/` encoded as `%2F` or spaces as `%20`), DataFusion returns the l

[PR] fix: Percent Encoding of paths for hive style partitioning [datafusion]

2026-01-05 Thread via GitHub
greedAuguria opened a new pull request, #19651: URL: https://github.com/apache/datafusion/pull/19651 ## Which issue does this PR close? - Closes #19650. ## Rationale for this change Currently, when DataFusion parses Hive-style partitioned paths (e.g., `s3://bucket/table/
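For illustration only, the decoding that issue #19650 asks for might look roughly like the sketch below. It is a minimal standalone sketch, not the code from PR #19651: the helper names `percent_decode` and `partition_values` and the example path are hypothetical, and it uses only the Rust standard library rather than any DataFusion API.

```rust
/// Convert one ASCII hex digit to its numeric value.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode percent-escapes such as `%2F` or `%20` in one path segment.
/// Malformed or incomplete escapes are kept verbatim (an assumption of
/// this sketch, not a statement about DataFusion's behavior).
fn percent_decode(segment: &str) -> String {
    let bytes = segment.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Invalid UTF-8 after decoding is replaced rather than rejected.
    String::from_utf8_lossy(&out).into_owned()
}

/// Extract `key=value` pairs from a Hive-style path, decoding each value.
/// Segments without `=` (bucket, table name, file name) are skipped.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), percent_decode(v)))
        .collect()
}
```

With this sketch, a hypothetical path such as `table/category=a%2Fb/part-0.parquet` yields the pair `("category", "a/b")` instead of the raw `a%2Fb`.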

Re: [PR] Use code points instead of grapheme clusters for string functions [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #3054: URL: https://github.com/apache/datafusion/pull/3054#issuecomment-3710473820 I don't know that there was any reason not to include `lpad` / `rpad` (I don't think it was a deliberate choice) -- This is an automated message from the Apache Git Service. To respond

Re: [I] Confusing behavior now required to refresh the files of a listing table [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19573: URL: https://github.com/apache/datafusion/issues/19573#issuecomment-3710480077 > > I have a draft PR [#19616](https://github.com/apache/datafusion/pull/19616) for one approach I considered, which is to continue using a session level cache as is currently

Re: [PR] chore: update ballista version to 51.0.0 (from 50.0.0) [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm merged PR #1363: URL: https://github.com/apache/datafusion-ballista/pull/1363

Re: [PR] chore: update ballista version to 51.0.0 (from 50.0.0) [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on PR #1363: URL: https://github.com/apache/datafusion-ballista/pull/1363#issuecomment-3710479860 thanks lads, will merge this one

Re: [PR] table scoped lfc [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #19616: URL: https://github.com/apache/datafusion/pull/19616#discussion_r2661565794 ## datafusion/core/src/execution/context/mod.rs: ## @@ -1327,12 +1329,34 @@ impl SessionContext { && table_provider.table_type() == table_type

Re: [PR] Allow dropping qualified columns [datafusion]

2026-01-05 Thread via GitHub
ntjohnson1 commented on PR #19549: URL: https://github.com/apache/datafusion/pull/19549#issuecomment-3710556356 > now we are changing/improving the behavior of `drop_columns`, we should probably update the documentation (wherever the right place is?). I mean after this PR `drop_columns` now

Re: [I] Andrew Lamb Weekly-ish Open Source plan - 2025-12-08 [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19210: URL: https://github.com/apache/datafusion/issues/19210#issuecomment-3710426223 Next one: - https://github.com/apache/datafusion/issues/19652

[I] Andrew Lamb Weekly-ish Open Source plan - 2026-01-05 [datafusion]

2026-01-05 Thread via GitHub
alamb opened a new issue, #19652: URL: https://github.com/apache/datafusion/issues/19652 This is my weekly plan, mostly for my own organizational need. I am making it public in the hopes that helps others to see what I am working on -- also I spend so much time in github the interface is v

Re: [I] Andrew Lamb Weekly-ish Open Source plan - 2025-12-08 [datafusion]

2026-01-05 Thread via GitHub
alamb closed issue #19210: Andrew Lamb Weekly-ish Open Source plan - 2025-12-08 URL: https://github.com/apache/datafusion/issues/19210

Re: [PR] Do not convert pyarrow scalar values to plain python types when passing as `lit` [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer merged PR #1319: URL: https://github.com/apache/datafusion-python/pull/1319

Re: [PR] fix: use coalesce instead of drop_duplicate_keys for join [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer merged PR #1318: URL: https://github.com/apache/datafusion-python/pull/1318

Re: [I] Full join on dataframe with only index yields dropped rows [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer closed issue #1305: Full join on dataframe with only index yields dropped rows URL: https://github.com/apache/datafusion-python/issues/1305

Re: [PR] fix: use coalesce instead of drop_duplicate_keys for join [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer commented on PR #1318: URL: https://github.com/apache/datafusion-python/pull/1318#issuecomment-3710412009 I updated the description because I don't think this is a breaking change since the `drop_duplicate_keys` wasn't released.

Re: [PR] fix: Disallows dropping duplicate keys when using full outer join [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer commented on PR #1320: URL: https://github.com/apache/datafusion-python/pull/1320#issuecomment-3710417582 Closing this PR since the consensus has landed on using the coalesce approach instead. Thank you for the PR and helpful discussions!

Re: [PR] fix: Disallows dropping duplicate keys when using full outer join [datafusion-python]

2026-01-05 Thread via GitHub
timsaucer closed pull request #1320: fix: Disallows dropping duplicate keys when using full outer join URL: https://github.com/apache/datafusion-python/pull/1320

Re: [I] Running Clickbench query 18 fails with "failed to fill whole buffer" error [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19425: URL: https://github.com/apache/datafusion/issues/19425#issuecomment-3710429984 Realistically I am not planning on working on this

Re: [PR] perf: Improve performance of `split_part` [datafusion]

2026-01-05 Thread via GitHub
martin-g commented on code in PR #19570: URL: https://github.com/apache/datafusion/pull/19570#discussion_r2661528785 ## datafusion/functions/src/string/split_part.rs: ## @@ -219,22 +219,32 @@ where .try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError>

Re: [I] Improve performance of `in_list` expressions [datafusion-comet]

2026-01-05 Thread via GitHub
Brijesh-Thakkar commented on issue #3027: URL: https://github.com/apache/datafusion-comet/issues/3027#issuecomment-3711082419 take

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2662024965 ## Cargo.toml: ## @@ -39,6 +39,7 @@ datafusion = "51.0.0" datafusion-cli = "51.0.0" datafusion-proto = "51.0.0" datafusion-proto-common = "51.0.0" +d

Re: [I] Fail the build if public method is missing rustdoc [datafusion-ballista]

2026-01-05 Thread via GitHub
killzoner commented on issue #1258: URL: https://github.com/apache/datafusion-ballista/issues/1258#issuecomment-3711102969 Fresh from the stochastic parrot https://github.com/apache/datafusion-ballista/pull/1364

[PR] feat: add missing public API documentation/comments [datafusion-ballista]

2026-01-05 Thread via GitHub
killzoner opened a new pull request, #1364: URL: https://github.com/apache/datafusion-ballista/pull/1364 # Which issue does this PR close? Closes https://github.com/apache/datafusion-ballista/issues/1258 # Rationale for this change We don't want undocumented

Re: [I] Offest parquet pushdown [datafusion]

2026-01-05 Thread via GitHub
AntoinePrv commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3711428561 @alamb I'm taking the liberty to ping you here since you seem to be working on similar issues lately.

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
coderfender commented on code in PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#discussion_r2662319466 ## native/spark-expr/src/conversion_funcs/cast.rs: ## @@ -1957,41 +1967,46 @@ fn cast_string_to_int_with_range_check( /// Equivalent to /// - org.apache.

Re: [I] Attach `Diagnostic` to "invalid function argument types" error [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #14431: URL: https://github.com/apache/datafusion/issues/14431#issuecomment-3711468279 I don't think so. I don't think @eliaperantoni is actively working on this area any more. I am not sure if @kumarUjjawal is doing so either.

Re: [I] Proposal: Prune complex predicates by propagating column statistics [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3712022437 I'll give it a review shortly

Re: [I] Use pipeline aggregation when data is implicitly sorted by group-by keys [datafusion]

2026-01-05 Thread via GitHub
xavlee commented on issue #19655: URL: https://github.com/apache/datafusion/issues/19655#issuecomment-3712011265 take

Re: [PR] Optimize `Nullstate` / accumulators [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #19625: URL: https://github.com/apache/datafusion/pull/19625#discussion_r2662746848 ## datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs: ## @@ -59,6 +59,10 @@ pub struct NullState { /// If `seen_values[i]` is

Re: [I] Enable parquet filter pushdown (`filter_pushdown`) by default [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3712034479 > Oh right, yes it will do that sorry, been years since I wrote that code (and it looks like there's some new PushDecoder anyway that might change all of this). FWIW the pus

Re: [PR] Emit aggregation groups in chunks to avoid blocking async runtime [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #18906: URL: https://github.com/apache/datafusion/pull/18906#discussion_r2662773701 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -206,37 +233,52 @@ impl GroupValues for GroupValuesRows { output

Re: [PR] Incremental group emission in HashAggregate [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19562: URL: https://github.com/apache/datafusion/pull/19562#issuecomment-3712048320 🤖 `./gh_compare_branch.sh` [gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh) Running Linux aal-dev 6.14.0-1018-gc

Re: [PR] Incremental group emission in HashAggregate [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #19562: URL: https://github.com/apache/datafusion/pull/19562#issuecomment-3712047945 run benchmarks

Re: [PR] fix: Fix to_json handling of NaN and Infinity values (#3016) [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on code in PR #3018: URL: https://github.com/apache/datafusion-comet/pull/3018#discussion_r2662775773 ## native/spark-expr/src/json_funcs/to_json.rs: ## @@ -181,6 +188,23 @@ fn escape_string(input: &str) -> String { escaped_string } +fn normalize_spec

Re: [PR] fix: Fix to_json handling of NaN and Infinity values (#3016) [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on code in PR #3018: URL: https://github.com/apache/datafusion-comet/pull/3018#discussion_r2662782397 ## native/spark-expr/src/json_funcs/to_json.rs: ## @@ -181,6 +188,23 @@ fn escape_string(input: &str) -> String { escaped_string } +fn normalize_spec

Re: [PR] chore(deps): bump tracing from 0.1.43 to 0.1.44 [datafusion]

2026-01-05 Thread via GitHub
alamb merged PR #19644: URL: https://github.com/apache/datafusion/pull/19644

Re: [PR] chore(deps): bump syn from 2.0.111 to 2.0.113 [datafusion]

2026-01-05 Thread via GitHub
alamb merged PR #19645: URL: https://github.com/apache/datafusion/pull/19645

[I] cargo audit failing on main [datafusion]

2026-01-05 Thread via GitHub
alamb opened a new issue, #19656: URL: https://github.com/apache/datafusion/issues/19656 ### Describe the bug Here is an example: https://github.com/apache/datafusion/actions/runs/20728924060/job/59511596846 ### To Reproduce It appears due to `aws-smithy-runtime`

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2661979782 ## ballista/scheduler/Cargo.toml: ## @@ -52,6 +53,7 @@ clap = { workspace = true, optional = true } dashmap = { workspace = true } datafusion = { wor

Re: [PR] perf: optimize `NthValue` when `ignore_nulls` is true [datafusion]

2026-01-05 Thread via GitHub
mzabaluev commented on PR #19496: URL: https://github.com/apache/datafusion/pull/19496#issuecomment-3710992619 Benchmark results against the branch base ``` nth_value_ignore_nulls/first_value_expanding/0%_nulls time: [229.32 µs 229.97 µs 230.68 µs]

[PR] feat: support `SELECT DISTINCT id FROM t ORDER BY id LIMIT n` query use GroupedTopKAggregateStream [datafusion]

2026-01-05 Thread via GitHub
haohuaijin opened a new pull request, #19653: URL: https://github.com/apache/datafusion/pull/19653 ## Which issue does this PR close? close https://github.com/apache/datafusion/issues/19638 ## Rationale for this change see issue #19638 ## What changes are included

Re: [PR] perf: optimize `NthValue` when `ignore_nulls` is true [datafusion]

2026-01-05 Thread via GitHub
mzabaluev commented on code in PR #19496: URL: https://github.com/apache/datafusion/pull/19496#discussion_r2661969517 ## datafusion/functions-window/src/nth_value.rs: ## @@ -519,6 +467,87 @@ impl PartitionEvaluator for NthValueEvaluator { } } +impl NthValueEvaluator { +

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2662111775 ## Cargo.toml: ## @@ -39,6 +39,7 @@ datafusion = "51.0.0" datafusion-cli = "51.0.0" datafusion-proto = "51.0.0" datafusion-proto-common = "51.0.0" +d

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2662119757 ## Cargo.toml: ## @@ -39,6 +39,7 @@ datafusion = "51.0.0" datafusion-cli = "51.0.0" datafusion-proto = "51.0.0" datafusion-proto-common = "51.0.0" +

Re: [PR] Row group limit pruning [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #18868: URL: https://github.com/apache/datafusion/pull/18868#issuecomment-3711927679 Hi @xudong963 -- I am now back from vacation and will review this PR either later today or tomorrow

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
Dandandan commented on code in PR #19622: URL: https://github.com/apache/datafusion/pull/19622#discussion_r2662671140 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -564,6 +566,7 @@ fn test_pushdown_through_aggregates_on_grouping_columns() { // 2.

Re: [PR] Perf: Optimize `substring_index` via single-byte fast path and direct indexing [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19590: URL: https://github.com/apache/datafusion/pull/19590#issuecomment-3711931023 🤖 Hi @alamb, thanks for the request (https://github.com/apache/datafusion/pull/19590#issuecomment-3711930835). [`scrape_comments.py`](https://github.com/alamb/datafusion-b

Re: [PR] Perf: Optimize `substring_index` via single-byte fast path and direct indexing [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #19590: URL: https://github.com/apache/datafusion/pull/19590#issuecomment-3711930835 run benchmark substr_index

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #19622: URL: https://github.com/apache/datafusion/pull/19622#issuecomment-3711902737 run benchmark clickbench_partitioned

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19622: URL: https://github.com/apache/datafusion/pull/19622#issuecomment-3711903165 🤖 `./gh_compare_branch.sh` [gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh) Running Linux aal-dev 6.14.0-1018-gc

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #19622: URL: https://github.com/apache/datafusion/pull/19622#issuecomment-3711902125 > QQuery 23 still seems to be leading ahead! I suspect this has to do with timing. Basically Q23 is like `select * from ... WHERE ... ` type query This can now takes adv

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #19622: URL: https://github.com/apache/datafusion/pull/19622#discussion_r2662653579 ## datafusion/physical-plan/src/coalesce_batches.rs: ## @@ -57,6 +57,10 @@ use futures::stream::{Stream, StreamExt}; /// reaches the `fetch` value. /// /// See [`

Re: [PR] Perf: Optimize `substring_index` via single-byte fast path and direct indexing [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #19590: URL: https://github.com/apache/datafusion/pull/19590#discussion_r2662689222 ## datafusion/functions/src/unicode/substrindex.rs: ## @@ -182,7 +182,8 @@ fn substr_index_general< where T::Native: OffsetSizeTrait, { -let mut builder =

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2661844363 ## ballista/scheduler/Cargo.toml: ## @@ -52,6 +53,7 @@ clap = { workspace = true, optional = true } dashmap = { workspace = true } datafusion = { work

Re: [I] Fail the build if public method is missing rustdoc [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on issue #1258: URL: https://github.com/apache/datafusion-ballista/issues/1258#issuecomment-375904 thanks @killzoner will have fun rebasing #1361 😀

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#issuecomment-3711375113 Will review the latest `test linux balista/crates` failures this evening

Re: [PR] feat: Bump rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365#issuecomment-3711357963 Eh that's silly, I'll update both here to get merged together.

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#issuecomment-3711385863 Perhaps the action is running out of disk space? It looks like the linker is freaking out, which should not be related to your change.

Re: [PR] feat: allow native Iceberg scans with non-identity transform residuals [datafusion-comet]

2026-01-05 Thread via GitHub
parthchandra commented on code in PR #2948: URL: https://github.com/apache/datafusion-comet/pull/2948#discussion_r2662264601 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -478,29 +478,33 @@ case class CometScanRule(session: SparkSession) extends Rule

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
coderfender commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3711389280 @andygrove , sure Here are the benchmarks compared through `critcmp` ``` groupfeature ma

[I] Offest parquet pushdown [datafusion]

2026-01-05 Thread via GitHub
AntoinePrv opened a new issue, #19654: URL: https://github.com/apache/datafusion/issues/19654 ### Is your feature request related to a problem or challenge? `Dataframe::limit` offset option is not used to skip rows when reading a parquet file. Using this reproducer with `datafu

Re: [PR] Add one-step FilterExec creation with projection (#19608) [datafusion]

2026-01-05 Thread via GitHub
GaneshPatil7517 commented on PR #19619: URL: https://github.com/apache/datafusion/pull/19619#issuecomment-3711788173 > @adriangb could you run the workflows again? @GaneshPatil7517 since `with_projection` and `with_batch_size` are being deprecated, we also need to update those uses in DataF

Re: [PR] feat: Bump docker rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365#issuecomment-3711788599 @milenkovicm yep please feel free to merge, looks like it's no longer in draft now

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
coderfender commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3711785788 In other notes, I was also experimenting in implementing a two pass fast algorithm using a switch fallthrough (similar to what my friend wrote here) but the implementation b

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3711805196 > In other notes, I was also experimenting in implementing a two pass fast algorithm using a switch fallthrough but the implementation became super complicated with diminishin

[I] Use pipeline aggregation when data is implicitly sorted by group-by keys [datafusion]

2026-01-05 Thread via GitHub
NGA-TRAN opened a new issue, #19655: URL: https://github.com/apache/datafusion/issues/19655 ### Is your feature request related to a problem or challenge? We have a use case where the query groups by columns that are implicitly sorted, and we would like DataFusion to recognize that or

Re: [PR] feat: Support basic Delta scans [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on PR #3035: URL: https://github.com/apache/datafusion-comet/pull/3035#issuecomment-3711817923 Thanks, @Kimahriman. Please also add content to the documentation (either the user guide or the contributor guide) explaining this new feature.

Re: [PR] feat: Bump docker rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm merged PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365

Re: [PR] feat: Bump docker rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365#issuecomment-3711839508 thanks @mattcuento

Re: [PR] Add blog post on extending SQL in DataFusion [datafusion-site]

2026-01-05 Thread via GitHub
geoffreyclaude commented on code in PR #130: URL: https://github.com/apache/datafusion-site/pull/130#discussion_r2661730358 ## content/blog/2025-12-18-extending-sql.md: ## @@ -0,0 +1,379 @@ +--- +layout: post +title: Extending SQL in DataFusion: from ->> to TABLESAMPLE +date: 20

Re: [PR] feat: Add progress bar with ETA estimation to datafusion-cli [datafusion]

2026-01-05 Thread via GitHub
pepijnve commented on code in PR #17867: URL: https://github.com/apache/datafusion/pull/17867#discussion_r2661771351 ## datafusion-cli/src/progress/plan_introspect.rs: ## @@ -0,0 +1,217 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [PR] perf: optimize bit_length for string arrays [datafusion]

2026-01-05 Thread via GitHub
Brijesh-Thakkar closed pull request #19598: perf: optimize bit_length for string arrays URL: https://github.com/apache/datafusion/pull/19598

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2661989973 ## Cargo.toml: ## @@ -39,6 +39,7 @@ datafusion = "51.0.0" datafusion-cli = "51.0.0" datafusion-proto = "51.0.0" datafusion-proto-common = "51.0.0" +

Re: [PR] perf: optimize `NthValue` when `ignore_nulls` is true [datafusion]

2026-01-05 Thread via GitHub
mzabaluev commented on code in PR #19496: URL: https://github.com/apache/datafusion/pull/19496#discussion_r2661996145 ## datafusion/functions-window/src/nth_value.rs: ## @@ -519,6 +467,87 @@ impl PartitionEvaluator for NthValueEvaluator { } } +impl NthValueEvaluator { +

Re: [PR] perf: optimize octet_length for string arrays [datafusion]

2026-01-05 Thread via GitHub
Brijesh-Thakkar closed pull request #19581: perf: optimize octet_length for string arrays URL: https://github.com/apache/datafusion/pull/19581

Re: [PR] Fix NULL handling in ScalarValue::partial_cmp (closes #19579) [datafusion]

2026-01-05 Thread via GitHub
Brijesh-Thakkar closed pull request #19587: Fix NULL handling in ScalarValue::partial_cmp (closes #19579) URL: https://github.com/apache/datafusion/pull/19587

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2662200067 ## Cargo.toml: ## @@ -39,6 +39,7 @@ datafusion = "51.0.0" datafusion-cli = "51.0.0" datafusion-proto = "51.0.0" datafusion-proto-common = "51.0.0" +d

Re: [PR] chore(deps): bump libc from 0.2.178 to 0.2.179 in /native [datafusion-comet]

2026-01-05 Thread via GitHub
mbutrovich merged PR #3038: URL: https://github.com/apache/datafusion-comet/pull/3038

Re: [PR] chore(deps): bump tokio from 1.48.0 to 1.49.0 in /native [datafusion-comet]

2026-01-05 Thread via GitHub
mbutrovich merged PR #3039: URL: https://github.com/apache/datafusion-comet/pull/3039

[PR] feat: Bump rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento opened a new pull request, #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365 # Which issue does this PR close? Closes #. # Rationale for this change Bumping rust version to keep up to date with the `ballista-builder.Dockerfile`. It w

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3711326648 Thanks @coderfender. I think it would be useful to add a criterion benchmark as well, so we can more easily measure the improvement compared to the main branch.

Re: [PR] perf: Improve string to int perf [datafusion-comet]

2026-01-05 Thread via GitHub
coderfender commented on PR #3017: URL: https://github.com/apache/datafusion-comet/pull/3017#issuecomment-3711333041 Sure @andygrove . Let me get the benchmark file from stash and push a commit

Re: [PR] feat: Bump rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365#issuecomment-3711347038 https://github.com/apache/datafusion-ballista/blob/8ac74028c5f21faf519a812b5cb44946a389dc81/dev/docker/ballista-builder.Dockerfile#L18 as well

Re: [PR] feat: Bump rust to `rust:1.92-trixie` [datafusion-ballista]

2026-01-05 Thread via GitHub
mattcuento commented on PR #1365: URL: https://github.com/apache/datafusion-ballista/pull/1365#issuecomment-3711350775 @milenkovicm the ballista-builder reference will get bumped in #1360 to fix the build issues with the protobuf compiler 🙂

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-05 Thread via GitHub
alamb commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3711961830 Did you find any evidence that the selectivity of predicates changes over the course of the query (or put another way that reordering them during execution would help?)

Re: [PR] perfect hash join [datafusion]

2026-01-05 Thread via GitHub
Dandandan commented on code in PR #19411: URL: https://github.com/apache/datafusion/pull/19411#discussion_r2662706360 ## datafusion/common/src/config.rs: ## @@ -468,6 +468,25 @@ config_namespace! { /// metadata memory consumption pub batch_size: usize, default

Re: [PR] perfect hash join [datafusion]

2026-01-05 Thread via GitHub
Dandandan commented on PR #19411: URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3711965227 Can you solve the conflicts?

Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

2026-01-05 Thread via GitHub
adriangb commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3711967371 > Did you find any evidence that the selectivity of predicates changes over the course of the query (or put another way that reordering them during execution would help?) I i

Re: [PR] table scoped lfc [datafusion]

2026-01-05 Thread via GitHub
alamb commented on code in PR #19616: URL: https://github.com/apache/datafusion/pull/19616#discussion_r2662704588 ## datafusion/execution/src/cache/list_files_cache.rs: ## @@ -146,9 +149,12 @@ pub const DEFAULT_LIST_FILES_CACHE_MEMORY_LIMIT: usize = 1024 * 1024; // 1MiB /// Th

Re: [PR] Fix to_json handling of NaN and Infinity values (#3016) [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on code in PR #3018: URL: https://github.com/apache/datafusion-comet/pull/3018#discussion_r2662709821 ## native/spark-expr/src/json_funcs/to_json.rs: ## @@ -181,6 +188,23 @@ fn escape_string(input: &str) -> String { escaped_string } +fn normalize_spec

Re: [PR] Remove coalesce batches rule and deprecate CoalesceBatchesExec [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19622: URL: https://github.com/apache/datafusion/pull/19622#issuecomment-371196 🤖: Benchmark completed Details ``` Comparing HEAD and feat-deprecate-coalesce-batches Benchmark clickbench_partitioned.json -

Re: [PR] fix: Fix to_json handling of NaN and Infinity values (#3016) [datafusion-comet]

2026-01-05 Thread via GitHub
codecov-commenter commented on PR #3018: URL: https://github.com/apache/datafusion-comet/pull/3018#issuecomment-3711988180 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/3018?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Chore: to_json unit/benchmark tests [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove commented on PR #3011: URL: https://github.com/apache/datafusion-comet/pull/3011#issuecomment-3711987315 @kazantsev-maksim could you merge latest from main to disable the failing test

Re: [I] Release DataFusion `52.0.0` (Dec 2025 / Jan 2026) [datafusion]

2026-01-05 Thread via GitHub
AdamGS commented on issue #18566: URL: https://github.com/apache/datafusion/issues/18566#issuecomment-3711986198 Tested with vortex and it looks good - https://github.com/vortex-data/vortex/pull/5863

Re: [PR] Perf: Optimize `substring_index` via single-byte fast path and direct indexing [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19590: URL: https://github.com/apache/datafusion/pull/19590#issuecomment-3711989071 🤖 `./gh_compare_branch_bench.sh` [compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh) Running Linux aal-dev 6.

Re: [I] Performance of parquet pushdown with offset [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3711997077 Hi @AntoinePrv -- I am definitely surprised at this finding (I would expect DataFusion to do this pretty fast) I think what is happening is that datafusion is not pushing d

Re: [I] Performance of parquet pushdown with offset [datafusion]

2026-01-05 Thread via GitHub
alamb commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3711997858 So TLDR is we need to implement the `offset` optimization in the parquet scan
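
Editor's sketch (not from the thread): one way an `offset` optimization in a scan could work, assuming no row filter is applied so that the per-row-group row counts from the file metadata are exact. Row groups that fall entirely inside the offset are skipped without being decoded. The names `RowGroupPlan` and `plan_offset_limit` are hypothetical and are not DataFusion or parquet crate APIs.

```rust
/// Hypothetical plan for one row group: which group to read, how many
/// leading rows to skip inside it, and how many rows to take after that.
#[derive(Debug, PartialEq)]
struct RowGroupPlan {
    index: usize,
    skip: usize,
    take: usize,
}

/// Turn `OFFSET offset LIMIT limit` into a per-row-group read plan.
/// `row_counts` holds the number of rows in each row group, in file order.
fn plan_offset_limit(row_counts: &[usize], offset: usize, limit: usize) -> Vec<RowGroupPlan> {
    let mut remaining_offset = offset;
    let mut remaining_limit = limit;
    let mut plans = Vec::new();

    for (index, &rows) in row_counts.iter().enumerate() {
        if remaining_limit == 0 {
            break; // enough rows planned; later row groups are never read
        }
        if remaining_offset >= rows {
            // The whole row group lies inside the offset: skip it entirely.
            remaining_offset -= rows;
            continue;
        }
        let skip = remaining_offset;
        let take = (rows - skip).min(remaining_limit);
        plans.push(RowGroupPlan { index, skip, take });
        remaining_offset = 0;
        remaining_limit -= take;
    }
    plans
}

fn main() {
    // Three row groups of 100 rows each, OFFSET 150 LIMIT 120.
    let plans = plan_offset_limit(&[100, 100, 100], 150, 120);
    println!("{:?}", plans);
}
```

With a predicate in play the row counts are no longer known up front, so this sketch would only apply to filter-free scans; whether that matches the case reported in #19654 is not clear from the snippet above.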

Re: [PR] Perf: Optimize `substring_index` via single-byte fast path and direct indexing [datafusion]

2026-01-05 Thread via GitHub
alamb-ghbot commented on PR #19590: URL: https://github.com/apache/datafusion/pull/19590#issuecomment-3712006434 🤖: Benchmark completed Details ``` group main perf_substrindex -

Re: [PR] feat: support `substrait_plan` and remove deprecated `sql` field from `ExecuteQueryParams.Query` [datafusion-ballista]

2026-01-05 Thread via GitHub
milenkovicm commented on code in PR #1360: URL: https://github.com/apache/datafusion-ballista/pull/1360#discussion_r2661903007 ## ballista/scheduler/Cargo.toml: ## @@ -52,6 +53,7 @@ clap = { workspace = true, optional = true } dashmap = { workspace = true } datafusion = { wor

[I] `auto` scan mode should select `native_datafusion` for supported use cases [datafusion-comet]

2026-01-05 Thread via GitHub
andygrove opened a new issue, #3040: URL: https://github.com/apache/datafusion-comet/issues/3040 ### What is the problem the feature request solves? `auto` scan mode should select `native_datafusion` for supported use cases. ### Describe the potential solution _No respons
