[PR] Generate GroupByHash output in multiple RecordBatches (Just an attempt) [datafusion]

2024-08-01 Thread via GitHub
JasonLi-cn opened a new pull request, #11758: URL: https://github.com/apache/datafusion/pull/11758 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/9562 ## Rationale for this change ## What changes are included in this

[PR] refactor: move ExecutionPlan and related structs into dedicated mod [datafusion]

2024-08-01 Thread via GitHub
waynexia opened a new pull request, #11759: URL: https://github.com/apache/datafusion/pull/11759 ## Which issue does this PR close? Related to https://github.com/apache/datafusion/issues/11375 ## Rationale for this change Try to break large `lib.rs` into s

[PR] [Minor] Add test for only [datafusion]

2024-08-01 Thread via GitHub
Dandandan opened a new pull request, #11760: URL: https://github.com/apache/datafusion/pull/11760 ## Which issue does this PR close? This was failing in an earlier version we were using (39) but is fixed more recently. Adding test to make sure this doesn't break in the futu

Re: [PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-08-01 Thread via GitHub
jonahgao commented on code in PR #11729: URL: https://github.com/apache/datafusion/pull/11729#discussion_r1699854883 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -//

Re: [I] Create `name` of aggregate expression from expressions [datafusion]

2024-08-01 Thread via GitHub
lewiszlw commented on issue #11707: URL: https://github.com/apache/datafusion/issues/11707#issuecomment-2262658651 How to handle aliased aggr expr? Keep an optional name param in `create_aggregate_expr` or Change `aggr_expr` field of `AggregateExec` from `Vec>` to `Vec<(Arc, String)>`? --

Re: [PR] Skipping partial aggregation when it is not helping for high cardinality aggregates [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11627: URL: https://github.com/apache/datafusion/pull/11627#discussion_r1699904215 ## datafusion/functions-aggregate/src/count.rs: ## @@ -433,6 +433,49 @@ impl GroupsAccumulator for CountGroupsAccumulator { Ok(vec![Arc::new(counts) as Arra

Re: [PR] Skipping partial aggregation when it is not helping for high cardinality aggregates [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11627: URL: https://github.com/apache/datafusion/pull/11627#discussion_r1699904215 ## datafusion/functions-aggregate/src/count.rs: ## @@ -433,6 +433,49 @@ impl GroupsAccumulator for CountGroupsAccumulator { Ok(vec![Arc::new(counts) as Arra

[PR] Improve `AccumulatorArgs` by Removing the usgaes of `schema` and `input_types` [datafusion]

2024-08-01 Thread via GitHub
xinlifoobar opened a new pull request, #11761: URL: https://github.com/apache/datafusion/pull/11761 ## Which issue does this PR close? Part of #11725 ## Rationale for this change ## What changes are included in this PR? ## Are these changes

Re: [PR] Improve `AccumulatorArgs` by Removing the usgaes of `schema` and `input_types` [datafusion]

2024-08-01 Thread via GitHub
xinlifoobar commented on code in PR #11761: URL: https://github.com/apache/datafusion/pull/11761#discussion_r1699923583 ## datafusion/expr/src/function.rs: ## @@ -54,9 +55,6 @@ pub struct AccumulatorArgs<'a> { /// The return type of the aggregate function. Review Comment:

Re: [PR] Do not push down Sorts if it violates the sort requirements [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11678: URL: https://github.com/apache/datafusion/pull/11678#issuecomment-2262710828 Thank you for the review @mustafasrepo -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Do not push down Sorts if it violates the sort requirements [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11678: URL: https://github.com/apache/datafusion/pull/11678 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] `pushdown_sorts` pushes a SortExec through a node in violation of its stated input ordering requirements [datafusion]

2024-08-01 Thread via GitHub
alamb closed issue #11675: `pushdown_sorts` pushes a SortExec through a node in violation of its stated input ordering requirements URL: https://github.com/apache/datafusion/issues/11675 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [PR] Test + workaround for `SanityCheckPlan` error [datafusion]

2024-08-01 Thread via GitHub
alamb closed pull request #11493: Test + workaround for `SanityCheckPlan` error URL: https://github.com/apache/datafusion/pull/11493 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] Reuse hash [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11708: URL: https://github.com/apache/datafusion/pull/11708#issuecomment-2262720435 If it turns out that reusing hash values isn't a good idea, I think we could close the ticket as "won't do" Another thought, similarly to what @andygrove has been exploring in h

Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

2024-08-01 Thread via GitHub
alamb commented on issue #11680: URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2262722015 @Dandandan has a good point on https://github.com/apache/datafusion/pull/11708#issuecomment-2262084412 that is some cases (like a network shuffle) passing the hash values might b

Re: [PR] Reuse hash [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11708: URL: https://github.com/apache/datafusion/pull/11708#issuecomment-2262723812 Another point is that generating the hash value is most expensive for string columns -- so maybe we could only reuse hash values when there are string columns as part of the groupby ex

Re: [PR] Use upstream `StatisticsConverter` from arrow-rs in DataFusion [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11479: URL: https://github.com/apache/datafusion/pull/11479#discussion_r1699984627 ## datafusion/core/src/datasource/physical_plan/parquet/row_group_filter.rs: ## @@ -356,20 +356,24 @@ impl<'a> RowGroupPruningStatistics<'a> { &'a self,

Re: [PR] Use upstream `StatisticsConverter` from arrow-rs in DataFusion [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11479: URL: https://github.com/apache/datafusion/pull/11479#issuecomment-2262805898 Thank you for the reviews @Ted-Jiang and @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] Use upstream `StatisticsConverter` from arrow-rs in DataFusion [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11479: URL: https://github.com/apache/datafusion/pull/11479 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Reduce test duplication in tests for data page stattistics [datafusion]

2024-08-01 Thread via GitHub
alamb closed issue #11000: Reduce test duplication in tests for data page stattistics URL: https://github.com/apache/datafusion/issues/11000 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the speci

Re: [PR] Rename RepartitionExec metric `repart_time` to `repartition_time` [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11703: URL: https://github.com/apache/datafusion/pull/11703 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Rename RepartitionExec metric `repart_time` to `repartition_time` [datafusion]

2024-08-01 Thread via GitHub
alamb closed issue #11702: Rename RepartitionExec metric `repart_time` to `repartition_time` URL: https://github.com/apache/datafusion/issues/11702 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to th

Re: [PR] Rename RepartitionExec metric `repart_time` to `repartition_time` [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11703: URL: https://github.com/apache/datafusion/pull/11703#issuecomment-2262808454 It seems people like this naming better, so I will merge it in. Thanks everyone for your comments and review -- This is an automated message from the Apache Git Service. To respond t

Re: [PR] Fix `plan_to_sql`: Add wildcard projection to SELECT statement if no projection was set [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11744: URL: https://github.com/apache/datafusion/pull/11744 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] `TableScan` plan without projection is not converted correctly to a sql statement [datafusion]

2024-08-01 Thread via GitHub
alamb closed issue #11743: `TableScan` plan without projection is not converted correctly to a sql statement URL: https://github.com/apache/datafusion/issues/11743 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Fix `plan_to_sql`: Add wildcard projection to SELECT statement if no projection was set [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11744: URL: https://github.com/apache/datafusion/pull/11744#issuecomment-2262813214 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe

Re: [I] Create `name` of aggregate expression from expressions [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on issue #11707: URL: https://github.com/apache/datafusion/issues/11707#issuecomment-2262827066 It's not clear to me why do we need to handle aliased aggr expr -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitH

Re: [PR] Reuse hash [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on PR #11708: URL: https://github.com/apache/datafusion/pull/11708#issuecomment-2262836799 Let's find out if combining group + repartition is reasonable, if yes, whether to reuse hash should be trivial since we don't need to convert it to record batch anymore -- This

Re: [I] DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 [datafusion]

2024-08-01 Thread via GitHub
alamb commented on issue #11710: URL: https://github.com/apache/datafusion/issues/11710#issuecomment-2262855223 REview Queue: arrow-rs - [ ] Take kernel speed in arrow-rs: https://github.com/apache/arrow-rs/pull/6168 - [ ] bytestrema split: https://github.com/apache/arrow-rs/pu

Re: [PR] Use upstream `DataType::from_str` in arrow-cast [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11254: URL: https://github.com/apache/datafusion/pull/11254 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Use upstream `DataType::from_str` in arrow-cast [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11254: URL: https://github.com/apache/datafusion/pull/11254#issuecomment-2262857165 > lgtm thanks @alamb awesome clean up. errors macros can be added later I was messing around with the PR anyways so I made the changes to use error macros. Thanks again @c

Re: [PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11729: URL: https://github.com/apache/datafusion/pull/11729#issuecomment-2262857833 Thanks @jonahgao for the review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11729: URL: https://github.com/apache/datafusion/pull/11729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] doc warnings about private items [datafusion]

2024-08-01 Thread via GitHub
alamb closed issue #11728: doc warnings about private items URL: https://github.com/apache/datafusion/issues/11728 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscrib

Re: [PR] [Minor] Add test for only nulls (empty) as input in APPROX_PERCENTILE_CONT [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11760: URL: https://github.com/apache/datafusion/pull/11760 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] refactor: move ExecutionPlan and related structs into dedicated mod [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11759: URL: https://github.com/apache/datafusion/pull/11759#discussion_r1700022065 ## datafusion/physical-plan/src/lib.rs: ## @@ -14,46 +14,36 @@ // KIND, either express or implied. See the License for the // specific language governing permissi

Re: [I] Cannot convert Expr to InList [datafusion-python]

2024-08-01 Thread via GitHub
timsaucer commented on issue #781: URL: https://github.com/apache/datafusion-python/issues/781#issuecomment-2262912815 Just for my own education, what are the use cases for invoking `.to_variant()`? I came across these when I was working on python wrapper functions and I don't know how peo

[PR] Single mode for group by values [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 opened a new pull request, #11762: URL: https://github.com/apache/datafusion/pull/11762 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested

Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on issue #11680: URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2262949557 I found I got performance boost for clickbench Q17 just by enforcing single mode for group by, interesting ```rust Comparing main and single-multi-groupby ---

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
edmondop commented on code in PR #11013: URL: https://github.com/apache/datafusion/pull/11013#discussion_r1700094985 ## datafusion/functions-aggregate/src/min_max.rs: ## @@ -15,103 +15,165 @@ // specific language governing permissions and limitations // under the License. -/

Re: [PR] Improve `AccumulatorArgs` by removing the usgaes of `schema` and `input_types` [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on code in PR #11761: URL: https://github.com/apache/datafusion/pull/11761#discussion_r1700124481 ## datafusion/expr/src/function.rs: ## @@ -54,9 +55,6 @@ pub struct AccumulatorArgs<'a> { /// The return type of the aggregate function. pub data_type

Re: [PR] Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on PR #11666: URL: https://github.com/apache/datafusion/pull/11666#issuecomment-2263026860 > I think `schema` is what makes sense as `dfschema` is mostly a logical thing in my mind (as it has a relation name). @xinlifoobar I think it is not clear whether we should

[PR] Minor: add "clickbench extended" queries to slt tests [datafusion]

2024-08-01 Thread via GitHub
alamb opened a new pull request, #11763: URL: https://github.com/apache/datafusion/pull/11763 ## Which issue does this PR close? Related to https://github.com/apache/datafusion/pull/11723 ## Rationale for this change While working to enable `StringView` by default I h

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on code in PR #11013: URL: https://github.com/apache/datafusion/pull/11013#discussion_r1700177718 ## datafusion/functions-aggregate/src/min_max.rs: ## @@ -15,103 +15,165 @@ // specific language governing permissions and limitations // under the License.

Re: [PR] perf: decimal decode improvements [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove merged PR #727: URL: https://github.com/apache/datafusion-comet/pull/727 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [I] Subquery execution under CometTakeOrderedAndProjectExec fails [datafusion-comet]

2024-08-01 Thread via GitHub
viirya closed issue #749: Subquery execution under CometTakeOrderedAndProjectExec fails URL: https://github.com/apache/datafusion-comet/issues/749 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] fix: subquery execution under CometTakeOrderedAndProjectExec should not fail [datafusion-comet]

2024-08-01 Thread via GitHub
viirya merged PR #748: URL: https://github.com/apache/datafusion-comet/pull/748 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] fix: subquery execution under CometTakeOrderedAndProjectExec should not fail [datafusion-comet]

2024-08-01 Thread via GitHub
viirya commented on PR #748: URL: https://github.com/apache/datafusion-comet/pull/748#issuecomment-2263154158 Thanks @andygrove -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc [datafusion-comet]

2024-08-01 Thread via GitHub
viirya merged PR #738: URL: https://github.com/apache/datafusion-comet/pull/738 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [I] [Doc] Update outdated `spark.comet.columnar.shuffle.enabled` configuration doc [datafusion-comet]

2024-08-01 Thread via GitHub
viirya closed issue #737: [Doc] Update outdated `spark.comet.columnar.shuffle.enabled` configuration doc URL: https://github.com/apache/datafusion-comet/issues/737 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL a

Re: [PR] doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc [datafusion-comet]

2024-08-01 Thread via GitHub
viirya commented on PR #738: URL: https://github.com/apache/datafusion-comet/pull/738#issuecomment-2263156002 Thanks @wForget @parthchandra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
edmondop commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263194246 @jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation -- This is an automated message from the Apache Git Se

Re: [I] Evaluate use of selection vectors in scan-filter-join operations [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove commented on issue #745: URL: https://github.com/apache/datafusion-comet/issues/745#issuecomment-2263196669 This paper may have useful information: "Filter Representation in Vectorized Query Execution" https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf -- This is an

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263217466 > @jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation wait, we should use stubs instead of the real f

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1700316675 ## spark/src/main/scala/org/apache/spark/sql/comet/CometRowToColumnarExec.scala: ## @@ -60,8 +62,17 @@ case class CometRowToColumnarExec(child: SparkPlan)

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263247832 You use the name "min" in Max, and "max" in Min for stubs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
edmondop commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263258794 > > @jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation > > wait, we should use stubs instead of the re

Re: [PR] feat: Implement basic version of RLIKE [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove commented on PR #734: URL: https://github.com/apache/datafusion-comet/pull/734#issuecomment-2263264565 Performance is ok, but not great. ``` OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.5.0-41-generic AMD Ryzen 9 7950X3D 16-Core Processor

Re: [PR] Add in user example that compares a two different approaches to UDFs [datafusion-python]

2024-08-01 Thread via GitHub
andygrove merged PR #770: URL: https://github.com/apache/datafusion-python/pull/770 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@d

Re: [PR] Bugfix: Calling count with None arguments [datafusion-python]

2024-08-01 Thread via GitHub
andygrove merged PR #768: URL: https://github.com/apache/datafusion-python/pull/768 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@d

Re: [PR] Add missing exports for wrapper modules [datafusion-python]

2024-08-01 Thread via GitHub
andygrove merged PR #782: URL: https://github.com/apache/datafusion-python/pull/782 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@d

Re: [PR] feat: Implement basic version of RLIKE [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove commented on PR #734: URL: https://github.com/apache/datafusion-comet/pull/734#issuecomment-2263287666 microbenchmark results: ``` OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.5.0-41-generic AMD Ryzen 9 7950X3D 16-Core Processor TPCDS Micr

Re: [PR] feat: Add support for time-zone, 3 & 5 digit years: Cast from string to timestamp. [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove merged PR #704: URL: https://github.com/apache/datafusion-comet/pull/704 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [I] Re-implement support for count in window aggregate [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove closed issue #645: Re-implement support for count in window aggregate URL: https://github.com/apache/datafusion-comet/issues/645 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

Re: [PR] feat: Support count AggregateUDF for window function [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove merged PR #736: URL: https://github.com/apache/datafusion-comet/pull/736 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [PR] fix: skip negative scale checks for creating decimals [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove merged PR #723: URL: https://github.com/apache/datafusion-comet/pull/723 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
jayzhan211 commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263310385 > > > @jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation > > > > > > wait, we should use stubs i

Re: [PR] Add `TrackedMemoryPool` with better error messages on exhaustion [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11665: URL: https://github.com/apache/datafusion/pull/11665 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Derive `Debug` for logical plan nodes [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11757: URL: https://github.com/apache/datafusion/pull/11757 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

[I] Min/Max accumulator not implemented for type Float16 [datafusion]

2024-08-01 Thread via GitHub
b4l opened a new issue, #11764: URL: https://github.com/apache/datafusion/issues/11764 ### Describe the bug When aggregating a `Float64` type column, an internal error occurs with the message in the title. ### To Reproduce ```rust df.aggregate(vec![], vec![min(col("fl

Re: [PR] Move min and max to user defined aggregate function [datafusion]

2024-08-01 Thread via GitHub
edmondop commented on PR #11013: URL: https://github.com/apache/datafusion/pull/11013#issuecomment-2263357834 Do you expect any additional changes ? I fixed the stub function name anyways -- This is an automated message from the Apache Git Service. To respond to the message, please log on

Re: [PR] fix: Add additional required expression for natural join [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11713: URL: https://github.com/apache/datafusion/pull/11713#discussion_r1700383150 ## datafusion/expr/src/logical_plan/builder.rs: ## @@ -1532,13 +1531,28 @@ pub fn wrap_projection_for_join_if_necessary( let need_project = join_keys.iter().any

[PR] [Minor] Short circuit `ApplyFunctionRewrites` if there are no function rewrites [datafusion]

2024-08-01 Thread via GitHub
gruuya opened a new pull request, #11765: URL: https://github.com/apache/datafusion/pull/11765 ## Which issue does this PR close? Relates to #9373 and #9375. ## Rationale for this change I'm dealing with a situation where we [have deeply nested plans](https://github.com/

[I] CreateNamedStruct fails at runtime on dictionary-encoded string arrays [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove opened a new issue, #750: URL: https://github.com/apache/datafusion-comet/issues/750 ### Describe the bug I ran this query: ```sql SELECT struct(i_item_sk, i_rec_start_date, i_rec_end_date, i_product_name, i_item_desc) as item_detail FROM item; ```

[I] Support string concat `||` for `StringViewArray` [datafusion]

2024-08-01 Thread via GitHub
alamb opened a new issue, #11766: URL: https://github.com/apache/datafusion/issues/11766 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** ```shell datafusion-cli ``` Then run ```sql DataFusion CLI v40.0.0

[I] Warnings about illegal reflective access [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove opened a new issue, #751: URL: https://github.com/apache/datafusion-comet/issues/751 ### Describe the bug I noticed this while running some benchmarks. I assume that this is a side-effect of the changes in https://github.com/apache/datafusion-comet/pull/723. We may w

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-08-01 Thread via GitHub
Kimahriman commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1700429573 ## native/spark-expr/src/structs.rs: ## @@ -125,3 +125,103 @@ impl PartialEq for CreateNamedStruct { .unwrap_or(false) } } + +#[derive(Debu

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-08-01 Thread via GitHub
Kimahriman commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1700431011 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -59,12 +59,13 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde wit

Re: [PR] Minor: add "clickbench extended" queries to slt tests [datafusion]

2024-08-01 Thread via GitHub
comphead commented on code in PR #11763: URL: https://github.com/apache/datafusion/pull/11763#discussion_r1700431300 ## datafusion/sqllogictest/test_files/clickbench.slt: ## @@ -274,5 +274,23 @@ query PI SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-08-01 Thread via GitHub
Kimahriman commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1700431812 ## spark/src/main/scala/org/apache/spark/sql/comet/CometBatchScanExec.scala: ## @@ -152,3 +153,17 @@ case class CometBatchScanExec(wrapped: BatchScanExec, ru

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-08-01 Thread via GitHub
Kimahriman commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1700432052 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1756,4 +1756,21 @@ class CometExpressionSuite extends CometTestBase with Adaptive

Re: [I] [Proposal] Decouple logical from physical types [datafusion]

2024-08-01 Thread via GitHub
notfilippo commented on issue #11513: URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2263436262 This past week I've started some work on transitioning Scalar Types to being strictly logical. Most of the refactoring work is done and now I'm reworking some of the casting

[I] COUNT(DISTINCT) on StringView panics: `unreachable code: Utf8/Binary should use ArrowBytesSet` [datafusion]

2024-08-01 Thread via GitHub
alamb opened a new issue, #11767: URL: https://github.com/apache/datafusion/issues/11767 **Describe the bug** I found one of the clickbench extended queries panics when using StringView -- https://github.com/apache/datafusion/pull/11723 **To Reproduce** ``` > create table foo

Re: [PR] Minor: add "clickbench extended" queries to slt tests [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11763: URL: https://github.com/apache/datafusion/pull/11763#discussion_r1700492887 ## datafusion/sqllogictest/test_files/clickbench.slt: ## @@ -274,5 +274,23 @@ query PI SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*)

Re: [PR] Minor: add "clickbench extended" queries to slt tests [datafusion]

2024-08-01 Thread via GitHub
alamb merged PR #11763: URL: https://github.com/apache/datafusion/pull/11763 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Minor: add "clickbench extended" queries to slt tests [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11763: URL: https://github.com/apache/datafusion/pull/11763#issuecomment-2263493350 Thanks for the review @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

[PR] chore: Assert that parquet files are written with dictionary-encoded data [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove opened a new pull request, #752: URL: https://github.com/apache/datafusion-comet/pull/752 ## Which issue does this PR close? N/A ## Rationale for this change Many tests use the following pattern to cover testing with dictionary-encoded data.

[I] Many tests appear to test dictionary-encoded arrays but actually do not [datafusion-comet]

2024-08-01 Thread via GitHub
andygrove opened a new issue, #753: URL: https://github.com/apache/datafusion-comet/issues/753 ### Describe the bug Many tests use the following pattern to cover testing with dictionary-encoded data. ```scala Seq("true", "false").foreach { dictionary => withPa

Re: [I] COUNT(DISTINCT) on StringView panics: `unreachable code: Utf8/Binary should use ArrowBytesSet` [datafusion]

2024-08-01 Thread via GitHub
XiangpengHao commented on issue #11767: URL: https://github.com/apache/datafusion/issues/11767#issuecomment-2263518796 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
XiangpengHao opened a new pull request, #11768: URL: https://github.com/apache/datafusion/pull/11768 ## Which issue does this PR close? Closes #11767 ## Rationale for this change It is a typo ## What changes are included in this PR? ## Are th

Re: [I] Warnings about illegal reflective access [datafusion-comet]

2024-08-01 Thread via GitHub
parthchandra commented on issue #751: URL: https://github.com/apache/datafusion-comet/issues/751#issuecomment-2263569717 I don't think reverting this is needed. This warning can be suppressed with the JVM option [--add-opens](https://docs.oracle.com/en/java/javase/17/migrate/migrating-j

Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11768: URL: https://github.com/apache/datafusion/pull/11768#issuecomment-2263581921 Thank you @XiangpengHao -- I ahve some tests written for this that I will push to this branch -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
alamb commented on code in PR #11768: URL: https://github.com/apache/datafusion/pull/11768#discussion_r1700549078 ## datafusion/functions-aggregate/src/count.rs: ## @@ -237,14 +237,17 @@ impl AggregateUDFImpl for Count { Box::new(BytesDistinctCountAccumulator::

Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
XiangpengHao commented on PR #11768: URL: https://github.com/apache/datafusion/pull/11768#issuecomment-2263635809 Do you @alamb have any thoughts on why clickbench query 5 won't previously panic? ```sql SELECT COUNT(DISTINCT "SearchPhrase") FROM hits; ``` I find only whe

Re: [I] Support string concat `||` for `StringViewArray` [datafusion]

2024-08-01 Thread via GitHub
dharanad commented on issue #11766: URL: https://github.com/apache/datafusion/issues/11766#issuecomment-2263646758 i'm really interested in working on this issue. Feeling little doubtful, but let me try. I'm might seek some help, if not i will unassign myself . -- This is an automated mes

[PR] [Minor] Refactor approx_percentile [datafusion]

2024-08-01 Thread via GitHub
Dandandan opened a new pull request, #11769: URL: https://github.com/apache/datafusion/pull/11769 ## Which issue does this PR close? Some minor refactoring * Change count to be of type `u64` * Some small other misc type improvements ## Rationale for this change

Re: [PR] fix: Remove castting on decimals with a small precision to decimal256 [datafusion-comet]

2024-08-01 Thread via GitHub
kazuyukitanimura commented on PR #741: URL: https://github.com/apache/datafusion-comet/pull/741#issuecomment-2263656950 Thank you @andygrove merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] fix: Remove castting on decimals with a small precision to decimal256 [datafusion-comet]

2024-08-01 Thread via GitHub
kazuyukitanimura merged PR #741: URL: https://github.com/apache/datafusion-comet/pull/741 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubsc

Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
alamb commented on PR #11768: URL: https://github.com/apache/datafusion/pull/11768#issuecomment-2263684224 > Do you @alamb have any thoughts on why clickbench query 5 won't previously panic? Yes, I think it is because ``` SELECT COUNT(DISTINCT "SearchPhrase") FROM hits; ```

Re: [PR] Fix bug that `COUNT(DISTINCT)` on StringView panics [datafusion]

2024-08-01 Thread via GitHub
XiangpengHao commented on PR #11768: URL: https://github.com/apache/datafusion/pull/11768#issuecomment-2263689951 > Is rewritten to a group by without distinct here I see, that makes sense -- This is an automated message from the Apache Git Service. To respond to the message, please

  1   2   >