hhhizzz opened a new issue, #23027: URL: https://github.com/apache/datafusion/issues/23027
### Describe the bug

## Problem

After `73e3c2a617` / #22646 (`chore: Add primary key constraints for TPC-H, TPC-DS`), TPC-DS q39 shows a large performance regression.

The regression appears related to SQL aggregate planning with functional dependencies from primary key constraints. In q39, the query groups by key columns such as:

- `item.i_item_sk`
- `warehouse.w_warehouse_sk`

After primary key constraints are present, the optimized aggregate plan expands the `GROUP BY` keys with many functionally dependent columns from those tables, even though the query does not need those columns after aggregation. Examples observed in the plan include:

- `item.i_item_id`
- `item.i_product_name`
- `warehouse.w_gmt_offset`

This makes the aggregate keys much wider and also causes extra columns to be projected from scans and carried through joins/aggregation.

## Regression Shape

The regression pattern is:

1. TPC-DS table schemas include primary key constraints.
2. SQL planning recognizes functional dependencies from those constraints.
3. Aggregate planning expands grouped primary key columns into dependent columns.
4. The expansion includes columns that are not referenced by the query output.
5. The plan carries much wider group keys than needed.
6. q39 runtime increases substantially.

This looks like a planner-level issue rather than a Parquet reader issue: disabling the TPC-DS primary key constraints makes q39 return to the previous timing range.
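
To make the regression shape above concrete, here is a minimal Python model of steps 3 to 5. It is only an illustration, not DataFusion code: the column names are the ones quoted in this issue, while `FUNCTIONAL_DEPENDENCIES`, `expand_group_keys`, and `prune_group_keys` are hypothetical names, and the empty "referenced" set is an assumption based on the observation that the query does not need those columns after aggregation.

```python
# Illustrative model only. Column names come from the issue text; the
# helpers below are hypothetical and are NOT DataFusion APIs.

# Primary key -> columns it functionally determines (tiny subset for brevity).
FUNCTIONAL_DEPENDENCIES = {
    "item.i_item_sk": ["item.i_item_id", "item.i_product_name"],
    "warehouse.w_warehouse_sk": ["warehouse.w_gmt_offset"],
}


def expand_group_keys(group_keys, deps):
    """Append every column determined by a grouped key (the regressed shape)."""
    expanded = list(group_keys)
    for key in group_keys:
        for col in deps.get(key, []):
            if col not in expanded:
                expanded.append(col)
    return expanded


def prune_group_keys(expanded, group_keys, referenced):
    """Keep the original keys plus only the dependent columns actually used."""
    return [c for c in expanded if c in group_keys or c in referenced]


group_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
# Assumption: none of the dependent columns are referenced after aggregation.
referenced_after_aggregate = set()

expanded = expand_group_keys(group_keys, FUNCTIONAL_DEPENDENCIES)
pruned = prune_group_keys(expanded, group_keys, referenced_after_aggregate)

print("expanded:", expanded)  # 5 group keys in this toy model
print("pruned:  ", pruned)    # back to the 2 keys the query wrote
```

In this toy model the expansion grows the group key from two columns to five, and pruning by what is referenced downstream restores the original two. If a dependent column such as `item.i_item_id` were selected after the aggregate, adding it to `referenced_after_aggregate` would keep just that one extra key.
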
## Benchmark Results

Environment:

```text
TPC-DS SF10
CPU: 24 Cores
Rounds: 10
Iterations: 1
Parquet pushdown filters: true
Parquet reorder filters: true
Parquet pruning: true
```

With TPC-DS primary key constraints enabled:

```text
q39 current mean: ~8301 ms
```

With TPC-DS primary key constraints disabled for diagnosis:

```text
q39 current total: 14288.69 ms over 10 rounds
q39 current mean: ~1428.87 ms
geomean current/main: 0.983399
failures: 0
```

So q39 is roughly:

```text
~8301 ms -> ~1429 ms
```

when primary key constraints are removed from the TPC-DS schema setup, a difference of about 5.8x.

## Expected Behavior

Functional dependency support should allow queries to select columns determined by grouped keys, but aggregate planning should not add unreferenced functionally dependent columns to the physical/logical group keys. Only columns actually required after aggregation should need to appear in aggregate output/grouping.

### To Reproduce

Run TPC-DS q39 before and after https://github.com/apache/datafusion/pull/22646.

### Expected behavior

_No response_

### Additional context

_No response_
