westonpace opened a new issue, #34265: URL: https://github.com/apache/arrow/issues/34265
### Describe the enhancement requested This is a parent issue describing a general feature that should probably end up being composed of multiple PRs. I'm not in any rush to implement this myself but recently read up on pivoting and wanted to layout a basic framework should someone want to come tackle this in the future. I think the most intuitive understanding of the pivot methods is probably tidyr's categorization of [pivot_wider](https://tidyr.tidyverse.org/reference/pivot_wider.html) and [pivot_longer](https://tidyr.tidyverse.org/reference/pivot_longer.html). However, much of the complexity of those methods is simply figuring out the pivot specification. ### Pivot Wider This is the more complex of the two operations. However, it is something that can (mostly) be done via traditional operators. Each new column is first created by a projection of `CASE WHEN <names_col> = <name> THEN <value_col> ELSE NULL`. These projections are then fed into an aggregation (using some appropriate aggregation function, ala tidyr's `values_fn`). The aggregation's grouping keys are what tidyr calls `id_cols` (by default this is every column that isn't a name or a value). Any column that is not a name, value, or id is dropped (this is consistent with tidyr). However, in order to create the above query, you would need to know all possible values of `names_col` which is not something that can be derived from the schema. This is why pivot wider is not traditionally an SQL operation. The typical workaround is to scan the data twice (e.g. this is what tidyr's implementation does). First to grab the names and second to do the grouping. SQL server offers us some syntactic sugar that is easier to type up than all of the complex "case when" queries in the form of [PIVOT](https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver16). If we are going to do anything in Acero for Pivot it should be supporting this syntactic sugar node. * [ ] Create a node factory in Acero that operates similarly to SQL server's PIVOT ### Pivot Longer By contrast, pivot longer is a much simpler operation. In fact, it can be a very cheap zero-copy operation (although it significantly increases the # of cells so any downstream operations on top of the pivoted data will inevitably be more expensive). However, there is no way to achieve pivot longer with traditional SQL and it will need a dedicated node. * [ ] Create a pivot_longer node SQL server refers to this as UNPIVOT and we should similarly add something that has similar syntax and maps to the basic pivot longer node. * [ ] Create a node factory in Acero that operates similarly to SQL server's UNPIVOT Finally, tidyr's implementation has some nice convenience functions which are able to create pivot specifications from the schema (the most complex of these applies regular expressions to column names). These could be implemented once in C++ and exposed as helper functions to help create an options object for the node. * [ ] Create a utility function that will help create a pivot longer specification from a schema. Also, tidyr has a rather annoying `cols_vary` property which determines the output row order. This has two options, `fastest` and `slowest`. The most performant way to pivot longer, using zero copy, is a third way to do this which does not quite respect either of these choices. It would be fairly straightforward to implement `fastest` by interleaving the data from the excess record batches though this would incur the cost of one copy. * [ ] Add support for interleaving rows when pivoting longer similar to tidyr's `cols_vary=fastest`. The `cols_vary=slowest` option would require accumulating all data and be quite costly. Though a lower priority and something that should generally be discouraged it would be possible to implement. * [ ] Add support for grouping by row templates when pivoting longer, similar to tidyr's `cols_vary=slowest`. During implementation some other misc future work might pop up that could be captured in this parent issue: * [ ] Add support for nested column references in the pivot longer node. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
