westonpace opened a new issue, #34265:
URL: https://github.com/apache/arrow/issues/34265

   ### Describe the enhancement requested
   
   This is a parent issue describing a general feature that should probably end 
up being composed of multiple PRs.  I'm not in any rush to implement this 
myself but recently read up on pivoting and wanted to layout a basic framework 
should someone want to come tackle this in the future.
   
   I think the most intuitive understanding of the pivot methods is probably 
tidyr's categorization of 
[pivot_wider](https://tidyr.tidyverse.org/reference/pivot_wider.html) and 
[pivot_longer](https://tidyr.tidyverse.org/reference/pivot_longer.html).  
However, much of the complexity of those methods is simply figuring out the 
pivot specification.
   
   ### Pivot Wider
   
   This is the more complex of the two operations.  However, it is something 
that can (mostly) be done via traditional operators.  Each new column is first 
created by a projection of `CASE WHEN <names_col> = <name> THEN <value_col> 
ELSE NULL`.  These projections are then fed into an aggregation (using some 
appropriate aggregation function, ala tidyr's `values_fn`).  The aggregation's 
grouping keys are what tidyr calls `id_cols` (by default this is every column 
that isn't a name or a value).  Any column that is not a name, value, or id is 
dropped (this is consistent with tidyr).
   
   However, in order to create the above query, you would need to know all 
possible values of `names_col` which is not something that can be derived from 
the schema.  This is why pivot wider is not traditionally an SQL operation.  
The typical workaround is to scan the data twice (e.g. this is what tidyr's 
implementation does).  First to grab the names and second to do the grouping.
   
   SQL server offers us some syntactic sugar that is easier to type up than all 
of the complex "case when" queries in the form of 
[PIVOT](https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver16).
  If we are going to do anything in Acero for Pivot it should be supporting 
this syntactic sugar node.
   
    * [ ] Create a node factory in Acero that operates similarly to SQL 
server's PIVOT
   
   ### Pivot Longer
   
   By contrast, pivot longer is a much simpler operation.  In fact, it can be a 
very cheap zero-copy operation (although it significantly increases the # of 
cells so any downstream operations on top of the pivoted data will inevitably 
be more expensive).  However, there is no way to achieve pivot longer with 
traditional SQL and it will need a dedicated node.
   
    * [ ] Create a pivot_longer node
   
   SQL server refers to this as UNPIVOT and we should similarly add something 
that has similar syntax and maps to the basic pivot longer node.
   
    * [ ] Create a node factory in Acero that operates similarly to SQL 
server's UNPIVOT
   
   Finally, tidyr's implementation has some nice convenience functions which 
are able to create pivot specifications from the schema (the most complex of 
these applies regular expressions to column names).  These could be implemented 
once in C++ and exposed as helper functions to help create an options object 
for the node.
   
    * [ ] Create a utility function that will help create a pivot longer 
specification from a schema.
   
   Also, tidyr has a rather annoying `cols_vary` property which determines the 
output row order.  This has two options, `fastest` and `slowest`.  The most 
performant way to pivot longer, using zero copy, is a third way to do this 
which does not quite respect either of these choices.  It would be fairly 
straightforward to implement `fastest` by interleaving the data from the excess 
record batches though this would incur the cost of one copy.
   
    * [ ] Add support for interleaving rows when pivoting longer similar to 
tidyr's `cols_vary=fastest`.
   
   The `cols_vary=slowest` option would require accumulating all data and be 
quite costly.  Though a lower priority and something that should generally be 
discouraged it would be possible to implement.
   
    * [ ] Add support for grouping by row templates when pivoting longer, 
similar to tidyr's `cols_vary=slowest`.
   
   During implementation some other misc future work might pop up that could be 
captured in this parent issue:
   
    * [ ] Add support for nested column references in the pivot longer node.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to