[GitHub] [arrow] westonpace opened a new issue, #34265: [C++] Pivot Support in Acero

via GitHub Mon, 20 Feb 2023 12:43:46 -0800


westonpace opened a new issue, #34265:
URL: https://github.com/apache/arrow/issues/34265

### Describe the enhancement requested

This is a parent issue describing a general feature that should probably end
up being composed of multiple PRs. I'm not in any rush to implement this
myself but recently read up on pivoting and wanted to layout a basic framework
should someone want to come tackle this in the future.

I think the most intuitive understanding of the pivot methods is probably
tidyr's categorization of
[pivot_wider](https://tidyr.tidyverse.org/reference/pivot_wider.html) and
[pivot_longer](https://tidyr.tidyverse.org/reference/pivot_longer.html).
However, much of the complexity of those methods is simply figuring out the
pivot specification.

### Pivot Wider

This is the more complex of the two operations. However, it is something
that can (mostly) be done via traditional operators. Each new column is first
created by a projection of `CASE WHEN <names_col> = <name> THEN <value_col>
ELSE NULL`. These projections are then fed into an aggregation (using some
appropriate aggregation function, ala tidyr's `values_fn`). The aggregation's
grouping keys are what tidyr calls `id_cols` (by default this is every column
that isn't a name or a value). Any column that is not a name, value, or id is
dropped (this is consistent with tidyr).

However, in order to create the above query, you would need to know all
possible values of `names_col` which is not something that can be derived from
the schema. This is why pivot wider is not traditionally an SQL operation.
The typical workaround is to scan the data twice (e.g. this is what tidyr's
implementation does). First to grab the names and second to do the grouping.

SQL server offers us some syntactic sugar that is easier to type up than all
of the complex "case when" queries in the form of
[PIVOT](https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver16).
If we are going to do anything in Acero for Pivot it should be supporting
this syntactic sugar node.

* [ ] Create a node factory in Acero that operates similarly to SQL
server's PIVOT

### Pivot Longer

By contrast, pivot longer is a much simpler operation. In fact, it can be a
very cheap zero-copy operation (although it significantly increases the # of
cells so any downstream operations on top of the pivoted data will inevitably
be more expensive). However, there is no way to achieve pivot longer with
traditional SQL and it will need a dedicated node.

* [ ] Create a pivot_longer node

SQL server refers to this as UNPIVOT and we should similarly add something
that has similar syntax and maps to the basic pivot longer node.

* [ ] Create a node factory in Acero that operates similarly to SQL
server's UNPIVOT

Finally, tidyr's implementation has some nice convenience functions which
are able to create pivot specifications from the schema (the most complex of
these applies regular expressions to column names). These could be implemented
once in C++ and exposed as helper functions to help create an options object
for the node.

* [ ] Create a utility function that will help create a pivot longer
specification from a schema.

Also, tidyr has a rather annoying `cols_vary` property which determines the
output row order. This has two options, `fastest` and `slowest`. The most
performant way to pivot longer, using zero copy, is a third way to do this
which does not quite respect either of these choices. It would be fairly
straightforward to implement `fastest` by interleaving the data from the excess
record batches though this would incur the cost of one copy.

* [ ] Add support for interleaving rows when pivoting longer similar to
tidyr's `cols_vary=fastest`.

The `cols_vary=slowest` option would require accumulating all data and be
quite costly. Though a lower priority and something that should generally be
discouraged it would be possible to implement.

* [ ] Add support for grouping by row templates when pivoting longer,
similar to tidyr's `cols_vary=slowest`.

During implementation some other misc future work might pop up that could be
captured in this parent issue:

* [ ] Add support for nested column references in the pivot longer node.

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace opened a new issue, #34265: [C++] Pivot Support in Acero

Reply via email to