westonpace opened a new issue, #34136: URL: https://github.com/apache/arrow/issues/34136
### Describe the enhancement requested Every node has an "ordering" which describes what the batch index of the batches produced by that node corresponds to. Source nodes will generally have an ordering based on the data they are generating. For in-memory data this will be implicit "row number" ordering (done in this PR). For the scanner this will generally be implicit "file number / row number" ordering (future PR). In both cases the user should eventually be able to assert an explicit ordering if they know for a fact that their data is ordered (future PR). If the data is partitioned the scanner could generate an explicit ordering based on the partition columns (future PR). Mapping nodes like filter/project/fetch will simply propagate the ordering of their input. Some nodes, such as order_by (future PR) and asof join (this PR) will establish a new ordering. Other nodes, such as the aggregate node, the hash-join node, and the union node will destroy the ordering (this PR). Their output will have no meaningful order. Although, in all these cases one could devise an alternative order-preserving approach (future PR): * An aggregate node could pretty easily establish an ordering based on the keys or an implicit ordering if there are no keys (if you only have 1 row then it is ordered :) * A hash-join node could similarly choose to order by one or more of the join keys (unclear at the moment if there would be much cost associated with this) * To preserve ordering, a union node would need to read from all inputs at the same time and output in round-robin fashion. Alternatively, a union node could choose to fully drain one input before starting on the next input. Both approaches would have performance implications The fetch node will now fail to validate if it is placed in a plan that does not make sense (e.g. immediately following a hash-join node). The sink node, if sequenced output is requested, will similarly fail to validate if the plan does not have an established ordering. This all sounds far more complex than it is :) ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org