westonpace opened a new issue, #34136:
URL: https://github.com/apache/arrow/issues/34136

   ### Describe the enhancement requested
   
   Every node has an "ordering" which describes what the batch index of the 
batches produced by that node corresponds to.
   
   Source nodes will generally have an ordering based on the data they are 
generating.  For in-memory data this will be implicit "row number" ordering 
(done in this PR).  For the scanner this will generally be implicit "file 
number / row number" ordering (future PR).  In both cases the user should 
eventually be able to assert an explicit ordering if they know for a fact that 
their data is ordered (future PR).  If the data is partitioned the scanner 
could generate an explicit ordering based on the partition columns (future PR).
   
   Mapping nodes like filter/project/fetch will simply propagate the ordering 
of their input.
   
   Some nodes, such as order_by (future PR) and asof join (this PR) will 
establish a new ordering.
   
   Other nodes, such as the aggregate node, the hash-join node, and the union 
node will destroy the ordering (this PR).  Their output will have no meaningful 
order.  Although, in all these cases one could devise an alternative 
order-preserving approach (future PR):
   
    * An aggregate node could pretty easily establish an ordering based on the 
keys or an implicit ordering if there are no keys (if you only have 1 row then 
it is ordered :)
    * A hash-join node could similarly choose to order by one or more of the 
join keys (unclear at the moment if there would be much cost associated with 
this)
    * To preserve ordering, a union node would need to read from all inputs at 
the same time and output in round-robin fashion.  Alternatively, a union node 
could choose to fully drain one input before starting on the next input.  Both 
approaches would have performance implications
   
   The fetch node will now fail to validate if it is placed in a plan that does 
not make sense (e.g. immediately following a hash-join node).
   
   The sink node, if sequenced output is requested, will similarly fail to 
validate if the plan does not have an established ordering.
   
   This all sounds far more complex than it is :)
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to