gene-bordegaray opened a new issue, #21992:
URL: https://github.com/apache/datafusion/issues/21992

   I’d like to restart the partitioning side of this discussion separately from 
the dynamic-filter PRs and threads (see #21207 for more background).
   
   The main issue seems to be that DataFusion cannot always represent the 
physical partitioning that some data sources actually have. In our case, 
range-partitioned data has had to claim `Partitioning::Hash`, which lets us 
avoid repartitioning but makes later optimizer and dynamic-filter decisions 
brittle due to this false advertisement.
   
   Today we are in this scenario:
   
   ```text
   Data is actually range-partitioned:
   - partition 0: key < 100
   - partition 1: 100 <= key < 200
   - partition 2: 200 <= key < 300
   
   To mimic this we tell DataFusion our partitioning properties are:
   - Partitioning::Hash([key], 3)
   ```
   
   Those are not the same partitioning scheme.
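   
   To make the mismatch concrete, here is a minimal sketch that assigns the 
same keys to partitions under both schemes. The function names are made up for 
illustration, and plain modulo is only a simplified stand-in for a real hash 
function, so the exact hash-side numbers are an assumption:

```rust
/// Partition index for the range layout from the example above:
/// p0: key < 100, p1: 100 <= key < 200, p2: 200 <= key < 300.
fn range_partition(key: i64) -> Option<usize> {
    // Exclusive upper bound of each partition, in ascending order.
    const UPPER_BOUNDS: [i64; 3] = [100, 200, 300];
    // The first bound the key falls under decides the partition;
    // keys >= 300 belong to no partition.
    UPPER_BOUNDS.iter().position(|&upper| key < upper)
}

/// Simplified stand-in for hash partitioning (assumption: plain modulo
/// instead of a real hash function).
fn modulo_partition(key: i64, partitions: usize) -> usize {
    key.rem_euclid(partitions as i64) as usize
}

fn main() {
    for key in [99, 100, 150, 250] {
        println!(
            "key {key}: range -> {:?}, modulo -> {}",
            range_partition(key),
            modulo_partition(key, 3)
        );
    }
}
```

   Neighbouring keys stay together under the range layout but are scattered by 
the hash-style layout, so anything that reasons about "which keys live in 
partition N" gets a wrong answer when range data claims to be hash-partitioned.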
   
   The general goal is for DataFusion to describe physical partitioning 
truthfully, compare two partitionings for compatibility, and use that 
information only when compatibility is proven.
   
   Dynamic filtering illustrates this well:
   
   ```text
   Build side partitioning:
     p0: key < 100
     p1: 100 <= key < 200
     p2: 200 <= key < 300
   
   Probe side partitioning:
     p0: key < 100
     p1: 100 <= key < 200 
     p2: 200 <= key < 300
   ```
   
   These are compatible because every partition X on the build side covers the 
same key range as partition X on the probe side. With this information, 
optimizations can safely use partition-local behavior:
   
   ```text
   if partitioning is compatible:
     use partition-local filter (more selective)
   else:
     use global/safe fallback
   ```
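   
   The decision above could be sketched in Rust roughly as follows. All of the 
types and functions here (`RangePartitioning`, `FilterScope`, `is_compatible`, 
`choose_filter`) are hypothetical names for illustration, not existing 
DataFusion APIs, and the compatibility check is deliberately the most 
conservative one possible (exact boundary equality):

```rust
/// Hypothetical description of a range partitioning on the join key:
/// `upper_bounds[i]` is the exclusive upper bound of partition `i`,
/// in ascending order.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RangePartitioning {
    upper_bounds: Vec<i64>,
}

/// Which dynamic filter a probe-side partition may use.
#[derive(Debug, PartialEq, Eq)]
enum FilterScope {
    /// Filter built only from the matching build-side partition.
    PartitionLocal,
    /// Filter built from all build-side partitions (always safe).
    Global,
}

/// Conservative check: compatible only when both sides use exactly the
/// same boundaries, so partition i covers the same keys on both sides.
fn is_compatible(build: &RangePartitioning, probe: &RangePartitioning) -> bool {
    build == probe
}

fn choose_filter(build: &RangePartitioning, probe: &RangePartitioning) -> FilterScope {
    if is_compatible(build, probe) {
        FilterScope::PartitionLocal
    } else {
        FilterScope::Global
    }
}

fn main() {
    let build = RangePartitioning { upper_bounds: vec![100, 200, 300] };
    let probe = build.clone();
    let shifted = RangePartitioning { upper_bounds: vec![150, 250, 300] };

    // Identical boundaries: the partition-local filter is safe.
    println!("{:?}", choose_filter(&build, &probe));
    // Different boundaries: fall back to the global filter.
    println!("{:?}", choose_filter(&build, &shifted));
}
```

   A real check could be more permissive than strict equality, but the 
fallback branch means an overly strict check only costs selectivity, never 
correctness.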
   
   A possible direction is to evolve `Partitioning` toward an extensible 
abstraction, such as a `PhysicalPartitioning` trait, and add range partitioning 
as the first use case. Longer term, built-ins like hash partitioning could move 
behind the same abstraction.
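   
   As a rough sketch of what such a trait might look like: only the name 
`PhysicalPartitioning` comes from the proposal above. The methods, the two 
implementing structs, and the downcast-based comparison are all assumptions 
made for illustration, not a proposed final API:

```rust
use std::any::Any;

/// Hypothetical trait; the methods below are assumptions.
trait PhysicalPartitioning {
    fn partition_count(&self) -> usize;
    /// Lets an implementation downcast `other` to its own concrete type.
    fn as_any(&self) -> &dyn Any;
    /// True only when partition i of `self` and partition i of `other`
    /// are proven to hold the same keys.
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool;
}

/// `upper_bounds[i]` is the exclusive upper bound of partition i.
#[derive(PartialEq)]
struct RangePartitioning {
    upper_bounds: Vec<i64>,
}

/// Simplified: a real version would also compare key expressions
/// and the hash function in use.
#[derive(PartialEq)]
struct HashPartitioning {
    partitions: usize,
}

impl PhysicalPartitioning for RangePartitioning {
    fn partition_count(&self) -> usize {
        self.upper_bounds.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool {
        // A different or unknown scheme is never assumed compatible.
        other.as_any().downcast_ref::<Self>().map_or(false, |o| self == o)
    }
}

impl PhysicalPartitioning for HashPartitioning {
    fn partition_count(&self) -> usize {
        self.partitions
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool {
        other.as_any().downcast_ref::<Self>().map_or(false, |o| self == o)
    }
}

fn main() {
    let range = RangePartitioning { upper_bounds: vec![100, 200, 300] };
    let hash = HashPartitioning { partitions: 3 };

    // Same partition count, different scheme: not compatible.
    println!("counts: {} vs {}", range.partition_count(), hash.partition_count());
    println!("range vs hash: {}", range.is_compatible_with(&hash));
    println!("range vs range: {}", range.is_compatible_with(&range));
}
```

   The point of the sketch is that a source can describe itself truthfully as 
range-partitioned, and a consumer that does not recognise the scheme simply 
gets `false` rather than a misleading `Hash` claim.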
   
   This would give us a cleaner foundation for:
   - range/file partitioned sources
   - partition-local dynamic filters when both sides share the same partition 
map (#21832)
   
   Does this direction seem interesting to others in the community, and are 
there any thoughts on this proposal?
   
   cc: @NGA-TRAN @alamb @jayshrivastava @gabotechs @adriangb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

