gene-bordegaray opened a new issue, #21992:
URL: https://github.com/apache/datafusion/issues/21992
I’d like to restart the partitioning side of this discussion separately from
the dynamic-filter PRs and threads (see #21207 for more background).
The main issue seems to be that DataFusion cannot always represent the
physical partitioning that some data sources actually have. In our case,
range-partitioned data has had to claim `Partitioning::Hash`, which lets us
avoid repartitioning but makes later optimizer and dynamic-filter decisions
brittle due to this false advertisement.
Today we are in this scenario:
```text
Data is actually range-partitioned:
- partition 0: key < 100
- partition 1: 100 <= key < 200
- partition 2: 200 <= key < 300
To mimic this we tell DataFusion our partitioning properties are:
- Partitioning::Hash([key], 3)
```
Those are not the same partitioning scheme.
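To make the mismatch concrete, here is a minimal sketch of how the same key is assigned to a partition under each scheme. Both helpers (`range_partition`, `hash_partition`) are hypothetical names for illustration, and the hash shown uses the standard library hasher, not DataFusion's own hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Partition index under the range layout from the example above.
/// `upper_bounds` holds the exclusive upper bound of each partition in
/// ascending order, e.g. [100, 200, 300]. Returns None if the key is
/// outside every range.
fn range_partition(key: i64, upper_bounds: &[i64]) -> Option<usize> {
    upper_bounds.iter().position(|&ub| key < ub)
}

/// Partition index under a generic hash layout with `n` partitions.
/// Illustrative only: this is NOT the hash function DataFusion uses.
fn hash_partition(key: i64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}
```

Under the range layout every key below 100 lands in partition 0, while a hash layout scatters those same keys across all three partitions, so a consumer that trusts the `Hash` claim reasons about the wrong rows per partition.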
The general goal is for DataFusion to describe physical partitioning
truthfully, compare two partitionings for compatibility, and use that
information only when compatibility is proven.
Dynamic filtering illustrates this well:
```text
Build side partitioning:
p0: key < 100
p1: 100 <= key < 200
p2: 200 <= key < 300
Probe side partitioning:
p0: key < 100
p1: 100 <= key < 200
p2: 200 <= key < 300
```
These are compatible because every partition X on the build side covers the
same key range as partition X on the probe side. With this information,
optimizations can safely use partition-local behavior:
```text
if partitioning is compatible:
    use partition-local filter (more selective)
else:
    use global/safe fallback
```
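The decision above could be sketched roughly as follows. This assumes each side's range layout is summarized by its ordered exclusive upper bounds over the join key, with `None` meaning the layout is unknown; `FilterScope` and `choose_filter_scope` are hypothetical names, not existing DataFusion APIs.

```rust
/// Which kind of dynamic filter an operator may safely apply.
#[derive(Debug, PartialEq)]
enum FilterScope {
    /// Partition X of the probe side is filtered using only partition X
    /// of the build side.
    PartitionLocal,
    /// A single filter derived from all build partitions (safe fallback).
    Global,
}

/// Pick the filter scope from the two sides' range layouts.
/// Each layout is the list of exclusive upper bounds per partition;
/// `None` means the layout is unknown or cannot be described as ranges.
fn choose_filter_scope(build: Option<&[i64]>, probe: Option<&[i64]>) -> FilterScope {
    match (build, probe) {
        // Compatibility is proven only when both layouts are known and
        // every partition has identical bounds.
        (Some(b), Some(p)) if b == p => FilterScope::PartitionLocal,
        // Anything else falls back to the global filter.
        _ => FilterScope::Global,
    }
}
```

The important property is that the partition-local path is taken only on a positive proof of compatibility; an unknown or mismatched layout never reaches it.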
A possible direction is to evolve `Partitioning` toward an extensible
abstraction, such as a `PhysicalPartitioning` trait, and add range partitioning
as the first use case. Longer term, built-ins like hash partitioning could move
behind the same abstraction.
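
As a starting point for discussion, one possible shape for such a trait is sketched below. Everything here is an assumption for illustration: the method names, the `RangePartitioning` and `HashPartitioning` structs, and the use of `Any` for downcasting are not existing DataFusion APIs, and a real design would need to cover expressions, sort order, and equivalence classes.

```rust
use std::any::Any;
use std::fmt::Debug;

/// Hypothetical extensible description of a physical partitioning.
trait PhysicalPartitioning: Debug {
    /// Number of output partitions.
    fn partition_count(&self) -> usize;
    /// Allows implementations to downcast `other` in compatibility checks.
    fn as_any(&self) -> &dyn Any;
    /// True only when partition X of `self` is proven to hold the same
    /// keys as partition X of `other`.
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool;
}

/// Range partitioning described by exclusive upper bounds, in order.
#[derive(Debug)]
struct RangePartitioning {
    upper_bounds: Vec<i64>,
}

/// Hash partitioning reduced to its partition count for this sketch.
#[derive(Debug)]
struct HashPartitioning {
    partitions: usize,
}

impl PhysicalPartitioning for RangePartitioning {
    fn partition_count(&self) -> usize {
        self.upper_bounds.len()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool {
        // Compatible only with another range partitioning with equal bounds.
        other
            .as_any()
            .downcast_ref::<RangePartitioning>()
            .map_or(false, |o| o.upper_bounds == self.upper_bounds)
    }
}

impl PhysicalPartitioning for HashPartitioning {
    fn partition_count(&self) -> usize {
        self.partitions
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn is_compatible_with(&self, other: &dyn PhysicalPartitioning) -> bool {
        // Assumes both sides use the same hash function and key columns.
        other
            .as_any()
            .downcast_ref::<HashPartitioning>()
            .map_or(false, |o| o.partitions == self.partitions)
    }
}
```

With this shape, a range-partitioned source and a `Hash([key], 3)` source both report three partitions but are never reported as compatible with each other, which is the distinction the current enum cannot express for our sources.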
This would give us a cleaner foundation for:
- range/file partitioned sources
- partition-local dynamic filters when both sides share the same partition
map #21832
Does this direction seem interesting to others in the community, and are
there any thoughts on this proposal?
cc: @NGA-TRAN @alamb @jayshrivastava @gabotechs @adriangb