gene-bordegaray commented on issue #21207:
URL: https://github.com/apache/datafusion/issues/21207#issuecomment-4366205221

   > [@gene-bordegaray](https://github.com/gene-bordegaray) -- I started 
reading this doc, but I didn't really understand how "Key Partitioned" would be 
represented -- specifically, what expression would be used to
   > 
   > 1. determine which partition any particular row belongs to
   > 2. determine how many partitions there are (at plan time, without access 
to the data)
   > 
   > Hash partitioning satisfies both, with a parameter `N` (the number of 
partitions)
   > 
   > 1. row's partition is `hash(cols) % N`
   > 2. The number of partitions is `N` (not dependent on the data)
   > 
   > In your example of monthly partitioned data, how would it be represented 
in a general form?
   > 
   > Spark has some `DISTRIBUTE BY` clause 
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html
 but that doesn't seem to be able to provide the number of partitions up front
   > 
   > Also, are you trying to design a system that can do fancy things with 
hierarchical partitioning (e.g. if data is partitioned by day, then by combining 
the right partitions you could join it with data partitioned by month)?
   
   `KeyPartitioned` was admittedly too vague; it was mainly a placeholder to get questions like these started. For the monthly/range case, I would not model it as `DISTRIBUTE BY` to start, since that seems to be a query-time way of grouping data. The path I am imagining is to provide the `FileScanConfig` / `DataSourceExec` with the physical partition metadata needed to make optimizations:
   
   ```text
   Partitioning::Range {
       exprs: [date_col],
       ranges: [
           p0: [2026-01-01, 2026-02-01),
           p1: [2026-02-01, 2026-03-01),
           p2: [2026-03-01, 2026-04-01),
       ]
   }
   ```
   
   Then:
   1. A row belongs to the partition whose range contains its `date_col` value.
   2. The number of partitions is known at plan time from the length of the range list (we could/should add some way of combining partitions and calculating the count from that).
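
   A rough sketch of those two properties in Rust, using an illustrative `RangePartitioning` struct (not a DataFusion type) with `i64` values standing in for the date column:

   ```rust
   /// Hypothetical sketch, not DataFusion API: a range partitioning
   /// described by sorted, non-overlapping half-open ranges [start, end).
   #[derive(Debug, Clone)]
   struct RangePartitioning {
       /// One half-open range per partition, sorted by start.
       /// i64 (e.g. days since epoch) stands in for a date column value.
       ranges: Vec<(i64, i64)>,
   }

   impl RangePartitioning {
       /// The partition count is known at plan time: it is just the
       /// length of the range list, independent of the data.
       fn partition_count(&self) -> usize {
           self.ranges.len()
       }

       /// A row belongs to the partition whose range contains its value,
       /// found here by binary search over the sorted range starts.
       fn partition_for(&self, value: i64) -> Option<usize> {
           let idx = self.ranges.partition_point(|&(start, _)| start <= value);
           if idx == 0 {
               return None;
           }
           let (start, end) = self.ranges[idx - 1];
           (value >= start && value < end).then_some(idx - 1)
       }
   }

   fn main() {
       // Three "monthly" partitions, with month starts as day offsets.
       let p = RangePartitioning {
           ranges: vec![(0, 31), (31, 59), (59, 90)],
       };
       assert_eq!(p.partition_count(), 3);
       assert_eq!(p.partition_for(15), Some(0)); // mid-January -> p0
       assert_eq!(p.partition_for(31), Some(1)); // boundary goes to p1 (half-open)
       assert_eq!(p.partition_for(100), None); // outside all ranges
   }
   ```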
   
   So that metadata would come from the data source/catalog, not from scanning. If the source doesn't provide a partition definition, then (at least for a first iteration) I don't think it should advertise the partitioning. I could see a future where a `DISTRIBUTE BY` clause can effectively repartition / group the data by the given column or schema.
   
   I am not trying to solve hierarchical partition compatibility in the first version. For example, grouping day partitions into month partitions seems useful eventually, but I think the first iteration should be simpler and skip that kind of compatibility / partition mapping. That will be a strong baseline to build on, and enough to stop the false advertising we do today when we report range-partitioned data as Hash.
   
   I opened #21992 to move this partitioning design out of the dynamic-filter 
thread. My current thinking is to start with range partitioning as the concrete 
case, and to model it so that partitioning can later become customizable, e.g. 
via a `PhysicalPartitioning` trait.
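
   To make that last idea concrete, here is a hypothetical sketch of what such a trait could look like; neither the trait nor these method names exist in DataFusion today, they only illustrate the shape of the idea:

   ```rust
   use std::fmt::Debug;

   /// Hypothetical trait sketch: the properties the planner would need
   /// from any partitioning scheme, hash, range, or custom.
   trait PhysicalPartitioning: Debug {
       /// Number of output partitions, known at plan time.
       fn partition_count(&self) -> usize;

       /// Whether rows that agree on `keys` are guaranteed to land in
       /// the same partition (the property joins/aggregations care about).
       fn co_locates(&self, keys: &[&str]) -> bool;
   }

   /// Illustrative implementation for a range scheme on one column.
   #[derive(Debug)]
   struct MonthlyRange {
       key: String,
       num_ranges: usize,
   }

   impl PhysicalPartitioning for MonthlyRange {
       fn partition_count(&self) -> usize {
           self.num_ranges
       }

       fn co_locates(&self, keys: &[&str]) -> bool {
           // Range partitioning on `key` co-locates rows that agree on it.
           keys.len() == 1 && keys[0] == self.key
       }
   }

   fn main() {
       let p = MonthlyRange { key: "date_col".into(), num_ranges: 3 };
       assert_eq!(p.partition_count(), 3);
       assert!(p.co_locates(&["date_col"]));
       assert!(!p.co_locates(&["other_col"]));
   }
   ```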
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to