gene-bordegaray commented on issue #21207: URL: https://github.com/apache/datafusion/issues/21207#issuecomment-4366205221
> [@gene-bordegaray](https://github.com/gene-bordegaray) -- I started reading this doc, but I didn't really understand how "Key Partitioned" would be represented -- specifically, what expression would be used to
>
> 1. determine which partition any particular row belongs to
> 2. determine how many partitions there are (at plan time, without access to the data)
>
> Hash partitioning satisfies both, with a parameter `N` (the number of partitions)
>
> 1. a row's partition is `hash(cols) % N`
> 2. the number of partitions is `N` (not dependent on the data)
>
> In your example of monthly partitioned data, how would it be represented in a general form?
>
> Spark has a `DISTRIBUTE BY` clause (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html), but that doesn't seem to be able to provide the number of partitions up front.
>
> Also, are you trying to design a system that can do fancy things with hierarchical partitioning (e.g. if data is partitioned by day, then by combining the right partitions you could join it with data partitioned by month)?

`KeyPartitioned` was too vague and mostly something to get questions like these started. For the monthly/range case, I would not model it as `DISTRIBUTE BY` to start, since that seems to be a query-time way to group data in a certain fashion. The path I am imagining is providing the `FileScanConfig` / `DataSourceExec` with the physical partition metadata needed to make optimizations as needed:

```text
Partitioning::Range {
    exprs: [date_col],
    ranges: [
        p0: [2026-01-01, 2026-02-01),
        p1: [2026-02-01, 2026-03-01),
        p2: [2026-03-01, 2026-04-01),
    ]
}
```

Then:

1. A row belongs to the partition whose range contains `date_col`.
2. The number of partitions is known at plan time from the length of the range list (we could/should add some way of combining partitions and calculating the count from this).

So that information would come from the data source/catalog, not from scanning the data.
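To make the two properties concrete (row-to-partition mapping and a plan-time partition count), here is a minimal, hypothetical Rust sketch of a range-based partitioning description. Plain `(year, month, day)` tuples stand in for real Arrow scalar values and physical expressions; none of these type or method names are existing DataFusion API:

```rust
// Hypothetical sketch of range partitioning metadata. A real version
// would hold physical expressions and Arrow scalars, not plain dates.

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Date(i32, u32, u32); // (year, month, day); Ord is lexicographic

/// Half-open ranges [start, end), one per physical partition.
struct RangePartitioning {
    ranges: Vec<(Date, Date)>,
}

impl RangePartitioning {
    /// Number of partitions, known at plan time from the range list
    /// alone -- no data access required.
    fn partition_count(&self) -> usize {
        self.ranges.len()
    }

    /// Index of the partition whose range contains `value`, if any.
    fn partition_for(&self, value: Date) -> Option<usize> {
        self.ranges
            .iter()
            .position(|(start, end)| *start <= value && value < *end)
    }
}

fn main() {
    let p = RangePartitioning {
        ranges: vec![
            (Date(2026, 1, 1), Date(2026, 2, 1)), // p0
            (Date(2026, 2, 1), Date(2026, 3, 1)), // p1
            (Date(2026, 3, 1), Date(2026, 4, 1)), // p2
        ],
    };
    println!("{}", p.partition_count());                  // 3
    println!("{:?}", p.partition_for(Date(2026, 2, 15))); // Some(1)
    println!("{:?}", p.partition_for(Date(2025, 12, 31))); // None
}
```

Note the `None` case: unlike `hash(cols) % N`, a range list is not necessarily total over the domain, which is one design question a real version would have to answer (reject, or add an implicit overflow partition).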
If the source doesn't provide a partition definition, then (at least for a first iteration) I don't think it should advertise the partitioning. I could see a future where we have a `DISTRIBUTE BY` clause that can effectively repartition / group the data by the given column or schema.

I am not trying to solve hierarchical partition compatibility in the first version. For example, "day partitions can be grouped into month partitions" seems useful eventually, but I think the first iteration should be simpler: compatibility checks and partition mapping. That will be a strong baseline to build on and enough to avoid the false advertisement we are currently doing by reporting range-partitioned data as Hash.

I opened #21992 to move this partitioning design out of the dynamic-filter thread. My current thinking is that we start with range partitioning as the concrete case, and model it around a design where partitioning becomes customizable later, e.g. via a `PhysicalPartitioning` trait.
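As a rough illustration of the customization idea, here is a hypothetical sketch of what a pluggable `PhysicalPartitioning` trait could look like. The trait name comes from the discussion above, but every signature here is an assumption for illustration, not existing DataFusion API:

```rust
// Hypothetical trait: each partitioning scheme reports a plan-time
// partition count; hash and range become two implementations.

trait PhysicalPartitioning {
    /// Number of partitions, known at plan time without reading data.
    fn partition_count(&self) -> usize;

    /// Short name for plan display / debugging.
    fn name(&self) -> &str;
}

/// The existing hash case: `N` is a plan-time parameter.
struct HashPartitioning {
    n: usize,
}

impl PhysicalPartitioning for HashPartitioning {
    fn partition_count(&self) -> usize {
        self.n
    }
    fn name(&self) -> &str {
        "hash"
    }
}

/// The range case: the count falls out of the range list itself.
struct RangePartitioning {
    // (start, end) bounds as opaque strings, standing in for scalars.
    ranges: Vec<(String, String)>,
}

impl PhysicalPartitioning for RangePartitioning {
    fn partition_count(&self) -> usize {
        self.ranges.len()
    }
    fn name(&self) -> &str {
        "range"
    }
}

fn main() {
    // The planner could treat all schemes uniformly as trait objects.
    let parts: Vec<Box<dyn PhysicalPartitioning>> = vec![
        Box::new(HashPartitioning { n: 4 }),
        Box::new(RangePartitioning {
            ranges: vec![
                ("2026-01-01".into(), "2026-02-01".into()),
                ("2026-02-01".into(), "2026-03-01".into()),
            ],
        }),
    ];
    for p in &parts {
        println!("{}: {} partitions", p.name(), p.partition_count());
    }
}
```

A compatibility/mapping method (the "first iteration" scope mentioned above) would presumably be added to this trait as well, but its shape depends on design decisions not yet settled in this thread.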
