alamb commented on issue #21207: URL: https://github.com/apache/datafusion/issues/21207#issuecomment-4319968042
> Note: There is an in-depth PDF attached to the PR explaining the partitioning and why it is needed in comparison to `Hash`, if anyone would like a deep explanation of why some new notion of partitioning is needed. Here it is: https://github.com/user-attachments/files/23805659/Issue.18777.Parallelize.Key.Partitioned.Data.pdf

@gene-bordegaray -- I started reading this doc, but I didn't really understand how "Key Partitioned" would be represented. Specifically, what expression would be used to:

1. determine which partition any particular row belongs to
2. determine how many partitions there are (at plan time, without access to the data)

Hash partitioning satisfies both, with a parameter `N` (the number of partitions):

1. A row's partition is `hash(cols) % N`
2. The number of partitions is `N` (not dependent on the data)

In your example of monthly partitioned data, how would it be represented in a general form?

Spark has a `DISTRIBUTE BY` clause (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html), but that doesn't seem to be able to provide the number of partitions up front.

Also, are you trying to design a system that can do fancy things with hierarchical partitioning (e.g. if data is partitioned by day, then by combining the right partitions you could join it with data partitioned by month)?
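
For readers following along, the two properties asked about above can be illustrated with a minimal sketch. This is not DataFusion code: `hash_partition`, `month_key`, the sample rows, and the use of `zlib.crc32` as a stand-in hash are all assumptions made for illustration. It shows that hash partitioning answers both questions from `N` alone, while a month-based scheme (one possible reading of "key partitioned") only yields a partition count after the distinct keys are known.

```python
import zlib


def hash_partition(key_cols: tuple, n: int) -> int:
    """Partition index under hash partitioning: hash(cols) % N."""
    # crc32 is only a deterministic stand-in for an engine's real hash
    # function (an assumption made so the example is reproducible).
    digest = zlib.crc32(repr(key_cols).encode("utf-8"))
    return digest % n


def month_key(date: str) -> str:
    """Hypothetical key function: map an ISO date to its 'YYYY-MM' month."""
    return date[:7]


# Hypothetical sample rows: (date, value).
rows = [
    ("2024-01-15", 10),
    ("2024-01-20", 7),
    ("2024-02-03", 5),
    ("2024-03-09", 1),
]

N = 4  # chosen at plan time, independent of the data

# Question 1: a row's partition is computed from the row and N alone.
hash_assignment = [hash_partition((date,), N) for date, _ in rows]

# Question 2: the partition count is simply N.
hash_partition_count = N

# Month-based partitioning: a row's partition is its month key, but the
# number of partitions depends on which months actually occur, so it must
# come from the data or from external metadata rather than a parameter.
month_partitions = sorted({month_key(date) for date, _ in rows})
month_partition_count = len(month_partitions)
```

In this sketch the month-based count is derived by scanning the rows, which is exactly the step that is unavailable at plan time unless some catalog or file-layout metadata supplies the set of keys.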
