alamb commented on issue #21207:
URL: https://github.com/apache/datafusion/issues/21207#issuecomment-4319968042

   > Note: There is an in-depth PDF I attached to the PR explaining the 
partitioning and why it is needed in comparison to `Hash` if anyone would like 
a deep explanation of why some new notion of partitioning is needed. Here it is 
https://github.com/user-attachments/files/23805659/Issue.18777.Parallelize.Key.Partitioned.Data.pdf
   
   @gene-bordegaray -- I started reading this doc, but I didn't really 
understand how "Key Partitioned" would be represented -- specifically, what 
expression would be used to:
   1. determine which partition any particular row belongs to
   2. determine how many partitions there are (at plan time, without access to 
the data)
   
   Hash partitioning satisfies both, with a single parameter `N` (the number 
of partitions):
   1. A row's partition is `hash(cols) % N`
   2. The number of partitions is `N` (not dependent on the data)
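
   To make the two properties concrete, here is a minimal sketch of hash 
partitioning. It is illustrative only: `hash_partition` is a hypothetical 
helper, and `zlib.crc32` stands in for whatever hash function the engine 
actually uses.

```python
import zlib

def hash_partition(key: tuple, n: int) -> int:
    """Return the partition index in [0, n) for a row's key columns."""
    # crc32 is an assumption for illustration, not the engine's real hash.
    digest = zlib.crc32(repr(key).encode("utf-8"))
    return digest % n

# The partition count is a plan-time constant; no data is needed to know it.
N = 4

# Hypothetical rows of (date string, payload).
rows = [("2024-01-15", "a"), ("2024-02-03", "b"), ("2024-01-15", "c")]

# Property 1: each row's partition is a pure function of its key and N.
partitions = [hash_partition((day,), N) for day, _ in rows]
```

Rows with equal keys always land in the same partition, and `N` is known 
before any row is read. A "Key Partitioned" scheme would need an equivalent 
answer for both properties.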
   
   In your example of monthly partitioned data, how would it be represented in 
a general form?
   
   Spark has a `DISTRIBUTE BY` clause 
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html
   but it doesn't seem to be able to provide the number of partitions up front.
   
   Also, are you trying to design a system that can do fancy things with 
hierarchical partitioning (e.g. if data is partitioned by day, then by combining 
the right partitions you could join it with data partitioned by month)?
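
   As a rough sketch of what I mean by combining partitions (the function 
names and the use of `datetime.date` as a partition key are assumptions for 
illustration, not anything taken from the doc), day-level partitions could be 
coalesced into month-level groups like this:

```python
from collections import defaultdict
from datetime import date

def month_key(day: date) -> tuple[int, int]:
    """Coarsen a day-level partition key to its (year, month) key."""
    return (day.year, day.month)

def group_days_by_month(day_partitions: list[date]) -> dict[tuple[int, int], list[date]]:
    """Group day partitions so each group lines up with one month partition."""
    groups = defaultdict(list)
    for day in sorted(day_partitions):
        groups[month_key(day)].append(day)
    return dict(groups)

# Hypothetical day-partitioned input spanning a month boundary.
day_partitions = [date(2024, 1, 30), date(2024, 1, 31), date(2024, 2, 1)]
groups = group_days_by_month(day_partitions)
```

Each group could then be joined against the single month partition with the 
same key, but only if the planner knows the day-to-month mapping at plan time.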


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

