Re: [I] Support partitioned writes [iceberg-python]

via GitHub Thu, 25 Jan 2024 06:10:27 -0800


Fokko commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1910290987


   Hey @jqin61
   
   Thanks for the elaborate post, and sorry for my slow reply. I did want to 
take the time to write a good answer.
   
   Probably the following statement needs another map step:
   
   ```python
   partitions: list[dict] = pyarrow.compute.unique(arrow_table)
   ```
   
   The above is true for an identity partition, but often we truncate a field 
to the month, day or hour and use that as the partition. Another example is the 
bucketing partition, where we hash the field and determine which bucket it 
falls into.
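   
   To make the extra map step concrete, here is a minimal sketch in plain 
Python rather than Arrow. The helper names (`day_transform`, `month_transform`, 
`hour_transform`, `group_by_partition`) are made up for illustration and are not 
PyIceberg or PyArrow APIs. It assumes the time transforms produce days, months 
and hours since the Unix epoch, and it leaves out bucketing entirely.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_transform(ts: datetime) -> int:
    """Months since 1970-01 (hypothetical helper)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 (hypothetical helper)."""
    return (ts - EPOCH).days


def hour_transform(ts: datetime) -> int:
    """Hours since 1970-01-01T00:00 (hypothetical helper)."""
    return int((ts - EPOCH).total_seconds() // 3600)


def group_by_partition(rows, source_field, transform):
    """Map every row to its partition value first, then group on that value."""
    groups = defaultdict(list)
    for row in rows:
        groups[transform(row[source_field])].append(row)
    return dict(groups)


rows = [
    {"id": 1, "ts": datetime(2024, 1, 25, 6, 10, tzinfo=timezone.utc)},
    {"id": 2, "ts": datetime(2024, 1, 25, 23, 59, tzinfo=timezone.utc)},
    {"id": 3, "ts": datetime(2024, 2, 1, 0, 0, tzinfo=timezone.utc)},
]

# Three distinct timestamps, but only two day partitions and two month partitions.
by_day = group_by_partition(rows, "ts", day_transform)
by_month = group_by_partition(rows, "ts", month_transform)
```

   Taking the unique values of `ts` directly would yield three partitions here, 
which is why the transform has to be applied before looking for unique values.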
   
   With regard to utilizing the Arrow primitives that are already there: I 
think that's a great idea, we just have to make sure that they are flexible 
enough for Iceberg. There are a couple of questions that pop into my mind:
   
   - Can we support all of Iceberg's partition strategies, such as bucketing, 
truncating, etc.?
   - Are we able to extract the metrics, similar to what we do for 
non-partitioned writes?
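   
   As a rough illustration of the metrics question, the sketch below computes 
a few per-column metrics (value count, null count, lower and upper bound) 
separately for each partition. It works on plain Python rows purely for 
illustration. The names `column_metrics` and `metrics_per_partition` are 
hypothetical, and this is not how PyIceberg gathers metrics today.

```python
from collections import defaultdict


def column_metrics(rows, column):
    """Collect simple metrics for one column over a list of rows."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "value_count": len(values),
        "null_value_count": len(values) - len(non_null),
        "lower_bound": min(non_null) if non_null else None,
        "upper_bound": max(non_null) if non_null else None,
    }


def metrics_per_partition(rows, partition_column, metric_column):
    """Group rows by partition value, then compute the metrics per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_column]].append(row)
    return {key: column_metrics(group, metric_column) for key, group in groups.items()}


rows = [
    {"region": "eu", "amount": 10},
    {"region": "eu", "amount": None},
    {"region": "us", "amount": 7},
    {"region": "us", "amount": 3},
]

metrics = metrics_per_partition(rows, "region", "amount")
```

   Whatever primitive ends up splitting the data, each resulting file would 
need its own set of these numbers rather than one set for the whole table.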
   
   @asheeshgarg Thanks for giving it a try. Looking at the schema, there is a 
discrepancy. The test data that you generate has `value_1` as an int64, and the 
table expects a string. I think the error is correct here.
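   
   A quick way to spot this kind of mismatch before writing is to compare the 
generated data against the expected types. The sketch below uses plain Python 
types as a stand-in for the Arrow and Iceberg schemas. `find_type_mismatches` 
is a hypothetical helper, and only the `value_1` column is taken from the 
thread.

```python
def find_type_mismatches(columns, expected_types):
    """Return (column, expected type name, actual type name) for each mismatch."""
    mismatches = []
    for name, expected in expected_types.items():
        for value in columns.get(name, []):
            if value is not None and not isinstance(value, expected):
                mismatches.append((name, expected.__name__, type(value).__name__))
                break  # one example per column is enough
    return mismatches


# Stand-in for the generated test data: value_1 holds integers.
test_data = {"value_1": [1, 2, 3]}
# Stand-in for the table schema: value_1 is expected to be a string.
table_types = {"value_1": str}

problems = find_type_mismatches(test_data, table_types)

# Converting the values up front makes the data line up with the table.
fixed_data = {"value_1": [str(v) for v in test_data["value_1"]]}
```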
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

