asheeshgarg commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912738795

   @Fokko @syun64 Another option I can think of is to use Polars for this. A simple example is below: it hashes the partition column, splits the data into partitions, and sorts within each partition. All of the partitioning is handled by Polars' Rust layer, and we write Parquet from the Arrow tables it returns.
   I'm not sure whether we want to add it as a dependency. We could also easily implement custom transforms, like the hour transform we have in Iceberg (see the sketch after the example).
   
   import pyarrow as pa
   import polars as pl
   
   # Sample Arrow table with a partition column ("strings") and a sort column ("ints").
   t = pa.table({"strings": ["A", "A", "B", "A"], "ints": [2, 1, 3, 4]})
   df = pl.from_arrow(t)
   
   # Bucket rows by hashing the partition column into N buckets,
   # then split the frame into one DataFrame per bucket.
   N = 2
   tables = (
       df.with_columns((pl.col("strings").hash() % N).alias("partition_id"))
         .partition_by("partition_id")
   )
   
   # Convert each partition back to Arrow and sort within the partition.
   for tbl in tables:
       print(tbl.to_arrow().sort_by("ints"))
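   
   Following up on the custom-transform point: below is a minimal sketch of how an hour-style partitioning plus per-partition Parquet writes could look. The timestamp column name `ts` and the output directory `/tmp/out` are hypothetical, and Polars' `dt.truncate("1h")` is only a stand-in for the kind of hour transform Iceberg defines, not PyIceberg's transform implementation.
   
   import os
   from datetime import datetime
   
   import polars as pl
   import pyarrow.parquet as pq
   
   # Hypothetical input: a timestamp column "ts" plus a value column.
   df = pl.DataFrame({
       "ts": [datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 45),
              datetime(2024, 1, 1, 11, 15), datetime(2024, 1, 2, 10, 30)],
       "ints": [2, 1, 3, 4],
   })
   
   # Hour-style partition key computed by Polars (Rust engine):
   # truncate each timestamp to the hour it falls in.
   partitions = (
       df.with_columns(pl.col("ts").dt.truncate("1h").alias("partition_hour"))
         .partition_by("partition_hour")
   )
   
   # Write one Parquet file per partition from the Arrow table Polars returns,
   # sorted within the partition.
   os.makedirs("/tmp/out", exist_ok=True)
   for part in partitions:
       arrow_tbl = part.to_arrow().sort_by("ints")
       hour_key = part["partition_hour"][0].strftime("%Y-%m-%d-%H")
       pq.write_table(arrow_tbl, f"/tmp/out/partition_hour={hour_key}.parquet")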


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

