Re: [I] Support partitioned writes [iceberg-python]

via GitHub Wed, 24 Jan 2024 13:36:28 -0800


asheeshgarg commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1908955580


   @Fokko @jqin61 
   Today I tried a basic example of a partitioned write:
   import pyarrow as pa
   import pyarrow.dataset  # needed so that pa.dataset is available
   from pyarrow import parquet as pq
   from pyiceberg.io.pyarrow import schema_to_pyarrow
   from pyiceberg.schema import Schema
   from pyiceberg.types import NestedField, StringType

   data = {'key': ['001', '001', '002', '002'],
           'value_1': [10, 20, 100, 200],
           'value_2': ['a', 'b', 'a', 'b']}
   # Hive-style partitioning on the "key" column
   my_partitioning = pa.dataset.partitioning(
       pa.schema([pa.field("key", pa.string())]), flavor='hive')
   TABLE_SCHEMA = Schema(
       NestedField(field_id=1, name="key", field_type=StringType(), required=False),
       NestedField(field_id=2, name="value_1", field_type=StringType(), required=False),
       NestedField(field_id=3, name="value_2", field_type=StringType(), required=False),
   )
   # Convert the Iceberg schema to a PyArrow schema (adds PARQUET:field_id metadata)
   schema = schema_to_pyarrow(TABLE_SCHEMA)
   patbl = pa.Table.from_pydict(data)
   pq.write_to_dataset(patbl, 'partitioned_data', partitioning=my_partitioning, schema=schema)
   
   If I omit the schema argument in the write, it works fine. But if I pass
   the schema created with schema = schema_to_pyarrow(TABLE_SCHEMA),
   it fails with:
   ArrowTypeError: Item has schema
   key: string
   value_1: int64
   value_2: string
   which does not match expected schema
   key: string
     -- field metadata --
     PARQUET:field_id: '1'
   value_1: string
     -- field metadata --
     PARQUET:field_id: '2'
   value_2: string
     -- field metadata --
     PARQUET:field_id: '3'
   
   
   I also tried the Parquet write the way we are currently doing it:
   writer = pq.ParquetWriter("test", schema=schema, version="1.0") 
   writer.write_table(patbl)
   ValueError: Table schema does not match schema used to create file: 
   table:
   key: string
   value_1: int64
   value_2: string vs. 
   file:
   key: string
     -- field metadata --
     PARQUET:field_id: '1'
   value_1: string
     -- field metadata --
     PARQUET:field_id: '2'
   value_2: string
     -- field metadata --
     PARQUET:field_id: '3'
   
   Do we apply any other transformation to the schema before writing in the
   current write support?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


