asheeshgarg commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1908955580
@Fokko @jqin61
Today I tried a basic example of a partitioned write:
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available
from pyarrow import parquet as pq

from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

data = {
    'key': ['001', '001', '002', '002'],
    'value_1': [10, 20, 100, 200],
    'value_2': ['a', 'b', 'a', 'b'],
}

# Hive-style partitioning on the "key" column
my_partitioning = pa.dataset.partitioning(
    pa.schema([pa.field("key", pa.string())]), flavor='hive'
)

TABLE_SCHEMA = Schema(
    NestedField(field_id=1, name="key", field_type=StringType(), required=False),
    NestedField(field_id=2, name="value_1", field_type=StringType(), required=False),
    NestedField(field_id=3, name="value_2", field_type=StringType(), required=False),
)
schema = schema_to_pyarrow(TABLE_SCHEMA)
patbl = pa.Table.from_pydict(data)
pq.write_to_dataset(patbl, 'partitioned_data', partitioning=my_partitioning, schema=schema)
If I don't pass a schema to the write, it works fine. But if I pass the schema
created with schema = schema_to_pyarrow(TABLE_SCHEMA), it fails with:
ArrowTypeError: Item has schema
key: string
value_1: int64
value_2: string
which does not match expected schema
key: string
-- field metadata --
PARQUET:field_id: '1'
value_1: string
-- field metadata --
PARQUET:field_id: '2'
value_2: string
-- field metadata --
PARQUET:field_id: '3'
I also tried the Parquet write the way we are currently doing it:
writer = pq.ParquetWriter("test", schema=schema, version="1.0")
writer.write_table(patbl)
ValueError: Table schema does not match schema used to create file:
table:
key: string
value_1: int64
value_2: string vs.
file:
key: string
-- field metadata --
PARQUET:field_id: '1'
value_1: string
-- field metadata --
PARQUET:field_id: '2'
value_2: string
-- field metadata --
PARQUET:field_id: '3'
Do we apply any other transformation to the schema before writing in the
current write support?