cfrancois7 commented on issue #1100: URL: https://github.com/apache/iceberg-python/issues/1100#issuecomment-2311815735
@ndrluis You are right! I tested with the PyIceberg schema and it works.

I remembered why I used the Arrow schema: it is because of the type and required/optional alignment between the data I want to append and the expected schema. When I use the PyIceberg schema, I also need to declare the PyArrow schema. I did not find a parser that can do the job in one line, so I need to either build one or write the schema twice (PyIceberg + PyArrow), which is harder to maintain.

For instance, the following code raises an error:

```
from datetime import datetime

import pyarrow as pa
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import (
    BooleanType,
    FloatType,
    IntegerType,
    NestedField,
    TimestampType,
)

# `catalog` is the catalog object already loaded earlier in this thread.

ts_schema = Schema(
    NestedField(field_id=1, name="timestamp", field_type=TimestampType(), required=True),
    NestedField(field_id=2, name="campaign_id", field_type=IntegerType(), required=True),
    NestedField(field_id=3, name="temperature", field_type=FloatType(), required=False),
    NestedField(field_id=4, name="pressure", field_type=FloatType(), required=False),
    NestedField(field_id=5, name="humidity", field_type=IntegerType(), required=False),
    NestedField(field_id=6, name="led_0", field_type=BooleanType(), required=False),
)

# Define partitioning spec for campaign_id
ts_partition_spec = PartitionSpec(
    PartitionField(
        field_id=2, source_id=2, transform=IdentityTransform(), name="campaign_id"
    )
)

ts_table = catalog.create_table_if_not_exists(
    "pieuvre.time_series",
    schema=ts_schema,
    partition_spec=ts_partition_spec,
    location="local_s3",
)

ts_dict = {
    "timestamp": [
        datetime(2023, 1, 1, 12, 0),
        datetime(2023, 1, 1, 13, 0),
        datetime(2023, 1, 1, 15, 0),
        datetime(2023, 1, 1, 16, 0),
        datetime(2023, 1, 1, 12, 0),
        datetime(2023, 1, 1, 12, 0),
    ],
    "campaign_id": [1, 1, 1, 1, 2, 2],
    "temperature": [21.0, 21.5, 21.8, 21.0, 22.0, 24.5],
    "pressure": [1012.0, 1015.0, 1030.0, 1016.0, 1508.0, 1498.0],
    "humidity": [2, 5, 5, 5, 5, 5],
    "led_0": [0, 0, 0, 1, 0, 1],
}

# No explicit schema here, so PyArrow infers the column types from the values.
ts_df = pa.Table.from_pydict(ts_dict)
ts_table.append(ts_df)  # <= RAISES AN ERROR; I need to pass the PyArrow schema to make it work
```

```
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field                      ┃ Dataframe field                  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: timestamp: required timestamp │ 1: timestamp: optional timestamp │
│ ❌ │ 2: campaign_id: required int     │ 2: campaign_id: optional long    │
│ ❌ │ 3: temperature: optional float   │ 3: temperature: optional double  │
│ ❌ │ 4: pressure: optional float      │ 4: pressure: optional double     │
│ ❌ │ 5: humidity: optional int        │ 5: humidity: optional long       │
│ ❌ │ 6: led_0: optional boolean       │ 6: led_0: optional long          │
└────┴──────────────────────────────────┴──────────────────────────────────┘
```