Re: [I] I do not understand the partition error: ValueError: Could not find in old schema: 2: {field}: identity(2) [iceberg-python]

via GitHub Tue, 27 Aug 2024 00:53:45 -0700


cfrancois7 commented on issue #1100:
URL: https://github.com/apache/iceberg-python/issues/1100#issuecomment-2311815735


   @ndrluis 
   You are right! I tested with PyIceberg schema and it works.
   
   I remembered why I used the Arrow schema: it keeps the types and the 
required/optional flags of the data I want to append aligned with the expected schema.
   With the PyIceberg schema, I also need to declare the PyArrow schema.
   I did not find a parser that can do the conversion in one line.
   I would need to build one or write the schema twice (PyIceberg + PyArrow),
   which is harder to maintain.
   
   For instance, the following code raises an error:
   ```
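   from datetime import datetime

   import pyarrow as pa
   from pyiceberg.partitioning import PartitionField, PartitionSpec
   from pyiceberg.schema import Schema
   from pyiceberg.transforms import IdentityTransform
   from pyiceberg.types import (
       BooleanType, FloatType, IntegerType, NestedField, TimestampType,
   )

   # `catalog` below is assumed to be an already-loaded PyIceberg catalog.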
   
   ts_schema = Schema(
       NestedField(field_id=1, name="timestamp", field_type=TimestampType(), required=True),
       NestedField(field_id=2, name="campaign_id", field_type=IntegerType(), required=True),
       NestedField(field_id=3, name="temperature", field_type=FloatType(), required=False),
       NestedField(field_id=4, name="pressure", field_type=FloatType(), required=False),
       NestedField(field_id=5, name="humidity", field_type=IntegerType(), required=False),
       NestedField(field_id=6, name="led_0", field_type=BooleanType(), required=False)
   )
   
   # Define the partition spec on campaign_id
   ts_partition_spec = PartitionSpec(
       PartitionField(
           field_id=2,
           source_id=2,
           transform=IdentityTransform(), 
           name="campaign_id"
       )
   )
   
   ts_table = catalog.create_table_if_not_exists(
       'pieuvre.time_series',
       schema=ts_schema,
       partition_spec=ts_partition_spec,
       location = "local_s3"
   )
   
   ts_dict = {
       'timestamp': [
           datetime(2023, 1, 1, 12, 0),
           datetime(2023, 1, 1, 13, 0),
           datetime(2023, 1, 1, 15, 0),
           datetime(2023, 1, 1, 16, 0),
           datetime(2023, 1, 1, 12, 0),
           datetime(2023, 1, 1, 12, 0)
       ],
       'campaign_id': [1, 1, 1, 1, 2, 2],
       'temperature': [21.0, 21.5, 21.8, 21.0, 22.0, 24.5],
       'pressure': [1012.0, 1015.0, 1030.0, 1016.0, 1508.0, 1498.0],
       'humidity': [2, 5, 5, 5, 5, 5],
       'led_0': [0, 0, 0, 1, 0, 1]
   }
   
   
   # ts_dict maps column names to value lists, so build the table with from_pydict
   ts_df = pa.Table.from_pydict(ts_dict)
   ts_table.append(ts_df)  # <= RAISES AN ERROR: I need to pass the PyArrow schema to make it work
   ```
   
   ```
   ┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
   ┃    ┃ Table field                      ┃ Dataframe field                  ┃
   ┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
   │ ❌ │ 1: timestamp: required timestamp │ 1: timestamp: optional timestamp │
   │ ❌ │ 2: campaign_id: required int     │ 2: campaign_id: optional long    │
   │ ❌ │ 3: temperature: optional float   │ 3: temperature: optional double  │
   │ ❌ │ 4: pressure: optional float      │ 4: pressure: optional double     │
   │ ❌ │ 5: humidity: optional int        │ 5: humidity: optional long       │
   │ ❌ │ 6: led_0: optional boolean       │ 6: led_0: optional long          │
   └────┴──────────────────────────────────┴──────────────────────────────────┘
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


