kevinjqliu commented on issue #864:
URL: https://github.com/apache/iceberg-python/issues/864#issuecomment-2241733691

   Thanks for the pointer! It looks like this comes from the interaction between 
`CreateTableTransaction`, `StagedTable`, and the table updates. 
   I found a difference in implementation between Java and Python: the 
`initial_change` field in the table update classes is Python-specific, and it 
defaults to `False` for updates in `CommitTableRequest`. 
   
   ### Example
   I was able to capture the request objects that Spark sends to the REST 
catalog server. 
   
   For `CREATE OR REPLACE TABLE`, Spark sends multiple HTTP requests. The two 
relevant ones are:
   #### `POST /v1/namespaces/default/tables`
   Request: 
   ```
   CreateTableRequest(
       name='test_null_nan', 
       location=None, 
       schema=Schema(
           NestedField(field_id=0, name='idx', field_type=IntegerType(), required=False), 
           NestedField(field_id=1, name='col_numeric', field_type=FloatType(), required=False), 
           schema_id=0, 
           identifier_field_ids=[]
       ), 
       partition_spec=PartitionSpec(spec_id=0), 
       write_order=None, 
       stage_create=True, 
       properties={'owner': 'kevinliu'}
   )
   ```
   
   Response:
   ```
   LoadTableResult(
       metadata_location='s3://warehouse/rest/default.db/test_null_nan/metadata/00000-30f7e048-7033-4d95-a130-e5cd3683d2d1.metadata.json',
       metadata=TableMetadataV2(
           location='s3://warehouse/rest/default.db/test_null_nan',
           table_uuid=UUID('575c87cb-14c2-4480-b9de-d2f55f28d7d8'),
           last_updated_ms=1721514513906,
           last_column_id=2,
           schemas=[
               Schema(
                   NestedField(field_id=1, name='idx', field_type=IntegerType(), required=False),
                   NestedField(field_id=2, name='col_numeric', field_type=FloatType(), required=False),
                   schema_id=0,
                   identifier_field_ids=[]
               )
           ],
           current_schema_id=0,
           partition_specs=[
               PartitionSpec(spec_id=0)
           ],
           default_spec_id=0,
           last_partition_id=999,
           properties={'owner': 'kevinliu'},
           current_snapshot_id=None,
           snapshots=[],
           snapshot_log=[],
           metadata_log=[],
           sort_orders=[
               SortOrder(order_id=0)
           ],
           default_sort_order_id=0,
           refs={},
           format_version=2,
           last_sequence_number=0
       ),
       config={'owner': 'kevinliu'}
   )
   ```
   
   #### `POST /v1/namespaces/default/tables/test_null_nan` 
   ```
   CommitTableRequest(
       identifier=TableIdentifier(
           namespace=Namespace(root=['default']),
           name='test_null_nan'
       ),
       requirements=[
           AssertCreate(type='assert-create')
       ],
       updates=[
           AssignUUIDUpdate(
               action='assign-uuid',
               uuid=UUID('575c87cb-14c2-4480-b9de-d2f55f28d7d8')
           ),
           UpgradeFormatVersionUpdate(
               action='upgrade-format-version',
               format_version=2
           ),
           AddSchemaUpdate(
               action='add-schema',
               schema_=Schema(
                   NestedField(field_id=1, name='idx', field_type=IntegerType(), required=False),
                   NestedField(field_id=2, name='col_numeric', field_type=FloatType(), required=False),
                   schema_id=0,
                   identifier_field_ids=[]
               ),
               last_column_id=2,
               initial_change=False
           ),
           SetCurrentSchemaUpdate(
               action='set-current-schema',
               schema_id=-1
           ),
           AddPartitionSpecUpdate( # ValueError: Partition spec with id 0 already exists: []
               action='add-spec',
               spec=PartitionSpec(spec_id=0),
               initial_change=False
           ),
           SetDefaultSpecUpdate(
               action='set-default-spec',
               spec_id=-1
           ),
           AddSortOrderUpdate(
               action='add-sort-order',
               sort_order=SortOrder(order_id=0),
               initial_change=False
           ),
           SetDefaultSortOrderUpdate(
               action='set-default-sort-order',
               sort_order_id=-1
           ),
           SetLocationUpdate(
               action='set-location',
               location='s3://warehouse/rest/default.db/test_null_nan'
           ),
           SetPropertiesUpdate(
               action='set-properties',
               updates={'owner': 'kevinliu'}
           ),
           AddSnapshotUpdate(
               action='add-snapshot',
               snapshot=Snapshot(
                   snapshot_id=5611385456663920621,
                   parent_snapshot_id=None,
                   sequence_number=1,
                   timestamp_ms=1721514574649,
                   manifest_list='s3://warehouse/rest/default.db/test_null_nan/metadata/snap-5611385456663920621-1-5904fa53-d5f7-49c5-a86b-8df022d1dbdf.avro',
                   summary=Summary(
                       Operation.APPEND,
                       **{
                           'spark.app.id': 'local-1721514510400',
                           'added-data-files': '3',
                           'added-records': '3',
                           'added-files-size': '1919',
                           'changed-partition-count': '1',
                           'total-records': '3',
                           'total-files-size': '1919',
                           'total-data-files': '3',
                           'total-delete-files': '0',
                           'total-position-deletes': '0',
                           'total-equality-deletes': '0'
                       }
                   ),
                   schema_id=0
               )
           ),
           SetSnapshotRefUpdate(
               action='set-snapshot-ref',
               ref_name='main',
               type='branch',
               snapshot_id=5611385456663920621,
               max_ref_age_ms=None,
               max_snapshot_age_ms=None,
               min_snapshots_to_keep=None
           )
       ]
   )
   ```
   Note that `initial_change` defaults to `False` here because Spark does not 
send this field.
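
To make the failure mode concrete, here is a minimal toy model of what seems to be happening. It is not PyIceberg's code: `ToySpec`, `ToyAddSpecUpdate`, and `apply_add_spec` are hypothetical stand-ins, and the handling of `initial_change=True` (replacing the staged spec instead of appending) is an assumption made for illustration. It only shows how an `add-spec` update with `initial_change=False` can collide with the spec id 0 that the staged-create response already contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a partition spec; only the id matters here."""
    spec_id: int


@dataclass(frozen=True)
class ToyAddSpecUpdate:
    """Hypothetical stand-in for an add-spec update."""
    spec: ToySpec
    # Spark's request has no such field, so deserialization leaves the default.
    initial_change: bool = False


def apply_add_spec(existing: List[ToySpec], update: ToyAddSpecUpdate) -> List[ToySpec]:
    """Apply an add-spec update to the list of specs already in the metadata."""
    if update.initial_change:
        # Assumed behaviour: the first change of a create transaction
        # replaces whatever the staged metadata holds.
        return [update.spec]
    for spec in existing:
        if spec.spec_id == update.spec.spec_id:
            raise ValueError(f"Partition spec with id {spec.spec_id} already exists")
    return existing + [update.spec]


# The staged table from the first request already carries spec_id=0.
staged_specs = [ToySpec(spec_id=0)]

try:
    apply_add_spec(staged_specs, ToyAddSpecUpdate(spec=ToySpec(spec_id=0)))
    outcome = "applied"
except ValueError as exc:
    outcome = f"rejected: {exc}"

print(outcome)
```

In this sketch the same update goes through when `initial_change=True` or when the existing spec list is empty, which is why the default value matters for requests that originate from Spark.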
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

