kevinjqliu commented on issue #864: URL: https://github.com/apache/iceberg-python/issues/864#issuecomment-2241733691
Thanks for the pointer! Looks like this comes from the interaction between `CreateTableTransaction`, `StagedTable`, and the table updates.

I found a difference in implementation between Java and Python. The `initial_change` field in the table update classes is Python-specific, and it defaults to `False` for updates in a `CommitTableRequest`.

### Example

I was able to capture the request objects that Spark sends to the REST catalog server. For `CREATE OR REPLACE TABLE`, Spark sends multiple HTTP requests. The two relevant ones are:

#### `POST /v1/namespaces/default/tables`

Request:
```
CreateTableRequest(
    name='test_null_nan',
    location=None,
    schema=Schema(
        NestedField(field_id=0, name='idx', field_type=IntegerType(), required=False),
        NestedField(field_id=1, name='col_numeric', field_type=FloatType(), required=False),
        schema_id=0,
        identifier_field_ids=[]
    ),
    partition_spec=PartitionSpec(spec_id=0),
    write_order=None,
    stage_create=True,
    properties={'owner': 'kevinliu'}
)
```

Response:
```
LoadTableResult(
    metadata_location='s3://warehouse/rest/default.db/test_null_nan/metadata/00000-30f7e048-7033-4d95-a130-e5cd3683d2d1.metadata.json',
    metadata=TableMetadataV2(
        location='s3://warehouse/rest/default.db/test_null_nan',
        table_uuid=UUID('575c87cb-14c2-4480-b9de-d2f55f28d7d8'),
        last_updated_ms=1721514513906,
        last_column_id=2,
        schemas=[
            Schema(
                NestedField(field_id=1, name='idx', field_type=IntegerType(), required=False),
                NestedField(field_id=2, name='col_numeric', field_type=FloatType(), required=False),
                schema_id=0,
                identifier_field_ids=[]
            )
        ],
        current_schema_id=0,
        partition_specs=[
            PartitionSpec(spec_id=0)
        ],
        default_spec_id=0,
        last_partition_id=999,
        properties={'owner': 'kevinliu'},
        current_snapshot_id=None,
        snapshots=[],
        snapshot_log=[],
        metadata_log=[],
        sort_orders=[
            SortOrder(order_id=0)
        ],
        default_sort_order_id=0,
        refs={},
        format_version=2,
        last_sequence_number=0
    ),
    config={'owner': 'kevinliu'}
)
```

#### `POST /v1/namespaces/default/tables/test_null_nan`

```
CommitTableRequest(
    identifier=TableIdentifier(
        namespace=Namespace(root=['default']),
        name='test_null_nan'
    ),
    requirements=[
        AssertCreate(type='assert-create')
    ],
    updates=[
        AssignUUIDUpdate(
            action='assign-uuid',
            uuid=UUID('575c87cb-14c2-4480-b9de-d2f55f28d7d8')
        ),
        UpgradeFormatVersionUpdate(
            action='upgrade-format-version',
            format_version=2
        ),
        AddSchemaUpdate(
            action='add-schema',
            schema_=Schema(
                NestedField(field_id=1, name='idx', field_type=IntegerType(), required=False),
                NestedField(field_id=2, name='col_numeric', field_type=FloatType(), required=False),
                schema_id=0,
                identifier_field_ids=[]
            ),
            last_column_id=2,
            initial_change=False
        ),
        SetCurrentSchemaUpdate(
            action='set-current-schema',
            schema_id=-1
        ),
        AddPartitionSpecUpdate(  # ValueError: Partition spec with id 0 already exists: []
            action='add-spec',
            spec=PartitionSpec(spec_id=0),
            initial_change=False
        ),
        SetDefaultSpecUpdate(
            action='set-default-spec',
            spec_id=-1
        ),
        AddSortOrderUpdate(
            action='add-sort-order',
            sort_order=SortOrder(order_id=0),
            initial_change=False
        ),
        SetDefaultSortOrderUpdate(
            action='set-default-sort-order',
            sort_order_id=-1
        ),
        SetLocationUpdate(
            action='set-location',
            location='s3://warehouse/rest/default.db/test_null_nan'
        ),
        SetPropertiesUpdate(
            action='set-properties',
            updates={'owner': 'kevinliu'}
        ),
        AddSnapshotUpdate(
            action='add-snapshot',
            snapshot=Snapshot(
                snapshot_id=5611385456663920621,
                parent_snapshot_id=None,
                sequence_number=1,
                timestamp_ms=1721514574649,
                manifest_list='s3://warehouse/rest/default.db/test_null_nan/metadata/snap-5611385456663920621-1-5904fa53-d5f7-49c5-a86b-8df022d1dbdf.avro',
                summary=Summary(
                    Operation.APPEND,
                    **{
                        'spark.app.id': 'local-1721514510400',
                        'added-data-files': '3',
                        'added-records': '3',
                        'added-files-size': '1919',
                        'changed-partition-count': '1',
                        'total-records': '3',
                        'total-files-size': '1919',
                        'total-data-files': '3',
                        'total-delete-files': '0',
                        'total-position-deletes': '0',
                        'total-equality-deletes': '0'
                    }
                ),
                schema_id=0
            )
        ),
        SetSnapshotRefUpdate(
            action='set-snapshot-ref',
            ref_name='main',
            type='branch',
            snapshot_id=5611385456663920621,
            max_ref_age_ms=None,
            max_snapshot_age_ms=None,
            min_snapshots_to_keep=None
        )
    ]
)
```

Note that `initial_change` defaults to `False` here because this field isn't present in the request Spark sends.
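To make the note on `initial_change` concrete, here is a minimal sketch of how a Python-only flag ends up `False` when the client never sends it, and how that could lead to the `ValueError` shown on the `add-spec` update. Everything in it is an assumption made for illustration: `AddSpecUpdateSketch`, `parse_add_spec`, `apply_add_spec`, and the `initial-change` key are hypothetical names, not PyIceberg's real classes or wire format, and the duplicate-id rule is inferred from the error message rather than taken from the library.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class AddSpecUpdateSketch:
    """Hypothetical stand-in for an add-spec update; not PyIceberg's real class."""

    spec_id: int
    initial_change: bool = False  # Python-only flag, never sent by Spark


def parse_add_spec(payload: Dict[str, Any]) -> AddSpecUpdateSketch:
    # A key that is missing from the payload falls back to the default (False).
    return AddSpecUpdateSketch(
        spec_id=payload["spec"]["spec-id"],
        initial_change=payload.get("initial-change", False),  # hypothetical key
    )


def apply_add_spec(existing_spec_ids: List[int], update: AddSpecUpdateSketch) -> List[int]:
    # Assumed rule: a duplicate spec id is rejected unless this is the initial change.
    if update.spec_id in existing_spec_ids:
        if not update.initial_change:
            raise ValueError(f"Partition spec with id {update.spec_id} already exists")
        return list(existing_spec_ids)
    return [*existing_spec_ids, update.spec_id]


# Payload shaped like the add-spec update above: no initial-change key at all.
spark_payload = {"action": "add-spec", "spec": {"spec-id": 0, "fields": []}}
update = parse_add_spec(spark_payload)

# Assumption: the metadata being updated already carries spec 0,
# as in the staged-create response.
staged_spec_ids = [0]
try:
    apply_add_spec(staged_spec_ids, update)
    outcome = "applied"
except ValueError as exc:
    outcome = str(exc)
```

Under these assumptions the update parsed from Spark's payload is rejected, while the same update built with `initial_change=True` would pass through, which matches the behavior difference described above.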