Re: [PR] Partition Evolution [iceberg-python]

via GitHub Sun, 25 Feb 2024 19:53:43 -0800


amogh-jahagirdar commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1502013528



##########
tests/catalog/test_hive.py:
##########
@@ -277,7 +277,7 @@ def test_create_table(table_schema_simple: Schema, 
hive_database: HiveDatabase,
             )
         ],
         current_schema_id=0,
-        last_partition_id=1000,
+        last_partition_id=999,

Review Comment:
   @Fokko I took a look and integrated the changes, and after going back/forth 
I think I'd like to keep the changes as is.
   My rationale is that even after those changes there's some more workarounds 
that need to be done to make sure new IDs start at 1000 (after taking the 
changes directly, field IDs for non-REST will start at 1001 without any more 
changes).
   
   Now, technically it does not seem to be a hard requirement that ids need to 
start at 1000 for v2 tables. Even for v1 starting at 1000 does not seem to be a 
requirement, rather just how the Java lib was implemented.
   
   ```
   In v1, partition field IDs were not tracked, but were assigned sequentially 
starting at 1000 in the reference implementation.
   ```
   I read this as "this was how we originally implemented in Java but not 
really required so long as it's unique and increasing for new fields"
   
   All that said, I'd advocate for just following the practice of starting at 
1000 for both v1 and v2 because it's just the established model and avoids any 
confusion.
   
   On returning 999 for unpartitioned spec:
   
   As @HonahX alluded to I think that makes for a logical API, considering we 
want to start a field at 1000, the unpartitioned spec (no fields) last assigned 
field ID should be one less than that. I don't think we want to return 1000 in 
that case. This is also what Spark sets for the unpartitioned spec, and is spec 
compliant (since the spec doesn't mandate any particular IDs)
   
   I could see the argument where we just want to return None for a 
last_assigned_partition_id for this API but that just shifts the responsibility 
to other locations to make sure Ids are set correctly according to our scheme 
which ends up being more messy imo compared to just defining the API to return 
999 if the only spec is the unpartitioned spec (which is the value which would 
be set anyways).
   
   I also think it makes sense to set last-assigned-partition-id for both V1 
and V2. Even though it's optional for V1 we set the other optional fields for 
v1 metadata so it seems a bit odd to make this particular case an exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Partition Evolution [iceberg-python]

Reply via email to