RussellSpitzer commented on issue #12273:
URL: https://github.com/apache/iceberg/issues/12273#issuecomment-2664127254

   So I tracked this down; it is a combination of some fun issues.
   
   TL;DR - 
   We write manifests using a `partition spec` which we create out of whole cloth from the Spark Table
   we are sourcing information from. Since this is a Spark Table, its schema does not have the
   same field IDs as our target table. When "snapshot-id-inheritance" is false, these manifests are
   rewritten and corrected.
   
   
   
   -- Details
   
   When we import to Spark, we first create an Iceberg spec for the partitioning of our source (non-Iceberg) table.
   
   
https://github.com/apache/iceberg/blob/da53495bc1bb52db37cdd1ced5c2377001c9d482/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L542-L543
   
   This code creates an identity partition spec, but it builds that spec from the Spark schema converted into an Iceberg schema. This means that our "spec" here has basically arbitrary field IDs associated with each element. Even if the user is lucky and the column ordering of the source table and the Iceberg target table is the same, they will still have an off-by-1 error.
   
   
https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java#L82-L84
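
   To make the mismatch concrete, here is a toy model in plain Python (not the Iceberg API). It assumes the converted Spark schema numbers top-level columns by position starting at 0, while the target table numbers them starting at 1; both numbering rules, the column names, and every function name are assumptions for illustration, chosen to be consistent with the off-by-1 described above.

```python
# Toy model only: plain Python, not the Iceberg API.
# All names and numbering rules below are illustrative assumptions.

def ids_from_spark_schema(columns):
    """Assumed: the converted schema assigns IDs by position, starting at 0."""
    return {name: i for i, name in enumerate(columns)}


def ids_in_target_table(columns):
    """Assumed: the target Iceberg table assigns IDs by position, starting at 1."""
    return {name: i + 1 for i, name in enumerate(columns)}


def identity_spec(field_ids, partition_columns):
    """Model an identity partition spec as a list of (source_id, name) pairs."""
    return [(field_ids[name], name) for name in partition_columns]


source_columns = ["id", "data", "dt"]

# "Lucky" case: the target table has the same column order as the source.
lucky_target = ["id", "data", "dt"]
# "Unlucky" case: same columns, different order.
unlucky_target = ["dt", "id", "data"]

import_spec = identity_spec(ids_from_spark_schema(source_columns), ["dt"])
lucky_spec = identity_spec(ids_in_target_table(lucky_target), ["dt"])
unlucky_spec = identity_spec(ids_in_target_table(unlucky_target), ["dt"])

print(import_spec)   # [(2, 'dt')]
print(lucky_spec)    # [(3, 'dt')]  -> off by one
print(unlucky_spec)  # [(1, 'dt')]  -> unrelated ID
```

   In both cases the spec built from the Spark schema points at a source ID that does not identify `dt` in the target table.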
   
   This spec is then added directly to all of the manifests in the later part of the import method.
   
   
https://github.com/apache/iceberg/blob/da53495bc1bb52db37cdd1ced5c2377001c9d482/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L805
   
   Each entity here will now have a "spec" with the wrong source IDs. For more fun, the "SpecID" is also basically only correct if we are lucky, since it is always 0.
   
   
   So at its core we have just one issue: the spec we use for building manifests is essentially garbage and only coincidentally matches Iceberg tables. Luckily for us, if "snapshot-id-inheritance" is disabled, we end up rewriting all of these manifests before committing, and when we do so the value for the spec comes from the target Iceberg table and not the source Spark table, leading to correct manifests *as long as the 0th spec is the right target*.
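
   A rough sketch of why the rewrite only helps when spec 0 is the intended one, again as a toy model in plain Python rather than the Iceberg API. `Manifest`, `rewrite_manifest`, and the idea that the rewrite looks the spec up in the target table by the manifest's spec ID are assumptions made for illustration; the field IDs are made up.

```python
# Toy model only: plain Python, not the Iceberg API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    spec_id: int
    spec: tuple  # ((source_id, name), ...)


def rewrite_manifest(manifest, target_specs):
    """Hypothetical rewrite: keep the spec ID, take the spec from the target table."""
    return Manifest(manifest.spec_id, target_specs[manifest.spec_id])


# Manifest as written by the import: spec ID 0 and a spec with a wrong source ID.
imported = Manifest(spec_id=0, spec=((2, "dt"),))

# Target table whose only spec (ID 0) is the intended one: the rewrite fixes it.
simple_specs = {0: ((3, "dt"),)}
fixed = rewrite_manifest(imported, simple_specs)
print(fixed.spec)  # ((3, 'dt'),)

# Target table whose partitioning has evolved; the intended spec is now ID 1.
evolved_specs = {0: ((3, "dt"),), 1: ((1, "id"),)}
still_wrong = rewrite_manifest(imported, evolved_specs)
print(still_wrong.spec == evolved_specs[1])  # False
```

   In the second case the rewritten manifest carries a valid spec from the target table, just not the one the files were meant to be added under.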



