RussellSpitzer commented on issue #12273: URL: https://github.com/apache/iceberg/issues/12273#issuecomment-2664127254
So I tracked this down; it is a combination of some fun issues.

TL;DR: We write manifests using a `partition spec` which we create out of whole cloth from the Spark Table we are sourcing information from. Since this is a Spark Table, its schema does not have the same field-ids as our target table. When "snapshot-id-inheritance" is false, these manifests are rewritten and corrected.

Details:

When we import to Spark, we first create an Iceberg Spec for the partitioning of our Source (non-Iceberg) table.

https://github.com/apache/iceberg/blob/da53495bc1bb52db37cdd1ced5c2377001c9d482/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L542-L543

This code creates an Identity Partition Spec but uses the conversion of the Spark Schema into an Iceberg Schema. This means that our "spec" here has basically arbitrary field ids associated with each element. Even if the user is lucky and the column ordering of the Source table and the Iceberg Target table is the same, they will still have an off-by-1 error.

https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java#L82-L84

This spec is then directly added to all of the manifests in the later part of the import method.

https://github.com/apache/iceberg/blob/da53495bc1bb52db37cdd1ced5c2377001c9d482/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L805

Each entity here will now have a "spec" with the wrong source ids. For more fun, the "SpecID" is also basically only correct if we are lucky, since it is always 0.

So at its core we have just one issue: the spec we use for building manifests is essentially garbage and only coincidentally matches Iceberg tables.
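To make the off-by-1 case concrete, here is a minimal sketch in plain Java. It does not use any Iceberg classes: `FieldIdMismatchSketch`, `assignIds`, `idOf`, and the column names are all hypothetical. It assumes, purely for illustration, that the converted Spark schema numbers its columns starting at 0 while the target Iceberg table numbers the same columns starting at 1.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; none of this is Iceberg API.
class FieldIdMismatchSketch {

  // Assign field ids by column position, starting at firstId.
  static Map<Integer, String> assignIds(List<String> columns, int firstId) {
    Map<Integer, String> byId = new LinkedHashMap<>();
    for (int i = 0; i < columns.size(); i++) {
      byId.put(firstId + i, columns.get(i));
    }
    return byId;
  }

  // Return the id a schema assigned to a column name, or -1 if absent.
  static int idOf(Map<Integer, String> schema, String column) {
    for (Map.Entry<Integer, String> entry : schema.entrySet()) {
      if (entry.getValue().equals(column)) {
        return entry.getKey();
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    List<String> columns = List.of("id", "data", "dt");

    // Assumption: ids in the converted Spark schema start at 0.
    Map<Integer, String> sourceSchema = assignIds(columns, 0);
    // Assumption: ids in the target Iceberg table start at 1.
    Map<Integer, String> targetSchema = assignIds(columns, 1);

    // A spec built from the source schema records the source id of "dt".
    int sourceId = idOf(sourceSchema, "dt");

    // Reading that id against the target schema lands on the wrong column.
    String resolved = targetSchema.get(sourceId);
    System.out.println(
        "source id " + sourceId + " for 'dt' resolves to '" + resolved + "' in the target");
    // prints: source id 2 for 'dt' resolves to 'data' in the target
  }
}
```

Same column order on both sides, and the partition source id still points at the neighbouring column, which is the "lucky" case described above.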
Now luckily for us, if "snapshot-id-inheritance" is disabled, we end up rewriting all these manifests before committing. When we do so, the value for the spec comes from the target Iceberg Table and not the Source Spark Table, leading to correct manifests *as long as the 0'th spec is the right target*.