[GitHub] [iceberg] aokolnychyi commented on issue #7822: CDC data inconsistencies with schema changes

via GitHub Mon, 10 Jul 2023 13:30:32 -0700


aokolnychyi commented on issue #7822:
URL: https://github.com/apache/iceberg/issues/7822#issuecomment-1629687495


   I think it is important to note that Iceberg tracks columns by ID, and the 
last assigned column ID is stored in the table metadata. 
   
   ```
   spark.sql("ALTER TABLE s3_catalog.cdc.test DROP COLUMN name")
   spark.sql("ALTER TABLE s3_catalog.cdc.test ADD COLUMN name STRING")
   ```
   
   The snippet above does NOT restore the original column. Instead, it adds a 
brand-new column with a fresh ID that happens to have the same name as a column 
that was dropped earlier. That's why the CDC procedure, which acts according to 
the current schema, outputs nulls for `name` in old rows. From the Iceberg 
perspective, the original files from the first snapshot never contained the 
current `name` column because its column ID is different.
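
   To make the ID-based behavior concrete, here is a minimal toy model in plain 
Python. It is only a sketch: `ToyTable` and its methods are hypothetical names 
made up for illustration, not Iceberg classes or APIs. It assumes nothing beyond 
what is described above: every added column gets a fresh ID, and old data is 
read through the current schema by ID.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model of ID-based column tracking (hypothetical, not Iceberg code)."""
    last_column_id: int = 0
    schema: dict = field(default_factory=dict)  # column name -> column ID

    def add_column(self, name):
        # Every added column gets a fresh ID; IDs are never reused.
        self.last_column_id += 1
        self.schema[name] = self.last_column_id

    def drop_column(self, name):
        # Dropping removes the name from the schema; last_column_id is unchanged.
        del self.schema[name]

    def write(self, row):
        # Stored data is keyed by column ID, not by column name.
        return {self.schema[name]: value for name, value in row.items()}

    def read(self, stored_row):
        # Project a stored row through the current schema. IDs that the
        # stored row does not contain come back as None (null).
        return {name: stored_row.get(cid) for name, cid in self.schema.items()}


table = ToyTable()
table.add_column("id")     # column ID 1
table.add_column("name")   # column ID 2
old_file_row = table.write({"id": 1, "name": "alice"})

table.drop_column("name")
table.add_column("name")   # column ID 3, not 2

result = table.read(old_file_row)
print(result)  # {'id': 1, 'name': None}
```

   The old row still holds a value under ID 2, but the current schema maps 
`name` to ID 3, so the read returns null for it.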
   
   This behavior is intentional. It avoids surprises when a column is dropped, 
a new one with the same name is added a few years later, and old data suddenly 
starts to appear in the query output. It is possible to restore an old column, 
but not via ALTER TABLE ADD/DROP COLUMN.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

