Vino1016 commented on issue #15215:
URL: https://github.com/apache/iceberg/issues/15215#issuecomment-3882477610
I found the root cause:
### Root Cause Analysis

Flink serialization invalidates `Schema` object references:

1. **Upstream operator**: the same `Schema` instance is reused for every record.
2. **Serialization transfer**: each `DynamicRecord` is serialized and sent to the downstream writer.
3. **Downstream writer**: after deserialization, every parallel instance ends up with a new `Schema` object.
4. **Cache invalidation**: Iceberg's `TableMetadataCache` uses object references as keys, so every lookup misses (see the sketch after this list).
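To make step 4 concrete, here is a minimal, self-contained sketch. It is plain Java, not Iceberg code: `IdentityHashMap` and the `String` keys merely stand in for a reference-keyed cache like `TableMetadataCache` and for `Schema`. Two content-equal objects still miss once the reference changes:

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class ReferenceKeyedCacheMiss {
  public static void main(String[] args) {
    // Keyed by object identity, like a cache that uses Schema references as keys.
    Map<Object, String> cache = new IdentityHashMap<>();

    String upstreamSchema = new String("schema-v1"); // instance reused upstream
    cache.put(upstreamSchema, "cached table metadata");

    // Deserialization yields a content-equal but distinct object downstream.
    String downstreamSchema = new String("schema-v1");

    System.out.println(cache.containsKey(upstreamSchema));   // true  (same reference)
    System.out.println(cache.containsKey(downstreamSchema)); // false (cache miss)
  }
}
```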
### Solution

Cache `Schema` objects in `DynamicRecordGenerator`:

```java
// Each Writer instance maintains its own Schema cache.
private transient Schema cachedSchema;

// After a content comparison, all records reuse the same object reference.
if (cachedSchema != null && cachedSchema.sameSchema(dynamicRecord.schema())) {
  dynamicRecord.setSchema(cachedSchema);
} else {
  cachedSchema = dynamicRecord.schema();
}
```
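For context, here is a slightly fuller sketch of the same interning idea pulled out into its own helper. Only `Schema#sameSchema` is the real Iceberg API here; `SchemaInterner` and its wiring are hypothetical, for illustration only:

```java
import org.apache.iceberg.Schema;

// Hypothetical helper: interns content-equal Schema instances so that every
// record handed downstream carries the same object reference.
class SchemaInterner {
  private transient Schema cachedSchema;

  Schema intern(Schema incoming) {
    if (cachedSchema != null && cachedSchema.sameSchema(incoming)) {
      // Same content as last time: return the cached reference so
      // reference-keyed lookups keep hitting.
      return cachedSchema;
    }
    // First schema seen, or a genuine schema evolution: adopt the new instance.
    cachedSchema = incoming;
    return cachedSchema;
  }
}
```

Because the cached instance is replaced only when `sameSchema` reports a real difference, schema evolution still flows through, while unchanged schemas collapse onto a single reference.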
### Results

- ✅ Performance warnings eliminated
- ✅ 100% cache hit rate
- ✅ Schema evolution detection still works