amogh-jahagirdar commented on code in PR #15268:
URL: https://github.com/apache/iceberg/pull/15268#discussion_r2785123350
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java:
##########
@@ -252,5 +251,40 @@ protected <V> V getOrLoad(String key, Supplier<V> valueSupplier, long valueSize)
       return cache.getOrLoad(table().name(), key, valueSupplier, valueSize);
     }
   }
+
+  // field lookup for serializable tables that assumes fetching historic schemas is expensive
+  private static class FieldLookup implements Function<Integer, Types.NestedField> {
+    private final Table table;
+
+    private volatile Map<Integer, Types.NestedField> historicSchemaFields;
+
+    private FieldLookup(Table table) {
+      this.table = table;
+    }
+
+    @Override
+    public Types.NestedField apply(Integer id) {
+      Types.NestedField field = table.schema().findField(id);
+      return field != null ? field : historicSchemaFields().get(id);
+    }
+
+    private Map<Integer, Types.NestedField> historicSchemaFields() {
+      if (historicSchemaFields == null) {
+        synchronized (this) {
+          if (historicSchemaFields == null) {
+            this.historicSchemaFields = Schema.indexFields(historicSchemas(table));
+          }
+        }
+      }
+
+      return historicSchemaFields;
+    }
+
+    private static Collection<Schema> historicSchemas(Table table) {
+      return table.schemas().values().stream()
+          .filter(schema -> schema.schemaId() != table.schema().schemaId())
+          .collect(Collectors.toList());
+    }
+  }
Review Comment:
We've been loading table metadata on the executors for a while, at least
for other use cases like metadata tables, so we're not quite just doing it
"now". I believe the specific concern is that currently, when executors load
table metadata, they lazily read the table metadata JSON whenever they need
parts of the metadata that are not stored in SerializableTable. For use cases
like server-side planning, there may not be any credentials available to
actually read the metadata JSON directly.
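To make that lazy-read behavior concrete, here's a minimal sketch of the pattern being described; all names here are hypothetical stand-ins, not SerializableTable's actual internals:

```java
import java.util.Map;
import org.apache.iceberg.Schema;

// Hypothetical stand-in for the lazy behavior described above; the class,
// field, and method names are illustrative, not SerializableTable's real API.
class LazyMetadataTable {
  private final String metadataFileLocation; // captured when the table is serialized
  private transient volatile Map<Integer, Schema> schemas; // not serialized

  LazyMetadataTable(String metadataFileLocation) {
    this.metadataFileLocation = metadataFileLocation;
  }

  // Anything not captured in the serialized form (e.g. historic schemas)
  // triggers a read of the metadata JSON on the executor. Without credentials
  // for that file, as under server-side planning, this is where it fails.
  Map<Integer, Schema> schemas() {
    if (schemas == null) {
      this.schemas = readMetadataJson(metadataFileLocation);
    }
    return schemas;
  }

  private static Map<Integer, Schema> readMetadataJson(String location) {
    throw new UnsupportedOperationException(
        "would read " + location + " via FileIO, which requires credentials");
  }
}
```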
I think that's still ultimately a separate issue, and given this change, I
believe we'd hit it specifically when scan planning returns equality deletes,
since resolving their field IDs can fall back to the historic schemas (as in
the FieldLookup above) and therefore force a metadata load.
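For concreteness, a rough sketch of how that path would get exercised; `DeleteFile#equalityFieldIds()` is Iceberg's API, but the wiring here is illustrative rather than the PR's actual call site:

```java
import java.util.function.Function;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.types.Types;

class EqualityDeleteFieldResolution {
  // Resolve each equality field id; an id written against an older schema
  // misses in the current schema and falls through to historicSchemaFields(),
  // which is where a serializable table needs the full metadata JSON.
  static void resolve(DeleteFile delete, Function<Integer, Types.NestedField> fieldLookup) {
    for (Integer fieldId : delete.equalityFieldIds()) {
      Types.NestedField field = fieldLookup.apply(fieldId);
      System.out.println("field " + fieldId + " -> " + field);
    }
  }
}
```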
I think we can look at that as needed so that executors can resolve table
metadata through different approaches, not just by reading the metadata file:
the naive approach is to just broadcast the whole serializable table for those
cases, or, more lazily, resolve the metadata by issuing load table requests.
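As a sketch of those two alternatives (`MetadataResolver` and both implementations are made-up names, not an existing Iceberg abstraction; only `Catalog#loadTable` is a real API):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Made-up abstraction over the alternatives mentioned above.
interface MetadataResolver {
  Table resolve();
}

// Naive approach: ship the fully materialized table (historic schemas and all)
// to executors, e.g. via a Spark broadcast, so no metadata file read is needed.
class BroadcastResolver implements MetadataResolver {
  private final Table broadcastTable;

  BroadcastResolver(Table broadcastTable) {
    this.broadcastTable = broadcastTable;
  }

  @Override
  public Table resolve() {
    return broadcastTable;
  }
}

// Lazier approach: issue a load table request on first use instead of reading
// the metadata JSON directly, so catalog-vended credentials still apply.
class LoadTableResolver implements MetadataResolver {
  private final Catalog catalog;
  private final TableIdentifier identifier;
  private volatile Table table;

  LoadTableResolver(Catalog catalog, TableIdentifier identifier) {
    this.catalog = catalog;
    this.identifier = identifier;
  }

  @Override
  public Table resolve() {
    if (table == null) {
      synchronized (this) {
        if (table == null) {
          this.table = catalog.loadTable(identifier);
        }
      }
    }
    return table;
  }
}
```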
Since the immediate use cases for scan planning tend not to return equality
deletes, and the worst case is a failure anyway, I don't think this is
something to concern ourselves with for the 1.11 release, but it is something
we should fix at some point.