amogh-jahagirdar commented on code in PR #15268:
URL: https://github.com/apache/iceberg/pull/15268#discussion_r2785123350
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java:
##########
@@ -252,5 +251,40 @@ protected <V> V getOrLoad(String key, Supplier<V> valueSupplier, long valueSize)
       return cache.getOrLoad(table().name(), key, valueSupplier, valueSize);
     }
   }
+
+  // field lookup for serializable tables that assumes fetching historic schemas is expensive
+  private static class FieldLookup implements Function<Integer, Types.NestedField> {
+    private final Table table;
+
+    private volatile Map<Integer, Types.NestedField> historicSchemaFields;
+
+    private FieldLookup(Table table) {
+      this.table = table;
+    }
+
+    @Override
+    public Types.NestedField apply(Integer id) {
+      Types.NestedField field = table.schema().findField(id);
+      return field != null ? field : historicSchemaFields().get(id);
+    }
+
+    private Map<Integer, Types.NestedField> historicSchemaFields() {
+      if (historicSchemaFields == null) {
+        synchronized (this) {
+          if (historicSchemaFields == null) {
+            this.historicSchemaFields = Schema.indexFields(historicSchemas(table));
+          }
+        }
+      }
+
+      return historicSchemaFields;
+    }
+
+    private static Collection<Schema> historicSchemas(Table table) {
+      return table.schemas().values().stream()
+          .filter(schema -> schema.schemaId() != table.schema().schemaId())
+          .collect(Collectors.toList());
+    }
+  }
Review Comment:
We've been loading table metadata on the executors for a while, at least
for other use cases like metadata tables, so we're not quite just doing it
"now". I believe the specific concern is that currently, when executors load
table metadata, they lazily read the table metadata JSON whenever they need
parts of the metadata that are not stored in SerializableTable. For use cases
like server-side planning, there may not be any credentials available to
actually read the metadata JSON directly.
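To make that lazy-read behavior concrete, here's a minimal sketch of the pattern being described; all names here are hypothetical stand-ins, not SerializableTable's actual internals:

```java
import java.util.Map;
import org.apache.iceberg.Schema;

// Hypothetical stand-in for the lazy behavior described above; the class,
// field, and method names are illustrative, not SerializableTable's real API.
class LazyMetadataTable {
  private final String metadataFileLocation; // captured when the table is serialized
  private transient volatile Map<Integer, Schema> schemas; // not serialized

  LazyMetadataTable(String metadataFileLocation) {
    this.metadataFileLocation = metadataFileLocation;
  }

  // Anything not captured in the serialized form (e.g. historic schemas)
  // triggers a read of the metadata JSON on the executor. Without credentials
  // for that file, as under server-side planning, this is where it fails.
  Map<Integer, Schema> schemas() {
    if (schemas == null) {
      this.schemas = readMetadataJson(metadataFileLocation);
    }
    return schemas;
  }

  private static Map<Integer, Schema> readMetadataJson(String location) {
    throw new UnsupportedOperationException(
        "would read " + location + " via FileIO, which requires credentials");
  }
}
```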
I think that's still ultimately a separate issue, and given this change, I
believe we'd hit it specifically when scan planning returns equality deletes,
since resolving their field IDs can fall back to the historic schemas (as in
the FieldLookup above) and therefore force a metadata load.
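For concreteness, a rough sketch of how that path would get exercised; `DeleteFile#equalityFieldIds()` is Iceberg's API, but the wiring here is illustrative rather than the PR's actual call site:

```java
import java.util.function.Function;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.types.Types;

class EqualityDeleteFieldResolution {
  // Resolve each equality field id; an id written against an older schema
  // misses in the current schema and falls through to historicSchemaFields(),
  // which is where a serializable table needs the full metadata JSON.
  static void resolve(DeleteFile delete, Function<Integer, Types.NestedField> fieldLookup) {
    for (Integer fieldId : delete.equalityFieldIds()) {
      Types.NestedField field = fieldLookup.apply(fieldId);
      System.out.println("field " + fieldId + " -> " + field);
    }
  }
}
```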
I think we can look at that as needed so that executors can resolve table
metadata through different approaches, not just by reading the metadata file:
the naive approach is to just broadcast the whole serializable table for those
cases, or, more lazily, resolve the metadata by issuing load table requests.
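As a sketch of those two alternatives (`MetadataResolver` and both implementations are made-up names, not an existing Iceberg abstraction; only `Catalog#loadTable` is a real API):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Made-up abstraction over the alternatives mentioned above.
interface MetadataResolver {
  Table resolve();
}

// Naive approach: ship the fully materialized table (historic schemas and all)
// to executors, e.g. via a Spark broadcast, so no metadata file read is needed.
class BroadcastResolver implements MetadataResolver {
  private final Table broadcastTable;

  BroadcastResolver(Table broadcastTable) {
    this.broadcastTable = broadcastTable;
  }

  @Override
  public Table resolve() {
    return broadcastTable;
  }
}

// Lazier approach: issue a load table request on first use instead of reading
// the metadata JSON directly, so catalog-vended credentials still apply.
class LoadTableResolver implements MetadataResolver {
  private final Catalog catalog;
  private final TableIdentifier identifier;
  private volatile Table table;

  LoadTableResolver(Catalog catalog, TableIdentifier identifier) {
    this.catalog = catalog;
    this.identifier = identifier;
  }

  @Override
  public Table resolve() {
    if (table == null) {
      synchronized (this) {
        if (table == null) {
          this.table = catalog.loadTable(identifier);
        }
      }
    }
    return table;
  }
}
```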
Since the immediate use cases for scan planning tend not to return equality
deletes, and the worst case is a failure anyway, I don't think this is
something to concern ourselves with for the 1.11 release, but it is something
we should fix at some point.