przemekd commented on PR #9029: URL: https://github.com/apache/iceberg/pull/9029#issuecomment-1814274741
@aokolnychyi Sure. The problem is briefly described in the linked issue: https://github.com/apache/iceberg/issues/9025. I am not creating the `SerializableTable` instance directly in my code; this is what Spark does (and most likely other query engines do) when reading an Iceberg table. Let me explain the problem in detail.

Scenario:
1. Assume there is a new implementation of a custom location provider, `CustomLocationProvider`, packaged into a `custom-provider.jar` (a sketch of what such a provider might look like is included at the end of this comment).
2. A Spark job creates a new Iceberg table `iceberg_table_with_custom_location_provider` with the custom location provider by setting the `write.location-provider.impl` property on that table. The job supplies the implementation of this provider (via the `--jars custom-provider.jar` option) and successfully writes new records into the table.
3. Now a data analyst wants to query this table. She starts `spark-sql` and tries to execute the query `SELECT * FROM iceberg_table_with_custom_location_provider LIMIT 10`. She does not provide the implementation of the location provider that was set on the table, so the query fails with this error:

```
java.lang.IllegalArgumentException: Unable to find a constructor for implementation org.iceberg.custom.LocationProviders$DefaultLocationProvider of interface org.apache.iceberg.io.LocationProvider. Make sure the implementation is in classpath, and that it either has a public no-arg constructor or a two-arg constructor taking in the string base table location and its property string map.
	at org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:52)
	at org.apache.iceberg.BaseMetastoreTableOperations.locationProvider(BaseMetastoreTableOperations.java:243)
	at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:245)
	at org.apache.iceberg.SerializableTable.<init>(SerializableTable.java:84)
	at org.apache.iceberg.spark.source.SerializableTableWithSize.<init>(SerializableTableWithSize.java:36)
	at org.apache.iceberg.spark.source.SerializableTableWithSize.copyOf(SerializableTableWithSize.java:48)
	at org.apache.iceberg.spark.source.SparkBatch.planInputPartitions(SparkBatch.java:77)
	at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputPartitions$lzycompute(BatchScanExec.scala:63)
```

The problem is that Spark needs to create a `SerializableTable` object to plan input partitions for the read, and it cannot create one without the location provider implementation on the classpath. That is because the location provider field is eagerly populated, even when it is not needed for the read path (see the lazy-initialization sketch after the provider example below).
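For context, here is a minimal sketch of what the custom provider from step 1 could look like. The class name and path layout are hypothetical; the constructor follows the two-arg contract quoted in the error message (the string base table location plus the table's property string map):

```java
import java.util.Map;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.LocationProvider;

// Hypothetical provider packaged into custom-provider.jar; the class name
// and the "/custom-data/" layout are illustrative only.
public class CustomLocationProvider implements LocationProvider {
  private final String tableLocation;

  // Matches the two-arg constructor contract from the error message:
  // base table location plus the table's property map.
  public CustomLocationProvider(String tableLocation, Map<String, String> properties) {
    this.tableLocation = tableLocation;
  }

  @Override
  public String newDataLocation(String filename) {
    return tableLocation + "/custom-data/" + filename;
  }

  @Override
  public String newDataLocation(PartitionSpec spec, StructLike partitionData, String filename) {
    return tableLocation + "/custom-data/" + spec.partitionToPath(partitionData) + "/" + filename;
  }
}
```

The writing job in step 2 would register it with something like `table.updateProperties().set("write.location-provider.impl", CustomLocationProvider.class.getName()).commit();` and ship the jar to the cluster via `--jars custom-provider.jar`.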
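And to illustrate the eager-vs-lazy distinction, here is a minimal sketch of deferring the lookup to first use. The wrapper class and field names are made up for the example and do not mirror the actual `SerializableTable` internals or this PR's diff; the point is only that the classpath-sensitive reflection would run when a writer asks for the provider, not during read planning:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.io.LocationProvider;

// Sketch only: defers the classpath-sensitive lookup until a caller
// actually needs the provider, so plain reads never trigger it.
public class LazyLocationProviderHolder {
  private final Table table;
  private transient volatile LocationProvider lazyLocationProvider;

  public LazyLocationProviderHolder(Table table) {
    this.table = table;
  }

  public LocationProvider locationProvider() {
    if (lazyLocationProvider == null) {
      synchronized (this) {
        if (lazyLocationProvider == null) {
          // Table.locationProvider() is where the reflective lookup in
          // LocationProviders.locationsFor() happens; calling it here means
          // only code paths that write data pay that cost.
          lazyLocationProvider = table.locationProvider();
        }
      }
    }
    return lazyLocationProvider;
  }
}
```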