przemekd commented on PR #9029:
URL: https://github.com/apache/iceberg/pull/9029#issuecomment-1814274741

   @aokolnychyi Sure. The problem is briefly described in the linked issue: 
https://github.com/apache/iceberg/issues/9025 
   I am not creating the `SerializableTable` instance directly in my code; that is 
what Spark does (and most likely some other query engines do) when reading an 
Iceberg table. Let me explain the problem in detail.
   
   Scenario:
   1. Let's assume there is a new implementation of a custom location provider, 
`CustomLocationProvider`, packaged in `custom-provider.jar`.
   2. A Spark job creates a new Iceberg table 
`iceberg_table_with_custom_location_provider` with that custom location provider by 
setting the `write.location-provider.impl` property on the table. The job supplies 
the provider implementation (via the `--jars custom-provider.jar` 
option) and successfully writes new records into the table.
   3. Now a data analyst wants to query this table. She starts `spark-sql` and 
tries to execute the query `SELECT * FROM 
iceberg_table_with_custom_location_provider LIMIT 10`. She does not provide the 
location provider implementation that was set and used for this table, so she 
is unable to get query results because of the error:
   ```
   java.lang.IllegalArgumentException: Unable to find a constructor for 
implementation org.iceberg.custom.LocationProviders$DefaultLocationProvider of 
interface org.apache.iceberg.io.LocationProvider. Make sure the implementation 
is in classpath, and that it either has a public no-arg constructor or a 
two-arg constructor taking in the string base table location and its property 
string map.
           at 
org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:52)
           at 
org.apache.iceberg.BaseMetastoreTableOperations.locationProvider(BaseMetastoreTableOperations.java:243)
           at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:245)
           at 
org.apache.iceberg.SerializableTable.<init>(SerializableTable.java:84)
           at 
org.apache.iceberg.spark.source.SerializableTableWithSize.<init>(SerializableTableWithSize.java:36)
           at 
org.apache.iceberg.spark.source.SerializableTableWithSize.copyOf(SerializableTableWithSize.java:48)
           at 
org.apache.iceberg.spark.source.SparkBatch.planInputPartitions(SparkBatch.java:77)
           at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputPartitions$lzycompute(BatchScanExec.scala:63)
   ```
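   To make the constructor requirement in that error message concrete, here is a 
minimal sketch of a provider with the two-arg constructor (base table location plus 
the table property map). The interface below is a local stand-in for 
`org.apache.iceberg.io.LocationProvider`, and the class name and path layout are 
hypothetical:

   ```java
   import java.util.Map;

   // Local stand-in for org.apache.iceberg.io.LocationProvider (assumed shape;
   // the real interface lives in the iceberg-api module).
   interface LocationProvider {
     String newDataLocation(String filename);
   }

   // Hypothetical custom provider. Iceberg instantiates it reflectively, which
   // is why the error message demands either a public no-arg constructor or
   // this two-arg constructor.
   public class CustomLocationProvider implements LocationProvider {
     private final String tableLocation;

     public CustomLocationProvider(String tableLocation, Map<String, String> properties) {
       this.tableLocation = tableLocation;
     }

     @Override
     public String newDataLocation(String filename) {
       // Example layout: place data files under a flat "custom-data" prefix.
       return tableLocation + "/custom-data/" + filename;
     }
   }
   ```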
   The problem is that Spark needs to create a `SerializableTable` object to plan 
input partitions for the read, and it cannot create one without the location 
provider implementation on the classpath. That is because the location provider 
field is eagerly populated, even when it is not needed. 
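   One way to avoid that eager failure would be to defer the provider lookup until 
it is actually used, so read-only paths never trigger the reflective lookup. A 
generic lazy-holder sketch (hypothetical, not the actual `SerializableTable` code) 
illustrates the idea:

   ```java
   import java.util.function.Supplier;

   // Sketch of deferred resolution: the supplier (e.g. the reflective provider
   // lookup) runs only on first get(), not at construction time, so a table
   // that is only read never needs the provider class on the classpath.
   public class LazyHolder<T> {
     private final Supplier<T> loader;
     private volatile T value;

     public LazyHolder(Supplier<T> loader) {
       this.loader = loader;
     }

     public T get() {
       if (value == null) {
         synchronized (this) {
           if (value == null) {
             value = loader.get(); // resolved at most once, on first use
           }
         }
       }
       return value;
     }
   }
   ```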


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

