przemekd commented on PR #9029:
URL: https://github.com/apache/iceberg/pull/9029#issuecomment-1814274741

   @aokolnychyi Sure. The problem is briefly described in the linked issue: 
https://github.com/apache/iceberg/issues/9025 
   I am not creating the `SerializableTable` instance directly in my code; that is 
what Spark does (and most likely some other query engines do) when reading an 
Iceberg table. Let me explain the problem in detail.
   
   Scenario:
   1. Let's assume there is a new implementation of a custom location provider, 
`CustomLocationProvider`, packaged in `custom-provider.jar`.
   2. A Spark job creates a new Iceberg table 
`iceberg_table_with_custom_location_provider` with that custom location provider by 
setting the `write.location-provider.impl` property on the table. The job supplies 
the provider implementation (via the `--jars custom-provider.jar` 
option) and successfully writes new records into the table.
   3. Now a data analyst wants to query this table. She starts `spark-sql` and 
tries to execute the query `SELECT * FROM 
iceberg_table_with_custom_location_provider LIMIT 10`. She does not provide the 
location provider implementation that was set and used for this table, so she 
is unable to get query results because of the error:
   ```
   java.lang.IllegalArgumentException: Unable to find a constructor for 
implementation org.iceberg.custom.LocationProviders$DefaultLocationProvider of 
interface org.apache.iceberg.io.LocationProvider. Make sure the implementation 
is in classpath, and that it either has a public no-arg constructor or a 
two-arg constructor taking in the string base table location and its property 
string map.
           at 
org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:52)
           at 
org.apache.iceberg.BaseMetastoreTableOperations.locationProvider(BaseMetastoreTableOperations.java:243)
           at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:245)
           at 
org.apache.iceberg.SerializableTable.<init>(SerializableTable.java:84)
           at 
org.apache.iceberg.spark.source.SerializableTableWithSize.<init>(SerializableTableWithSize.java:36)
           at 
org.apache.iceberg.spark.source.SerializableTableWithSize.copyOf(SerializableTableWithSize.java:48)
           at 
org.apache.iceberg.spark.source.SparkBatch.planInputPartitions(SparkBatch.java:77)
           at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputPartitions$lzycompute(BatchScanExec.scala:63)
   ```
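   To make the constructor requirement in that error message concrete, here is a 
minimal sketch of a provider with the two-arg constructor (base table location plus 
the table property map). The interface below is a local stand-in for 
`org.apache.iceberg.io.LocationProvider`, and the class name and path layout are 
hypothetical:

   ```java
   import java.util.Map;

   // Local stand-in for org.apache.iceberg.io.LocationProvider (assumed shape;
   // the real interface lives in the iceberg-api module).
   interface LocationProvider {
     String newDataLocation(String filename);
   }

   // Hypothetical custom provider. Iceberg instantiates it reflectively, which
   // is why the error message demands either a public no-arg constructor or
   // this two-arg constructor.
   public class CustomLocationProvider implements LocationProvider {
     private final String tableLocation;

     public CustomLocationProvider(String tableLocation, Map<String, String> properties) {
       this.tableLocation = tableLocation;
     }

     @Override
     public String newDataLocation(String filename) {
       // Example layout: place data files under a flat "custom-data" prefix.
       return tableLocation + "/custom-data/" + filename;
     }
   }
   ```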
   The problem is that Spark needs to create a `SerializableTable` object to plan 
input partitions for the read, and it cannot create one without the location 
provider implementation on the classpath. That is because the location provider 
field is eagerly populated, even when it is not needed. 
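   One way to avoid that eager failure would be to defer the provider lookup until 
it is actually used, so read-only paths never trigger the reflective lookup. A 
generic lazy-holder sketch (hypothetical, not the actual `SerializableTable` code) 
illustrates the idea:

   ```java
   import java.util.function.Supplier;

   // Sketch of deferred resolution: the supplier (e.g. the reflective provider
   // lookup) runs only on first get(), not at construction time, so a table
   // that is only read never needs the provider class on the classpath.
   public class LazyHolder<T> {
     private final Supplier<T> loader;
     private volatile T value;

     public LazyHolder(Supplier<T> loader) {
       this.loader = loader;
     }

     public T get() {
       if (value == null) {
         synchronized (this) {
           if (value == null) {
             value = loader.get(); // resolved at most once, on first use
           }
         }
       }
       return value;
     }
   }
   ```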


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

