wypoon commented on PR #7469: URL: https://github.com/apache/iceberg/pull/7469#issuecomment-1535522018
The terminology gets confusing because there are two concepts of catalog here: Spark's concept of catalog (`org.apache.spark.sql.connector.catalog.TableCatalog`) and Iceberg's concept of catalog (`org.apache.iceberg.catalog.Catalog`). `org.apache.iceberg.nessie.NessieCatalog` is an `org.apache.iceberg.catalog.Catalog`; `org.apache.iceberg.spark.SparkCatalog` is an `org.apache.spark.sql.connector.catalog.TableCatalog`.

I have not used Nessie and am not familiar with it, so I did not know that the Nessie catalog is case sensitive. If I understand @snazy correctly, a Spark user can use `SparkCatalog` to work with the Nessie integration for Iceberg by configuring a Spark catalog, call it `nessie`, setting `spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog` and `spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog`, among other confs. By default, `SparkCatalog` wraps the underlying catalog in a `CachingCatalog` whose case sensitivity matches `spark.sql.caseSensitive` (which defaults to false), and this causes a problem due to Nessie's case sensitivity. Of course, the problem can be addressed by setting `spark.sql.catalog.nessie.cache.case-sensitive=true`.

I agree with @szehon-ho that the simplest remedy is to introduce a default of true for `CatalogProperties.CACHE_CASE_SENSITIVE` and use that instead of the value of `spark.sql.caseSensitive`. This restores the default behavior and allows configurability. For non-Nessie users, like users of the `HiveCatalog`, the current behavior is actually very counterintuitive and baffling (see https://github.com/apache/iceberg/issues/7474), and a default of false would make more sense. However, I'm fine to go with true to avoid breaking existing use cases. At the end of the day, some users are going to have to use a conf to get the non-default behavior, whether Nessie users or others. (Of course, I'd prefer that our users not have to use a conf!)
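For concreteness, the configuration described above would look something like this. The `uri`, `ref`, and `warehouse` values are placeholders for illustration (adjust them for your Nessie server), and the last line is the explicit opt-in that keeps the cache case sensitive to match Nessie:

```properties
# Register a Spark catalog named "nessie", backed by Iceberg's NessieCatalog.
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog

# Placeholder connection settings for a local Nessie server.
spark.sql.catalog.nessie.uri=http://localhost:19120/api/v1
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=/tmp/warehouse

# Keep the CachingCatalog case sensitive to match Nessie's semantics.
spark.sql.catalog.nessie.cache.case-sensitive=true
```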
Incidentally, `spark.sql.caseSensitive` does not apply only to column names. It applies to database and table names too. E.g., in `org.apache.spark.sql.catalyst.catalog.SessionCatalog`, `loadTable` calls `formatDatabaseName` and `formatTableName` (which both depend on the value of `spark.sql.caseSensitive`) before calling the `ExternalCatalog` to load the table.
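To make the failure mode concrete, here is a small self-contained sketch (my own illustration, not Iceberg's actual `CachingCatalog` code) of how a case-insensitive cache key in front of a case-sensitive catalog can serve the wrong table:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a cache whose key normalization may disagree with the
// case sensitivity of the backing catalog it fronts.
public class CacheKeySketch {
  private final Map<String, String> cache = new HashMap<>();
  private final boolean caseSensitive;

  CacheKeySketch(boolean caseSensitive) {
    this.caseSensitive = caseSensitive;
  }

  // Mirrors the idea behind SessionCatalog.formatTableName: lowercase the
  // identifier when name resolution is case insensitive.
  private String key(String name) {
    return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
  }

  // Load through the cache, falling back to the backing catalog on a miss.
  String load(String name, Map<String, String> backingCatalog) {
    return cache.computeIfAbsent(key(name), k -> backingCatalog.get(name));
  }

  public static void main(String[] args) {
    // A case-sensitive backing catalog (Nessie-like) with two distinct tables.
    Map<String, String> backing = new HashMap<>();
    backing.put("db.Tbl", "metadata-A");
    backing.put("db.tbl", "metadata-B");

    // Case-insensitive caching collapses both names onto one cache entry.
    CacheKeySketch insensitive = new CacheKeySketch(false);
    System.out.println(insensitive.load("db.Tbl", backing)); // metadata-A
    System.out.println(insensitive.load("db.tbl", backing)); // metadata-A: wrong table served from cache

    // Case-sensitive caching keeps the two tables distinct.
    CacheKeySketch sensitive = new CacheKeySketch(true);
    System.out.println(sensitive.load("db.Tbl", backing)); // metadata-A
    System.out.println(sensitive.load("db.tbl", backing)); // metadata-B
  }
}
```

The second lookup in the case-insensitive run is the Nessie problem in miniature: the cache hit hides the fact that `db.tbl` is a different table in the underlying catalog.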
