wypoon commented on PR #7469: URL: https://github.com/apache/iceberg/pull/7469#issuecomment-1535522018
The terminology gets confusing because there are two concepts of catalog here: Spark's concept of catalog (`org.apache.spark.sql.connector.catalog.TableCatalog`) and Iceberg's concept of catalog (`org.apache.iceberg.catalog.Catalog`). `org.apache.iceberg.nessie.NessieCatalog` is an `org.apache.iceberg.catalog.Catalog`; `org.apache.iceberg.spark.SparkCatalog` is an `org.apache.spark.sql.connector.catalog.TableCatalog`.

I have not used Nessie and am not familiar with it, so I did not know that the Nessie catalog is case sensitive. If I understand @snazy correctly, a Spark user can use `SparkCatalog` to work with the Nessie integration for Iceberg by configuring a Spark catalog, call it `nessie`, setting `spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog` and `spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog`, among other confs. By default, `SparkCatalog` wraps the underlying catalog in a `CachingCatalog` whose case sensitivity matches `spark.sql.caseSensitive` (which defaults to false), and this causes a problem due to Nessie's case sensitivity. Of course, the problem can be addressed by setting `spark.sql.catalog.nessie.cache.case-sensitive=true`.

I agree with @szehon-ho that the simplest remedy is to introduce a default of true for `CatalogProperties.CACHE_CASE_SENSITIVE` and use that instead of the value of `spark.sql.caseSensitive`. This restores the default behavior and allows configurability. For non-Nessie users, like users of the `HiveCatalog`, the current behavior is actually very counterintuitive and baffling (see https://github.com/apache/iceberg/issues/7474), and a default of false would make more sense. However, I'm fine to go with true to avoid breaking existing use cases. At the end of the day, some users are going to have to use a conf to get the non-default behavior, whether Nessie users or others. (Of course, I'd prefer that our users not have to use a conf!)
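For concreteness, the configuration described above would look something like this. The `uri`, `ref`, and `warehouse` values are placeholders for illustration (adjust them for your Nessie server), and the last line is the explicit opt-in that keeps the cache case sensitive to match Nessie:

```properties
# Register a Spark catalog named "nessie", backed by Iceberg's NessieCatalog.
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog

# Placeholder connection settings for a local Nessie server.
spark.sql.catalog.nessie.uri=http://localhost:19120/api/v1
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=/tmp/warehouse

# Keep the CachingCatalog case sensitive to match Nessie's semantics.
spark.sql.catalog.nessie.cache.case-sensitive=true
```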
Incidentally, `spark.sql.caseSensitive` does not apply only to column names. It applies to database and table names too. E.g., in `org.apache.spark.sql.catalyst.catalog.SessionCatalog`, `loadTable` calls `formatDatabaseName` and `formatTableName` (which both depend on the value of `spark.sql.caseSensitive`) before calling the `ExternalCatalog` to load the table.
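To make the failure mode concrete, here is a small self-contained sketch (my own illustration, not Iceberg's actual `CachingCatalog` code) of how a case-insensitive cache key in front of a case-sensitive catalog can serve the wrong table:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a cache whose key normalization may disagree with the
// case sensitivity of the backing catalog it fronts.
public class CacheKeySketch {
  private final Map<String, String> cache = new HashMap<>();
  private final boolean caseSensitive;

  CacheKeySketch(boolean caseSensitive) {
    this.caseSensitive = caseSensitive;
  }

  // Mirrors the idea behind SessionCatalog.formatTableName: lowercase the
  // identifier when name resolution is case insensitive.
  private String key(String name) {
    return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
  }

  // Load through the cache, falling back to the backing catalog on a miss.
  String load(String name, Map<String, String> backingCatalog) {
    return cache.computeIfAbsent(key(name), k -> backingCatalog.get(name));
  }

  public static void main(String[] args) {
    // A case-sensitive backing catalog (Nessie-like) with two distinct tables.
    Map<String, String> backing = new HashMap<>();
    backing.put("db.Tbl", "metadata-A");
    backing.put("db.tbl", "metadata-B");

    // Case-insensitive caching collapses both names onto one cache entry.
    CacheKeySketch insensitive = new CacheKeySketch(false);
    System.out.println(insensitive.load("db.Tbl", backing)); // metadata-A
    System.out.println(insensitive.load("db.tbl", backing)); // metadata-A: wrong table served from cache

    // Case-sensitive caching keeps the two tables distinct.
    CacheKeySketch sensitive = new CacheKeySketch(true);
    System.out.println(sensitive.load("db.Tbl", backing)); // metadata-A
    System.out.println(sensitive.load("db.tbl", backing)); // metadata-B
  }
}
```

The second lookup in the case-insensitive run is the Nessie problem in miniature: the cache hit hides the fact that `db.tbl` is a different table in the underlying catalog.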
