[I] HiveCatalog client slow in some requests [iceberg]

via GitHub Tue, 21 Jan 2025 10:40:14 -0800


jotarada opened a new issue, #12024:
URL: https://github.com/apache/iceberg/issues/12024


   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have a schema that contains a huge number of tables (8k+), and we see 
timeouts when listing it through the Iceberg HiveCatalog implementation, while 
the default Spark implementation is very fast.
   Example:
   If we start a Spark session with this configuration:
   
   ```
   pyspark --master yarn \
     --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3 \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
     --conf spark.sql.catalog.spark_catalog.type=hive \
     --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.iceberg.type=hive
   ```
   and run `spark.sql("show tables in some_schema").show()`, it takes roughly 
15 seconds, because it uses the Spark implementation to access the Hive tables. 
We can see that in our metastore logs:
   ```
   INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
   INFO 2025-01-21T18:25:24.565000057Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
   INFO 2025-01-21T18:25:24.571000099Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
   INFO 2025-01-21T18:25:24.571000099Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
   INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
   INFO 2025-01-21T18:25:24.579999923Z map[class:metastore.HiveMetaStore log:26: source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
   ```
   
   But if we run `spark.sql("show tables in iceberg.some_schema").show()`, it 
takes up to 5 minutes, and the logs show that a different metastore method was 
called:
   
   ```
   INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
   INFO 2025-01-21T18:29:49.118000030Z map[class:metastore.HiveMetaStore log:135: source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
   ```
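
To compare which metastore methods each query triggers, the audit lines can be reduced to their method names. This is only a minimal sketch, assuming the audit line layout quoted in the two log excerpts above; the helper name `metastore_calls` and the regular expression are illustrative and not part of Hive or Iceberg.

```python
import re
from collections import Counter

# Assumption: audit lines contain "cmd=source:<ip> <method>: <args>",
# as in the excerpts quoted in this report. The pattern is illustrative only.
AUDIT_CMD = re.compile(r"cmd=source:\S+\s+(\w+):")


def metastore_calls(log_text):
    """Count the metastore method names found in HiveMetaStore.audit lines."""
    return Counter(AUDIT_CMD.findall(log_text))


# Audit lines copied from the `show tables in some_schema` run.
spark_catalog_log = """
INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
"""

# Audit line copied from the `show tables in iceberg.some_schema` run.
iceberg_catalog_log = """
INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
"""

spark_calls = metastore_calls(spark_catalog_log)
iceberg_calls = metastore_calls(iceberg_catalog_log)

print("spark_catalog:", dict(spark_calls))
print("iceberg:", dict(iceberg_calls))
```

On the excerpts above, this shows `get_database` and `get_tables` for the session catalog and `get_all_tables` for the `iceberg` catalog. The excerpts may not be the full set of calls each query made.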
   
   Tested on Spark 3.3 and 3.5.
   From what I could read in the Iceberg code, the behavior appears to be the 
same in Iceberg 1.7.X.
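
A minimal sketch for reproducing the timing comparison from a PySpark shell started with the configuration above. `timed` is a hypothetical helper introduced here for illustration; the two Spark statements are the ones quoted in this report and are left commented out because they need a live session and metastore.

```python
import time


def timed(label, action):
    """Run `action`, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


# In a PySpark shell (`spark` is the session object the shell provides):
# timed("spark_catalog", lambda: spark.sql("show tables in some_schema").show())
# timed("iceberg", lambda: spark.sql("show tables in iceberg.some_schema").show())
```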
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

