GabeChurch opened a new issue, #6667: URL: https://github.com/apache/iceberg/issues/6667
### Query engine

Spark

### Question

I have a situation where I need to make high(ish)-frequency writes to a single Iceberg table from multiple Spark jobs, and multiple times per job. I run into Hive metastore locks leading to failures, and need to fine-tune lock timeout and retry settings.

Note: my Hive metastore (version 3.1.2) is highly available, backed by a Postgres RDBMS, and has significant resources; it is not my bottleneck. I also have `metastore.txn.timeout` set to 1200 in the `hive-site.xml` used to configure my metastore.

Configuring Iceberg Hive table locks from the Spark side is not exactly clear from the docs. They discuss that you can use Hadoop configuration settings, but Spark is not explicitly mentioned, so it's challenging for users to know whether these are supported through Spark configuration at runtime or must be persisted on disk in a physical Hadoop conf.

https://github.com/apache/iceberg/blob/fede493d59f17ff2bfc0744b296d90bd36130386/docs/configuration.md

Based on the docs and common sense, I would assume the following Spark Hadoop config overrides would work, but they do not appear to be effective:

```
spark.hadoop.iceberg.hive.metadata-refresh-max-retries  60
spark.hadoop.iceberg.hive.lock-timeout-ms               800000
spark.hadoop.iceberg.hive.lock-creation-timeout-ms      800000
```

Another confusing point is that the docs mention Hadoop configurations can be passed in "per Spark catalog", but after deep-diving the docs it's difficult to tell whether these Hive lock Hadoop configs can be passed through a Spark catalog. I also spent some time looking through the source code, but it's still unclear to me whether catalog Hadoop overrides can make it from `SparkUtil#hadoopConfCatalogOverrides` (lines 195-212) in https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java to the `HiveTableOperations` locks in https://github.com/apache/iceberg/blob/master/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

For instance, my Iceberg Hive catalog is defined as `spark.sql.catalog.iceberg`, so I tried the following:

```
spark.sql.catalog.iceberg.hadoop.iceberg.hive.metadata-refresh-max-retries  500
spark.sql.catalog.iceberg.hadoop.iceberg.hive.lock-timeout-ms               8000000
spark.sql.catalog.iceberg.hadoop.iceberg.hive.lock-creation-timeout-ms      8000000
```

I've also tried the following catalog settings regarding locking, discussed in https://github.com/apache/iceberg/blob/fede493d59f17ff2bfc0744b296d90bd36130386/docs/configuration.md, but they don't seem to have any impact:

```
spark.sql.catalog.iceberg.lock.acquire-interval-ms  6000
spark.sql.catalog.iceberg.lock.acquire-timeout-ms   800000
```

(A consolidated sketch of everything I tried is at the end of this issue.)

I think it would really be worth breaking down Iceberg Hive table locks in their own Spark-side section of the general docs, and I would stress that making power users dig through documentation to find important behavior, like JVM locks for multithreaded writes in single Spark jobs, is not ideal (I'm not using a single multithreaded driver to write, so this does not impact me). I saw that consensus in https://github.com/apache/iceberg/pull/2547

I'd be happy to make a detailed contribution on the doc side for Spark once I am able to make some meaningful progress on this. Thanks for all the hard work on this project!
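For completeness, here is a minimal Scala sketch consolidating the routes I tried in one `SparkSession` builder. The property keys are the ones from the Iceberg configuration docs; the catalog name `iceberg` and the Hive catalog wiring (`SparkCatalog`, `type=hive`) reflect my setup, and whether either route actually propagates to `HiveTableOperations` is exactly what I'm asking:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the two configuration routes described above; neither appears
// to affect Hive lock behavior in my testing.
val spark = SparkSession.builder()
  .appName("iceberg-hive-lock-tuning")
  // Route 1: global Hadoop conf overrides via Spark's spark.hadoop.* prefix.
  .config("spark.hadoop.iceberg.hive.metadata-refresh-max-retries", "60")
  .config("spark.hadoop.iceberg.hive.lock-timeout-ms", "800000")
  .config("spark.hadoop.iceberg.hive.lock-creation-timeout-ms", "800000")
  // My Hive-backed Iceberg catalog, registered as spark.sql.catalog.iceberg.
  .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg.type", "hive")
  // Route 2: per-catalog Hadoop conf overrides, which I'd expect
  // SparkUtil#hadoopConfCatalogOverrides to strip down to the bare
  // iceberg.hive.* keys on the catalog's Hadoop Configuration.
  .config("spark.sql.catalog.iceberg.hadoop.iceberg.hive.metadata-refresh-max-retries", "500")
  .config("spark.sql.catalog.iceberg.hadoop.iceberg.hive.lock-timeout-ms", "8000000")
  .config("spark.sql.catalog.iceberg.hadoop.iceberg.hive.lock-creation-timeout-ms", "8000000")
  // Catalog-level lock properties from the configuration docs (no visible effect).
  .config("spark.sql.catalog.iceberg.lock.acquire-interval-ms", "6000")
  .config("spark.sql.catalog.iceberg.lock.acquire-timeout-ms", "800000")
  .getOrCreate()
```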