Blake-Guo opened a new issue, #6613: URL: https://github.com/apache/iceberg/issues/6613
### Apache Iceberg version

0.12.1

### Query engine

Spark

### Please describe the bug 🐞

I have been exploring using multiple SparkSessions (to connect to different data sources/clusters) to load Iceberg tables, and I found some weird behavior. If I use a **new** SparkSession (with an incorrect parameter such as `spark.sql.catalog.mycatalog.uri`) to access a table created by the previous SparkSession, first through (1) `spark.read()...load()` and then (2) by running some SQL on that table, everything still works (even with the incorrect parameter). The full test is given below:

```java
@Test
public void multipleSparkSessions() throws AnalysisException {
  // Create the 1st SparkSession
  String endpoint = String.format("http://localhost:%s/metastore", port);
  ctx = SparkSession
      .builder()
      .master("local")
      .config("spark.ui.enabled", false)
      .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mycatalog.type", "hive")
      .config("spark.sql.catalog.mycatalog.uri", endpoint)
      .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .getOrCreate();

  // Create a table with the SparkSession
  String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
  ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
      + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));

  // Create a new SparkSession with an incorrect catalog URI
  SparkSession newSession = ctx.newSession();
  newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");

  // Access the table created above with the new SparkSession through session.read()...load()
  List<Row> dataset2 = newSession.read()
      .format("iceberg")
      .load(String.format("mycatalog.%s", tableName))
      .collectAsList();
  dataset2.forEach(r -> System.out.println(r));

  // Access the table through SQL
  newSession.sql(String.format("select * from mycatalog.%s", tableName)).collectAsList();
}
```

But if I use the new SparkSession to access the table through (1) `newSession.sql` first, the execution fails, and then (2) the `read()...load()` call fails as well with `java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679`. IMO this makes more sense: since I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate that table.

```java
@Test
public void multipleSparkSessions() throws AnalysisException {
  // ...same as above...

  // Access the table through SQL first
  assertThrows(java.lang.RuntimeException.class, () -> newSession.sql(
      String.format("select * from mycatalog.%s", tableName)).collectAsList());

  // Then access it with the new SparkSession through session.read()...load()
  assertThrows(java.lang.RuntimeException.class, () -> newSession.read()
      .format("iceberg")
      .load(String.format("mycatalog.%s", tableName))
      .collectAsList());
}
```

Any idea what could lead to these two different behaviors with `spark.read().load()` versus `spark.sql()` when they are called in different orders?
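A minimal probe along these lines (a sketch only, reusing `ctx`, `newSession`, and `tableName` from the test above; the `SHOW NAMESPACES` statement is just one assumed way to force the new session to resolve the catalog) might help show which configuration each code path actually ends up using:

```java
// Effective URI as reported by each session's runtime config
System.out.println("old session uri: " + ctx.conf().get("spark.sql.catalog.mycatalog.uri"));
System.out.println("new session uri: " + newSession.conf().get("spark.sql.catalog.mycatalog.uri"));

// Force the new session to resolve the catalog before any read()/load();
// if the bad URI is really in effect here, this should already fail.
newSession.sql("SHOW NAMESPACES IN mycatalog").show();

// Then try the DataFrameReader path again with the same session
newSession.read()
    .format("iceberg")
    .load(String.format("mycatalog.%s", tableName))
    .show();
```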