[GitHub] [iceberg] maximethebault opened a new issue, #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION

GitBox Sat, 19 Nov 2022 09:48:18 -0800


maximethebault opened a new issue, #6224:
URL: https://github.com/apache/iceberg/issues/6224


   ### Apache Iceberg version
   
   1.0.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   After upgrading to Iceberg 1.0.0 & Spark 3.3.1 (from 0.13.x & 3.2.x), some of our SQL queries stopped working.
   
   We suspect it is an Iceberg-related issue, as we couldn't reproduce it with Hive tables.
   
   ### Stripped-down reproducer
   
   Set up the tables & views:
   ```
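   // toDF needs the session implicits (already in scope in spark-shell)
   import spark.implicits._
   
   // Plain temp view, not backed by Iceberg
   val table1 = Seq(("204")).toDF("id")
   table1.createOrReplaceTempView("table1")
   
   // Iceberg table
   val table2_1 = Seq(("204")).toDF("id")
   table2_1.writeTo("dev.table2_1").using("iceberg").createOrReplace()
   
   // Second plain temp view
   val table2_2 = Seq(("204")).toDF("id")
   table2_2.createOrReplaceTempView("table2_2")
   
   // Union of the Iceberg table and the temp view, exposed as a temp view
   val table2 = spark.table("dev.table2_1").union(spark.table("table2_2"))
   table2.createOrReplaceTempView("table2")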
   ```
   
   Run the query:
   ```
   SELECT 
           u.*
       FROM 
           table1
       LEFT JOIN
           (
           SELECT 
               id
           FROM 
               table1
           LEFT JOIN
               table2
           USING(id)
           ) u 
       USING(id)
   ```
   
   Results in an exception:
   
   ```
   java.lang.IllegalArgumentException: requirement failed
     at scala.Predef$.require(Predef.scala:268)
     at org.apache.spark.sql.catalyst.plans.logical.View.<init>(basicLogicalOperators.scala:569)
     at org.apache.spark.sql.catalyst.plans.logical.View.copy(basicLogicalOperators.scala:568)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:604)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:565)
     at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1242)
     at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1240)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildrenInternal(basicLogicalOperators.scala:565)
     at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:462)
     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
     at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:461)
     at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.org$apache$spark$sql$catalyst$analysis$Analyzer$AddMetadataColumns$$addMetadataCol(Analyzer.scala:975)
     at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$addMetadataCol$1(Analyzer.scala:975)
   ```
   
   ### Further investigation
   
   If I replace `USING` with ordinary `ON` clauses, the exception is not thrown.
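   
   One possible form of that rewrite is sketched below. It assumes the `spark` session and the temp views from the reproducer; the aliases `t`, `a` and `b` are introduced here only for illustration, and the exact `ON` query that was tried is not shown in this report. With `ON`, the `id` column has to be qualified, since both join sides expose it.
   
   ```scala
   // Sketch only: assumes `spark`, table1 and table2 from the reproducer above.
   // Aliases t, a, b are illustrative and not part of the original query.
   val rewritten = spark.sql("""
     SELECT
         u.*
     FROM
         table1 t
     LEFT JOIN
         (
         SELECT
             a.id
         FROM
             table1 a
         LEFT JOIN
             table2 b
         ON a.id = b.id
         ) u
     ON t.id = u.id
   """)
   rewritten.show()
   ```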
   
   I think this issue is caused by mixing Iceberg and non-Iceberg tables in the UNION.
   
   If I inline table2 in the query, I get a different exception:
   
   ```
   SELECT 
       u.*
   FROM 
       table1
   LEFT JOIN
       (
       SELECT 
           id
       FROM 
           table1
       LEFT JOIN
           ((SELECT id id FROM dev.table2_1 limit 1) UNION (SELECT id FROM table2_2))
       USING(id)
       ) u 
   USING(id)
   ```
   
   results in:
   
   ```
   org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 1 columns;
   'Project [id#1302]
   +- 'Project [id#1302, id#1302]
      +- 'Project [id#1302, id#998]
         +- 'Join LeftOuter, (id#998 = id#1302)
            :- SubqueryAlias table1
            :  +- View (`table1`, [id#998])
            :     +- Project [value#995 AS id#998]
            :        +- LocalRelation [value#995]
            +- 'SubqueryAlias u
               +- 'Project [id#1294, id#1302]
                  +- 'Project [id#1294, id#1302]
                     +- 'Join LeftOuter, (id#1302 = id#1294)
                        :- SubqueryAlias table1
                        :  +- View (`table1`, [id#1302])
                        :     +- Project [value#1296 AS id#1302]
                        :        +- LocalRelation [value#1296]
                        +- 'SubqueryAlias __auto_generated_subquery_name
                           +- 'Distinct
                              +- 'Union false, false
                                 :- GlobalLimit 1
                                 :  +- LocalLimit 1
                                 :     +- Project [_spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301, id#1295 AS id#1294]
                                 :        +- SubqueryAlias spark_catalog.dev.table2_1
                                 :           +- RelationV2[id#1295, _spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301] spark_catalog.dev.table2_1
                                 +- Project [id#1011]
                                    +- SubqueryAlias table2_2
                                       +- View (`table2_2`, [id#1011])
                                          +- Project [value#1008 AS id#1011]
                                             +- LocalRelation [value#1008]
   ```
   
   It looks like some Iceberg metadata columns are visible to Spark during query analysis, and I'm not sure they are supposed to be.
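   
   A rough way to check this is to compare the table's declared schema with an explicit metadata-column query. This is a sketch, not a confirmed diagnosis: the column names `_file` and `_pos` are taken from the plan above, and it is an assumption that they can be selected by name in this setup.
   
   ```scala
   // Sketch only: assumes `spark` and the dev.table2_1 table from the reproducer.
   // Declared schema; the reproducer wrote a single user column, `id`.
   spark.table("dev.table2_1").printSchema()
   
   // Metadata columns requested explicitly (names copied from the plan above).
   spark.sql("SELECT id, _file, _pos FROM dev.table2_1").show(false)
   ```
   
   If the schema lists only `id` while the failing plan projects six columns from `dev.table2_1`, the five extra columns are being added during analysis rather than coming from the table's schema.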


