rdblue commented on code in PR #9423:
URL: https://github.com/apache/iceberg/pull/9423#discussion_r1464142918


##########
spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala:
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.spark.Spark3Util
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.ViewAlreadyExistsException
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.ViewCatalog
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.CommandExecutionMode
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.MetadataBuilder
+import org.apache.spark.sql.util.SchemaUtils
+import scala.collection.JavaConverters._
+
+
+case class CreateV2ViewExec(
+  catalog: ViewCatalog,
+  ident: Identifier,
+  originalText: String,
+  query: LogicalPlan,
+  userSpecifiedColumns: Seq[(String, Option[String])],
+  comment: Option[String],
+  properties: Map[String, String],
+  allowExisting: Boolean,
+  replace: Boolean) extends LeafV2CommandExec {
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    val qe = session.sessionState.executePlan(query, CommandExecutionMode.SKIP)
+    qe.assertAnalyzed()
+    val analyzedPlan = qe.analyzed
+
+    val identifier = Spark3Util.toV1TableIdentifier(ident)
+
+    if (userSpecifiedColumns.nonEmpty) {
+      if (userSpecifiedColumns.length > analyzedPlan.output.length) {
+        throw QueryCompilationErrors.cannotCreateViewNotEnoughColumnsError(
+          identifier, userSpecifiedColumns.map(_._1), analyzedPlan)
+      } else if (userSpecifiedColumns.length < analyzedPlan.output.length) {
+        throw QueryCompilationErrors.cannotCreateViewTooManyColumnsError(
+          identifier, userSpecifiedColumns.map(_._1), analyzedPlan)
+      }
+    }
+
+    val queryColumnNames = analyzedPlan.schema.fieldNames
+    SchemaUtils.checkColumnNameDuplication(queryColumnNames, SQLConf.get.resolver)

Review Comment:
   After looking into this a bit more, `query` does correspond to 
`originalText`. That's good because we don't have to call the parser separately 
and this ensures that the query can be analyzed correctly. We should keep this 
design for the logical plan.
   
   I think there are a few things that we can improve, though. Since the
   query plan is linked in like this, it will be analyzed automatically, so
   this should already catch things like projecting a missing column (by the
   way, we should have a test for that). It will also catch missing relations
   (tables and other views), but we probably need to handle resolution a bit
   differently: we want view resolution to happen just as it would when
   loading a view. I think that means we should call our own code from
   `createViewRelation` to rewrite identifiers. (Ideally, we could apply the
   aliases here as well, but that may need special handling if
   `GetColumnByOrdinal` doesn't work with an undetermined type.)
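   
   If aliasing at parse time doesn't work out, here is a rough sketch (not
   the final implementation) of applying the user-specified column names and
   comments once the plan is resolved, using a plain `Project` instead of
   `GetColumnByOrdinal`; `applyColumnAliases` is a placeholder name:
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Alias
   import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
   import org.apache.spark.sql.types.{Metadata, MetadataBuilder}
   
   // Sketch only: alias the resolved query output with the user-specified
   // column names, carrying optional column comments in the alias metadata.
   // Assumes the column count was already validated against the output.
   def applyColumnAliases(
       resolved: LogicalPlan,
       columns: Seq[(String, Option[String])]): LogicalPlan = {
     if (columns.isEmpty) {
       resolved
     } else {
       val aliases = resolved.output.zip(columns).map { case (attr, (name, comment)) =>
         val metadata = comment.map { text =>
           new MetadataBuilder().putString("comment", text).build()
         }.getOrElse(Metadata.empty)
         Alias(attr, name)(explicitMetadata = Some(metadata))
       }
       Project(aliases, resolved)
     }
   }
   ```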
   
   Rewriting identifiers needs to happen immediately so that Spark doesn't
   substitute any temporary views. I think this needs to be done _before_
   `ResolveViews`, so we should do it in `RewriteViewCommands`.
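   
   Something along these lines could work in `RewriteViewCommands`. This is a
   loose sketch; `isTempView` and `qualifyIdentifier` are placeholders for
   whatever `createViewRelation` actually uses, not existing APIs:
   
   ```scala
   import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // Sketch only: qualify relation identifiers in the parsed view query
   // before the analyzer gets a chance to substitute temporary views.
   def rewriteIdentifiers(
       parsed: LogicalPlan,
       isTempView: Seq[String] => Boolean,
       qualifyIdentifier: Seq[String] => Seq[String]): LogicalPlan = {
     parsed.resolveOperators {
       case u @ UnresolvedRelation(nameParts, _, _) if !isTempView(nameParts) =>
         u.copy(multipartIdentifier = qualifyIdentifier(nameParts))
     }
   }
   ```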
   
   To summarize:
   * `RewriteViewCommands` should call our code to rewrite identifiers in the parsed `query` (and alias columns, if possible)
   * If `RewriteViewCommands` can't apply the aliases, then we will need a rule that applies them once the `query` is `resolved`
   * `RewriteViewCommands` should also check for temporary view references and fail if any are found (and maybe temporary functions as well?); see the sketch after this list
   * Checking for the right number of columns should be done in the checker (validation phase), if it isn't already, because we use `GetColumnByOrdinal` for aliases
   * We should drop the `query` when possible, because it shouldn't be needed after the analysis phase. Instead, we should get the schema of the query (after aliasing) and pass that along.
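   
   For the temporary-object check, a minimal sketch (again, `isTempView`
   stands in for the session catalog lookup, and a real rule would raise a
   proper analysis error instead of using `require`):
   
   ```scala
   import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // Sketch only: collect relations in the parsed plan that resolve to
   // temporary views and fail fast if there are any. A full version would
   // also need to look inside subquery expressions.
   def checkNoTemporaryViews(
       parsed: LogicalPlan,
       isTempView: Seq[String] => Boolean): Unit = {
     val tempViews = parsed.collect {
       case UnresolvedRelation(nameParts, _, _) if isTempView(nameParts) =>
         nameParts.mkString(".")
     }
     require(tempViews.isEmpty,
       "Cannot create a persistent view that references temporary views: " +
         tempViews.mkString(", "))
   }
   ```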


