Repository: spark
Updated Branches:
  refs/heads/branch-1.3 3899c7c2c -> 866f2814a


[SPARK-6082] [SQL] Provides better error message for malformed rows when 
caching tables

Constructs like Hive `TRANSFORM` may generate malformed rows (for example, 
via badly authored external scripts). I'm a bit hesitant to add this 
feature, since it introduces a per-tuple cost when caching tables. However, 
since caching a table is usually a one-time cost, this is probably worth 
having.
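
For context, here is a minimal reproduction sketch (not part of this patch). 
The application name, the `to_upper.py` script, and the `src` table are 
illustrative assumptions only:

// Hypothetical sketch for SPARK-6082: cache a table produced by Hive TRANSFORM.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Spark6082Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-6082-sketch"))
    val hiveContext = new HiveContext(sc)

    // Hive TRANSFORM delegates row generation to an external script. If the
    // script emits a line with the wrong number of fields, the row is malformed.
    hiveContext
      .sql("SELECT TRANSFORM (key, value) USING 'to_upper.py' AS (key, value) FROM src")
      .registerTempTable("transformed")

    // Caching materializes every row into per-column builders. Previously, a
    // malformed row surfaced as an ArrayIndexOutOfBoundsException deep inside
    // InMemoryColumnarTableScan; with this patch, the assertion reports the
    // expected column count and the offending row instead.
    hiveContext.cacheTable("transformed")
    hiveContext.sql("SELECT COUNT(*) FROM transformed").collect()
  }
}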


Author: Cheng Lian <[email protected]>

Closes #4842 from liancheng/spark-6082 and squashes the following commits:

b05dbff [Cheng Lian] Provides better error message for malformed rows when 
caching tables

(cherry picked from commit 1a49496b4a9df40c74739fc0fb8a21c88a477075)
Signed-off-by: Michael Armbrust <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/866f2814
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/866f2814
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/866f2814

Branch: refs/heads/branch-1.3
Commit: 866f2814a48a34820da9069378c2cbbb3589fb0f
Parents: 3899c7c
Author: Cheng Lian <[email protected]>
Authored: Mon Mar 2 16:18:00 2015 -0800
Committer: Michael Armbrust <[email protected]>
Committed: Mon Mar 2 16:18:10 2015 -0800

----------------------------------------------------------------------
 .../spark/sql/columnar/InMemoryColumnarTableScan.scala   | 11 +++++++++++
 1 file changed, 11 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/866f2814/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala b/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
index 11d5943..8944a32 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
@@ -119,6 +119,17 @@ private[sql] case class InMemoryRelation(
           var rowCount = 0
           while (rowIterator.hasNext && rowCount < batchSize) {
             val row = rowIterator.next()
+
+            // Added for SPARK-6082. This assertion can be useful for scenarios when something
+            // like Hive TRANSFORM is used. The external data generation script used in TRANSFORM
+            // may result in malformed rows, causing ArrayIndexOutOfBoundsException, which is
+            // somewhat hard to decipher.
+            assert(
+              row.size == columnBuilders.size,
+              s"""Row column number mismatch, expected ${output.size} columns, but got ${row.size}.
+                 |Row content: $row
+               """.stripMargin)
+
             var i = 0
             while (i < row.length) {
               columnBuilders(i).appendFrom(row, i)

