rahil-c commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3069732092
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##########
@@ -70,6 +70,22 @@ public String getFileExtension() {
return extension;
}
+ /**
+ * Returns true if this file format requires the SPARK record type for reading/writing.
+ * Lance only supports the Spark-native InternalRow representation, not Avro.
+ */
+ public boolean requiresSparkRecordType() {
+ return this == LANCE;
+ }
+
+ /**
Review Comment:
**[High]** `resolveRecordType()` fixes the I/O layer (reader/writer factory
selection), but every test must still manually set
`hoodie.write.record.merge.custom.implementation.classes=DefaultSparkRecordMerger`
because the **merger** is selected at config-initialization time, before the
file format is known.
This creates a usability gap: users who set `baseFileFormat=LANCE` but forget
the merger property will hit a confusing runtime error. Could the merger
selection logic (`HoodieRecordUtils.createRecordMerger`) also consult the
table's base file format and auto-select a Spark merger when Lance is
configured? That would make `resolveRecordType` a complete solution rather than
a partial one.
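To make the suggestion concrete, here is a minimal, self-contained sketch of format-driven merger defaulting. The enum and class names are stand-ins that mirror `requiresSparkRecordType()` from this PR; they are not the real Hudi wiring, and the merger class names are illustrative only:

```java
public class MergerSelection {
  // Stand-in for HoodieFileFormat; requiresSparkRecordType mirrors this PR.
  public enum FileFormat {
    PARQUET, ORC, LANCE;
    public boolean requiresSparkRecordType() {
      return this == LANCE;
    }
  }

  /**
   * Hypothetical helper: pick a merger class, letting the base file format
   * drive the default when the user has not configured one explicitly.
   */
  public static String resolveMergerClass(String configured, FileFormat baseFormat) {
    // An explicitly configured merger always wins.
    if (configured != null && !configured.isEmpty()) {
      return configured;
    }
    // Lance only supports Spark's InternalRow, so default to the Spark merger;
    // otherwise fall back to the Avro-based merger.
    return baseFormat.requiresSparkRecordType()
        ? "org.apache.hudi.DefaultSparkRecordMerger"
        : "org.apache.hudi.common.model.HoodieAvroRecordMerger";
  }

  public static void main(String[] args) {
    System.out.println(resolveMergerClass(null, FileFormat.LANCE));
    System.out.println(resolveMergerClass(null, FileFormat.PARQUET));
  }
}
```

The actual change would live in `HoodieRecordUtils.createRecordMerger`, which has access to the table config and could apply the same precedence: explicit user config first, then a format-derived default.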
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -76,8 +76,21 @@ class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumna
val filePath = file.filePath.toString
if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
- // No columns requested - return empty iterator
- Iterator.empty
+ // No column data needed (e.g. count() on a non-partitioned table). Open the Lance file to
+ // get the actual row count and return that many empty rows, so Spark computes the correct count.
+ val countAllocator = HoodieArrowAllocator.newChildAllocator(
+ getClass.getSimpleName + "-count-" + file.filePath, HoodieSparkLanceReader.LANCE_DATA_ALLOCATOR_SIZE)
+ try {
+ val lanceReader = LanceFileReader.open(file.filePath.toString, countAllocator)
+ try {
+ val rowCount = Math.toIntExact(lanceReader.numRows())
+ Iterator.fill(rowCount)(InternalRow.empty)
Review Comment:
**[Medium]** `Math.toIntExact(lanceReader.numRows())` will throw
`ArithmeticException` if the file exceeds ~2.1B rows. While unlikely for a
single file, the limit is implicit and undocumented here. Consider an iterator
driven by a `Long` counter instead of materializing an `Int` count for
`Iterator.fill`, or add a guard with a clear error message.
Also, this opens and closes a full `LanceFileReader` per file just to get
the row count for `count(*)`. If Lance exposes the row count from the file
metadata/footer without opening a full reader, that would be more efficient
for large tables with many small files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]