rahil-c commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3069732092
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##########
@@ -70,6 +70,22 @@ public String getFileExtension() {
return extension;
}
+ /**
+ * Returns true if this file format requires the SPARK record type for reading/writing.
+ * Lance only supports the Spark-native InternalRow representation, not Avro.
+ */
+ public boolean requiresSparkRecordType() {
+ return this == LANCE;
+ }
+
+ /**
Review Comment:
**[High]** `resolveRecordType()` fixes the I/O layer (reader/writer factory
selection), but every test must still manually set
`hoodie.write.record.merge.custom.implementation.classes=DefaultSparkRecordMerger`
because the **merger** is selected at config-initialization time, before the
file format is known.
This creates a usability gap: users who set `baseFileFormat=LANCE` but forget
the merger property will hit a confusing runtime error. Could the merger
selection logic (`HoodieRecordUtils.createRecordMerger`) also consult the
table's base file format and auto-select a Spark merger when Lance is
configured? That would make `resolveRecordType` a complete solution rather than
a partial one.
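To make the suggestion concrete, here is a minimal, self-contained sketch of format-driven merger defaulting. The enum and class names are stand-ins that mirror `requiresSparkRecordType()` from this PR; they are not the real Hudi wiring, and the merger class names are illustrative only:

```java
public class MergerSelection {
  // Stand-in for HoodieFileFormat; requiresSparkRecordType mirrors this PR.
  public enum FileFormat {
    PARQUET, ORC, LANCE;
    public boolean requiresSparkRecordType() {
      return this == LANCE;
    }
  }

  /**
   * Hypothetical helper: pick a merger class, letting the base file format
   * drive the default when the user has not configured one explicitly.
   */
  public static String resolveMergerClass(String configured, FileFormat baseFormat) {
    // An explicitly configured merger always wins.
    if (configured != null && !configured.isEmpty()) {
      return configured;
    }
    // Lance only supports Spark's InternalRow, so default to the Spark merger;
    // otherwise fall back to the Avro-based merger.
    return baseFormat.requiresSparkRecordType()
        ? "org.apache.hudi.DefaultSparkRecordMerger"
        : "org.apache.hudi.common.model.HoodieAvroRecordMerger";
  }

  public static void main(String[] args) {
    System.out.println(resolveMergerClass(null, FileFormat.LANCE));
    System.out.println(resolveMergerClass(null, FileFormat.PARQUET));
  }
}
```

The actual change would live in `HoodieRecordUtils.createRecordMerger`, which has access to the table config and could apply the same precedence: explicit user config first, then a format-derived default.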
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -76,8 +76,21 @@ class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumna
val filePath = file.filePath.toString
if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
- // No columns requested - return empty iterator
- Iterator.empty
+ // No column data needed (e.g. count() on a non-partitioned table). Open the Lance file to
+ // get the actual row count and return that many empty rows, so Spark computes the correct count.
+ val countAllocator = HoodieArrowAllocator.newChildAllocator(
+ getClass.getSimpleName + "-count-" + file.filePath, HoodieSparkLanceReader.LANCE_DATA_ALLOCATOR_SIZE)
+ try {
+ val lanceReader = LanceFileReader.open(file.filePath.toString, countAllocator)
+ try {
+ val rowCount = Math.toIntExact(lanceReader.numRows())
+ Iterator.fill(rowCount)(InternalRow.empty)
Review Comment:
**[Medium]** `Math.toIntExact(lanceReader.numRows())` will throw
`ArithmeticException` if the file exceeds ~2.1B rows. While unlikely for a
single file, the limit is implicit and undocumented here. Consider an iterator
driven by a `Long` counter instead of materializing an `Int` count for
`Iterator.fill`, or add a guard with a clear error message.
Also, this opens and closes a full `LanceFileReader` per file just to get
the row count for `count(*)`. If Lance exposes the row count from the file
metadata/footer without opening a full reader, that would be more efficient
for large tables with many small files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]