yihua commented on code in PR #18455: URL: https://github.com/apache/hudi/pull/18455#discussion_r3082898156
########## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/GlutenTestUtils.java: ########## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.testutils; + +import org.apache.spark.SparkConf; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Utility for injecting Gluten/Velox native execution into a test {@link SparkConf}. + * + * <p>Activated by passing {@code -Dgluten.bundle.jar=<path>} at test time. + * If {@code ai.onehouse.quanton.QuantonPlugin} is on the classpath it is preferred; + * otherwise {@code org.apache.gluten.GlutenPlugin} is used. + */ +public class GlutenTestUtils { + + private static final Logger LOG = LoggerFactory.getLogger(GlutenTestUtils.class); + + private GlutenTestUtils() {} + + /** + * Applies Gluten/Velox native-execution settings to {@code sparkConf} when the + * {@code gluten.bundle.jar} system property is set. No-op otherwise. 
+ */ + public static void applyGlutenConf(SparkConf sparkConf) { + String glutenBundleJar = System.getProperty("gluten.bundle.jar"); + if (glutenBundleJar == null || glutenBundleJar.isEmpty()) { + return; + } + + String pluginClass; + try { + Class.forName("ai.onehouse.quanton.QuantonPlugin"); + pluginClass = "ai.onehouse.quanton.QuantonPlugin"; + } catch (ClassNotFoundException e) { + pluginClass = "org.apache.gluten.GlutenPlugin"; + } Review Comment: _⚠️ Potential issue_ | _🟠 Major_ <details> <summary>🧩 Analysis chain</summary> 🏁 Script executed: ```shell fd -t f "GlutenTestUtils.java" ``` Repository: yihua/hudi Length of output: 144 --- 🏁 Script executed: ```shell cat -n hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/GlutenTestUtils.java ``` Repository: yihua/hudi Length of output: 3358 --- 🏁 Script executed: ```shell # Search for similar validation patterns in the codebase for system properties rg "System.getProperty" --type java -A 5 -B 2 | head -100 ``` Repository: yihua/hudi Length of output: 13994 --- 🏁 Script executed: ```shell # Check if there are any tests or usages of applyGlutenConf rg "applyGlutenConf" --type java -B 2 -A 5 ``` Repository: yihua/hudi Length of output: 3052 --- </details> **Add fail-fast validation for `gluten.bundle.jar` property and fallback plugin class.** The method currently treats `gluten.bundle.jar` as a boolean flag only. A typo in the path or missing classpath wiring still configures SparkConf and defers the failure to Spark startup, making test diagnostics harder. Validate the file path and ensure the fallback `GlutenPlugin` class exists before enabling the plugin mode. 
<details> <summary>Suggested fail-fast check</summary> ```diff +import java.io.File; + public static void applyGlutenConf(SparkConf sparkConf) { String glutenBundleJar = System.getProperty("gluten.bundle.jar"); if (glutenBundleJar == null || glutenBundleJar.isEmpty()) { return; } + if (!new File(glutenBundleJar).isFile()) { + throw new IllegalArgumentException("gluten.bundle.jar does not point to an existing file: " + glutenBundleJar); + } String pluginClass; try { Class.forName("ai.onehouse.quanton.QuantonPlugin"); pluginClass = "ai.onehouse.quanton.QuantonPlugin"; } catch (ClassNotFoundException e) { - pluginClass = "org.apache.gluten.GlutenPlugin"; + try { + Class.forName("org.apache.gluten.GlutenPlugin"); + pluginClass = "org.apache.gluten.GlutenPlugin"; + } catch (ClassNotFoundException e2) { + throw new IllegalStateException( + "gluten.bundle.jar is set but neither Quanton nor Gluten plugin is on the test classpath", + e2); + } } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In `@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/GlutenTestUtils.java` around lines 42 - 54, The applyGlutenConf method currently treats the gluten.bundle.jar string as a flag; update it to fail-fast by validating the path and plugin class before mutating SparkConf: 1) check that the glutenBundleJar string is non-empty and points to an existing readable file (e.g., new File(glutenBundleJar).isFile()) and if not, throw an IllegalArgumentException or return early so SparkConf is not modified; 2) resolve the plugin class by first trying Class.forName("ai.onehouse.quanton.QuantonPlugin") and if that fails set pluginClass = "org.apache.gluten.GlutenPlugin", then also verify the fallback exists via Class.forName(pluginClass) and return/throw if it does not; 3) only after both the jar file and the chosen pluginClass are validated, set the SparkConf entries (the code that writes to sparkConf) so misconfigured paths or missing classes are detected immediately in applyGlutenConf. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895725)) (source:comment#3082895725) ########## pom.xml: ########## @@ -170,7 +170,7 @@ <rocksdbjni.version>7.5.3</rocksdbjni.version> Review Comment: <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Internal Onehouse Spark build version in OSS project** The Spark 3.5 version has been changed from the public `3.5.5` to `3.5.2-INTERNAL-0.16`, an Onehouse-internal build that is only resolvable via the private CodeArtifact repository added below (line ~1966). This will break the build for every developer and CI environment that does not have access to that repository. This must be reverted to an official Apache Spark release before merging to the OSS project. 
— *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852612)) (source:comment#3082852612) ########## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala: ########## @@ -103,6 +103,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi // Additionally, we have to explicitly wrap around resulting [[RDD]] into the one // injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s). // [[SQLConf]] is required by [[AvroSerializer]] + logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") Review Comment: <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Debug warning log left in production code** This `logWarning` call prints the entire executed query plan (as a tree string) at WARN level every time `createRdd` is called. This is clearly a debug statement added during the Gluten/Velox investigation and must not reach production: - It will spam the logs for every MOR read in every Hudi job. - `treeString` is non-trivial to compute and allocates for every partition scan. - WARN level means it will appear in most production logging configurations. This line should be removed entirely before merging. — *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852557)) (source:comment#3082852557) ########## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala: ########## @@ -103,6 +103,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi // Additionally, we have to explicitly wrap around resulting [[RDD]] into the one // injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s). 
// [[SQLConf]] is required by [[AvroSerializer]] + logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **Avoid warning-level plan dumps on the `createRdd` hot path.** `createRdd` is on the write path, and `executedPlan.treeString` eagerly materializes the full plan string. Logging that at `WARN` for every call will add a lot of noise and overhead in production. Please keep this behind `logDebug` or a dedicated diagnostic flag instead. <details> <summary>Suggested change</summary> ```diff - logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") + logDebug(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") ``` </details> <!-- suggestion_start --> <details> <summary>📝 Committable suggestion</summary> > ‼️ **IMPORTANT** > Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements. ```suggestion logDebug(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") ``` </details> <!-- suggestion_end --> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala` at line 106, The current logWarning in HoodieSparkUtils.createRdd eagerly materializes df.queryExecution.executedPlan.treeString on a hot write path; change it to avoid WARN-level plan dumps by either moving the message to logDebug or gating it behind a diagnostic flag. 
Locate the logWarning call referencing df.queryExecution.executedPlan.treeString in createRdd and replace it so the plan string is only generated when logDebug is enabled or when a new boolean diagnostic flag (e.g., enablePlanDump) is true; ensure you check logger.isDebugEnabled (or the flag) before calling treeString to prevent unnecessary work. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:79e7aa6f-65c0-490f-bb56-4d8aab80feac --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895711)) (source:comment#3082895711) ########## RCA_Gluten_Velox_MOR_Failures.md: ########## @@ -0,0 +1,373 @@ +# RCA: Gluten/Velox MOR Test Failures + +Three sub-tests fail when Gluten/Velox native execution is enabled: +`testMORWithMap`, `testMORWithDecimal`, `testMORWithArray`. + Review Comment: <a href="#"><img alt="P2" src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7" align="top"></a> **Internal investigation document should not be committed to the repository** `RCA_Gluten_Velox_MOR_Failures.md` is a root-cause analysis document for a specific Gluten/Velox testing investigation. It references internal stack traces, log evidence, and Onehouse-specific infrastructure. It is not part of the project's documentation and will confuse future readers of the repository. This file should not be committed to the OSS repository. If it needs to be preserved for reference, it belongs in an internal wiki or issue tracker rather than the source tree. 
— *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082853179)) (source:comment#3082853179) ########## hudi-common/src/test/java/org/apache/hudi/common/table/read/buffer/TestKeyBasedFileGroupRecordBuffer.java: ########## @@ -176,6 +178,7 @@ void readWithEventTimeOrderingWithRecords() throws IOException { assertEquals(3, readStats.getNumUpdates()); } + @Disabled("Custom delete payload not supported") @Test void readWithCommitTimeOrdering() throws IOException { Review Comment: _⚠️ Potential issue_ | _🟡 Minor_ **Disabled reason may be misleading for test name.** The test `readWithCommitTimeOrdering` is disabled with "Custom delete payload not supported", which is technically correct (the test uses custom delete markers), but the test name suggests it's primarily about commit time ordering behavior. Consider clarifying the disable reason or renaming the test if custom delete payload support is the actual blocker. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/test/java/org/apache/hudi/common/table/read/buffer/TestKeyBasedFileGroupRecordBuffer.java` around lines 181 - 183, The `@Disabled` reason on the test method readWithCommitTimeOrdering is misleading; update the disable annotation or the test name so the intent is clear: either change the `@Disabled` message to explicitly state "Uses custom delete payload (not supported)" or rename the method (e.g., to readWithCommitTimeOrderingAndCustomDeletePayload) to reflect that the failure is due to custom delete payloads, not commit-time ordering. Locate the test method readWithCommitTimeOrdering and apply one of these changes so the disable reason and test identifier align. 
``` </details> <!-- fingerprinting:phantom:medusa:ocelot --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895734)) (source:comment#3082895734) ########## pom.xml: ########## @@ -1963,6 +1963,11 @@ <id>confluent</id> <url>https://packages.confluent.io/maven/</url> </repository> + <repository> + <id>codeartifact</id> + <name>codeartifact</name> Review Comment: <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Private internal Maven repository in OSS project** The repository at `https://onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com/maven/onehouse-internal/` is a private AWS CodeArtifact instance owned by Onehouse. Adding it to the root `pom.xml` means every Maven build of the project will attempt to resolve artifacts from a private, unauthenticated endpoint — this will fail for all external contributors and CI systems outside of Onehouse infrastructure. Both this repository entry and the internal Spark version (`3.5.2-INTERNAL-0.16`) that depends on it must be removed. — *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852660)) (source:comment#3082852660) ########## hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java: ########## @@ -449,8 +449,7 @@ public void testReadFileGroupWithMultipleOrderingFields() throws Exception { private static Stream<Arguments> logFileOnlyCases() { return Stream.of( arguments(RecordMergeMode.COMMIT_TIME_ORDERING, "avro"), - arguments(RecordMergeMode.EVENT_TIME_ORDERING, "parquet"), - arguments(RecordMergeMode.CUSTOM, "avro")); + arguments(RecordMergeMode.EVENT_TIME_ORDERING, "avro")); Review Comment: _⚠️ Potential issue_ | _🟡 Minor_ **Log file only test cases reduced to avro format only.** `logFileOnlyCases()` now only tests `avro` log data block format, removing `parquet` format coverage. 
If parquet log blocks are supported in production, this reduces test coverage. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java` around lines 451 - 452, The test parameterization was narrowed to only the "avro" log data block format, removing "parquet" coverage; update the parameter list used by the test (in TestHoodieFileGroupReaderBase, the method supplying arguments for logFileOnlyCases) to include both "avro" and "parquet" again (e.g., add corresponding arguments(RecordMergeMode.COMMIT_TIME_ORDERING, "parquet") and arguments(RecordMergeMode.EVENT_TIME_ORDERING, "parquet") so the test runs for both formats). ``` </details> <!-- fingerprinting:phantom:medusa:ocelot --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895741)) (source:comment#3082895741) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala: ########## @@ -1491,8 +1487,8 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin var (_, readOpts) = getWriterReaderOpts(readType) var (writeOpts, _) = getWriterReaderOpts(writeType) - readOpts = readOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") - writeOpts = writeOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") + readOpts = readOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro") + writeOpts = writeOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro") Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **This test no longer exercises parquet log blocks.** `testRecordTypeCompatibilityWithParquetLog` now forces `LOGFILE_DATA_BLOCK_FORMAT=avro`, so it will stay green even if the parquet-log reader regresses. 
Keep this test on `parquet` or rename/split it. <details> <summary>Minimal fix</summary> ```diff - readOpts = readOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro") - writeOpts = writeOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro") + readOpts = readOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") + writeOpts = writeOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") ``` </details> <!-- suggestion_start --> <details> <summary>📝 Committable suggestion</summary> > ‼️ **IMPORTANT** > Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements. ```suggestion readOpts = readOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") writeOpts = writeOpts ++ Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") ``` </details> <!-- suggestion_end --> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala` around lines 1490 - 1491, The test testRecordTypeCompatibilityWithParquetLog was changed to force HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT to "avro" (via readOpts and writeOpts), so it no longer validates parquet log-block behavior; revert or adjust these assignments so the test actually uses "parquet" (e.g., set readOpts and writeOpts to Map(HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet") or remove the override), or alternatively split/rename the test into separate cases (one explicitly setting "avro" and another keeping "parquet") so the parquet-log reader remains covered. 
``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895765)) (source:comment#3082895765) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala: ########## @@ -100,8 +102,11 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase { // TODO(HUDI-8820): enable MDT after supporting MDT with table version 6 ("hoodie.metadata.enable" -> "false", tableVersion.toInt == 6) ) { - withRecordType()(withTempDir { tmp => + withRecordType()({ val tableName = generateTableName + val currTime = System.currentTimeMillis() Review Comment: <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Hardcoded developer-local path and global mutable state** Two related problems here: 1. **Hardcoded path**: `s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"` is a hardcoded path on a specific developer's machine. Tests will fail on any other CI or developer environment. 2. **Global mutable `cnt` in companion object**: The companion object `TestMergeModeCommitTimeOrdering.cnt` is a JVM-singleton `var` incremented across test invocations. This is a form of shared mutable state that makes test execution order-dependent and could cause path collisions or unexpected behavior if tests are run in parallel. Both issues stem from the same root cause — the code was modified to write tables to a fixed developer path for manual inspection with hudi-rs. This logic should be reverted to use `withTempDir { tmp => ... }` as in the original code. The same pattern appears in the second test block in this file (~line 255). 
— *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852816)) (source:comment#3082852816) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala: ########## @@ -0,0 +1,823 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.spark.sql.hudi.dml.others + +import org.apache.hudi.HoodieCLIUtils +import org.apache.hudi.common.model.{FileSlice, HoodieLogFile} +import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.log.HoodieLogFileReader +import org.apache.hudi.common.table.log.block.HoodieLogBlock.{HeaderMetadataType, HoodieLogBlockType} +import org.apache.hudi.common.testutils.HoodieTestUtils +import org.apache.hudi.common.util.{Option => HOption} +import org.apache.hudi.testutils.DataSourceTestUtils + +import org.apache.hadoop.fs.Path +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.getMetaClientAndFileSystemView + +import java.util.Collections + +import scala.collection.JavaConverters._ + +/** + * Creates 4 MOR tables with specific file slice layouts, validates each layout + * via the same HoodieTableFileSystemView + getLatestMergedFileSlicesBeforeOrOn + * code path that SELECT uses, then saves gold data alongside each table. + */ +class TestMORFileSliceLayouts extends HoodieSparkSqlTestBase { + + val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss" + + private def cleanPath(path: String): Unit = { + val hadoopPath = new Path(path) + val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf()) + if (fs.exists(hadoopPath)) { + fs.delete(hadoopPath, true) + } Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **Don't hard-code a workstation path into an automated test suite.** These tests assume `/home/ubuntu/...` exists and `cleanPath` recursively deletes under it. That will fail on CI/non-Ubuntu hosts and is unsafe on developer machines. Use a harness-provided temp directory instead. 
<details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala` around lines 48 - 55, Replace the hard-coded BASE_OUTPUT_PATH and the unsafe cleanPath usage with a test-scoped temporary directory: change BASE_OUTPUT_PATH to be created at runtime (e.g., use java.nio.file.Files.createTempDirectory or the test harness/tempDir utility) and ensure cleanPath operates on that temp dir only (or remove cleanPath and rely on the temp directory lifecycle). Update references to BASE_OUTPUT_PATH and the cleanPath helper so all tests use this generated path, and ensure the temp directory is deleted/cleaned in an afterEach/afterAll teardown to avoid leaving artifacts. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895785)) (source:comment#3082895785) ########## hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/TestMergeHandle.java: ########## @@ -277,7 +278,7 @@ void testWriteFailures() throws Exception { } @ParameterizedTest - @ValueSource(strings = {"EVENT_TIME_ORDERING", "COMMIT_TIME_ORDERING", "CUSTOM", "CUSTOM_MERGER"}) + @ValueSource(strings = {"EVENT_TIME_ORDERING", "COMMIT_TIME_ORDERING"}) public void testFGReaderBasedMergeHandleInsertUpsertDelete(String mergeMode) throws IOException { testFGReaderBasedMergeHandleInsertUpsertDeleteInternal(mergeMode, new Properties(), false); Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **Keep explicit coverage for `CUSTOM` merge modes.** `testFGReaderBasedMergeHandleInsertUpsertDeleteInternal(...)` still has dedicated `CUSTOM` / `CUSTOM_MERGER` branches, so removing those values from `@ValueSource` just makes those paths untested. 
If they are unsupported only in the Gluten run, gate them there or replace them with an explicit unsupported assertion instead of dropping the coverage globally. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/TestMergeHandle.java` around lines 280 - 283, The test removed CUSTOM/CUSTOM_MERGER branch coverage; restore explicit coverage by including "CUSTOM" and "CUSTOM_MERGER" in the `@ValueSource` for testFGReaderBasedMergeHandleInsertUpsertDelete or, if those modes are genuinely unsupported under the Gluten environment, add a guard inside testFGReaderBasedMergeHandleInsertUpsertDelete (calling testFGReaderBasedMergeHandleInsertUpsertDeleteInternal) that detects the Gluten run (e.g., an existing isGluten()/environment check) and asserts the unsupported behavior for CUSTOM/CUSTOM_MERGER instead of silently skipping them; update references to testFGReaderBasedMergeHandleInsertUpsertDeleteInternal, the ValueSource annotation, and any Gluten-run detection logic accordingly. 
``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:79e7aa6f-65c0-490f-bb56-4d8aab80feac --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895745)) (source:comment#3082895745) ########## pom.xml: ########## @@ -170,7 +170,7 @@ <rocksdbjni.version>7.5.3</rocksdbjni.version> <spark33.version>3.3.4</spark33.version> <spark34.version>3.4.3</spark34.version> - <spark35.version>3.5.5</spark35.version> + <spark35.version>3.5.2-INTERNAL-0.16</spark35.version> Review Comment: _⚠️ Potential issue_ | _🔴 Critical_ <details> <summary>🧩 Analysis chain</summary> 🏁 Script executed: ```shell # First, let me examine the pom.xml file structure and the specific lines mentioned wc -l pom.xml ``` Repository: yihua/hudi Length of output: 66 --- 🏁 Script executed: ```shell # Check the area around line 173 for the spark35.version sed -n '170,180p' pom.xml ``` Repository: yihua/hudi Length of output: 631 --- 🏁 Script executed: ```shell # Check the area around lines 1966-1970 for the Onehouse CodeArtifact repository sed -n '1960,1975p' pom.xml ``` Repository: yihua/hudi Length of output: 479 --- 🏁 Script executed: ```shell # Search for profile definitions to verify which is the default profile rg -n '<profile>' pom.xml | head -20 ``` Repository: yihua/hudi Length of output: 433 --- 🏁 Script executed: ```shell # Look for activeByDefault or default profile settings rg -n -A 3 'activeByDefault\|<id>spark3' pom.xml ``` Repository: yihua/hudi Length of output: 36 --- 🏁 Script executed: ```shell # Search for spark3.5 profile definition and its activation settings rg -n -A 15 '<id>spark3\.5' pom.xml ``` Repository: yihua/hudi Length of output: 1058 --- 🏁 Script executed: ```shell # Look for all profiles with their IDs and activation settings rg -n -B 2 -A 8 '<profile>' pom.xml | grep -E '(profile>|<id>|activeByDefault|activation)' ``` Repository: yihua/hudi Length of output: 2375 --- 🏁 Script 
executed: ```shell # Check line 1974 onwards to see the first profile definition after repositories sed -n '1974,2050p' pom.xml ``` Repository: yihua/hudi Length of output: 2346 --- 🏁 Script executed: ```shell # Get the complete spark3.5 profile definition to check for activeByDefault sed -n '2656,2715p' pom.xml ``` Repository: yihua/hudi Length of output: 3074 --- 🏁 Script executed: ```shell # Check if there are any activeByDefault declarations in profiles rg -n 'activeByDefault' pom.xml ``` Repository: yihua/hudi Length of output: 159 --- 🏁 Script executed: ```shell # Check properties section to understand default spark module sed -n '140,200p' pom.xml ``` Repository: yihua/hudi Length of output: 3710 --- </details> **This makes the default Spark 3.5 build depend on private infrastructure.** `spark3.5` is the active-by-default profile, and the default properties use `${spark35.version}` which now points to `3.5.2-INTERNAL-0.16`. With the Onehouse CodeArtifact repository configured in the global repositories section, every default build will attempt to resolve this internal coordinate from private AWS infrastructure. Public CI and external contributors cannot access this repository without Onehouse credentials. Please keep the public Spark coordinate as default and move the internal version and repository behind an explicit opt-in profile. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In `@pom.xml` at line 173, The pom currently sets the Maven property spark35.version to the internal coordinate "3.5.2-INTERNAL-0.16" which makes the active-by-default profile spark3.5 depend on private Onehouse CodeArtifact repos; revert the default <spark35.version> to the public Spark artifact (e.g., a public 3.5.x coordinate) and move the internal "3.5.2-INTERNAL-0.16" value and the Onehouse repository declaration into a new opt-in Maven profile (e.g., profile id "onehouse-internal" or similar); update references to spark35.version to remain unchanged so external builds use the public coordinate unless the opt-in profile is activated. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:79e7aa6f-65c0-490f-bb56-4d8aab80feac --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895791)) (source:comment#3082895791) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala: ########## @@ -0,0 +1,823 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.spark.sql.hudi.dml.others + +import org.apache.hudi.HoodieCLIUtils +import org.apache.hudi.common.model.{FileSlice, HoodieLogFile} +import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.log.HoodieLogFileReader +import org.apache.hudi.common.table.log.block.HoodieLogBlock.{HeaderMetadataType, HoodieLogBlockType} +import org.apache.hudi.common.testutils.HoodieTestUtils +import org.apache.hudi.common.util.{Option => HOption} +import org.apache.hudi.testutils.DataSourceTestUtils + +import org.apache.hadoop.fs.Path +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.getMetaClientAndFileSystemView + +import java.util.Collections + +import scala.collection.JavaConverters._ + +/** + * Creates 4 MOR tables with specific file slice layouts, validates each layout + * via the same HoodieTableFileSystemView + getLatestMergedFileSlicesBeforeOrOn + * code path that SELECT uses, then saves gold data alongside each table. + */ +class TestMORFileSliceLayouts extends HoodieSparkSqlTestBase { + + val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss" + + private def cleanPath(path: String): Unit = { + val hadoopPath = new Path(path) + val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf()) + if (fs.exists(hadoopPath)) { + fs.delete(hadoopPath, true) + } + } + + // ─── Core validation helper ─── + // Returns the file slices exactly as the SELECT query path would see them: + // HoodieTableFileSystemView → getLatestMergedFileSlicesBeforeOrOn(partition, queryInstant) + // This mirrors BaseHoodieTableFileIndex.filterFiles (line 342-355). 
+ private def getQueryFileSlices(basePath: String): Seq[FileSlice] = { + val (metaClient, fsView) = getMetaClientAndFileSystemView(basePath) + val queryInstant = metaClient.getActiveTimeline + .filterCompletedInstants().lastInstant().get().requestedTime() + fsView.getLatestMergedFileSlicesBeforeOrOn("", queryInstant) + .iterator().asScala.toSeq + } Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **Close the file-system view and log readers on every exit path.** `getMetaClientAndFileSystemView` creates a `HoodieTableFileSystemView` that is never closed, and each `HoodieLogFileReader` only closes on the happy path. A failing read/assertion here will leak handles across this suite. Also applies to: 111-131 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala` around lines 62 - 68, getQueryFileSlices currently obtains resources from getMetaClientAndFileSystemView (which returns a HoodieTableFileSystemView) and iterates log readers but does not guarantee closure on exception paths; update getQueryFileSlices to ensure the HoodieTableFileSystemView and any HoodieLogFileReader instances are closed in all exit paths (use try/finally or a loan/withResource pattern) so that fsView.close() is always called and each log reader is closed even if an assertion or read fails; locate getMetaClientAndFileSystemView and any code that constructs HoodieLogFileReader (also in the 111-131 region) and wrap their usage in try/finally blocks or a managed-resource helper to release resources reliably. 
``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895786)) (source:comment#3082895786) ########## hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java: ########## @@ -1157,6 +1158,7 @@ private HoodieWriteConfig getWriteConfig(boolean autoCommit, boolean rollbackUsi /** * Scenario: data table is updated, no changes to MDT */ + @Disabled("Table version 6 only") Review Comment: _⚠️ Potential issue_ | _🟠 Major_ <details> <summary>🧩 Analysis chain</summary> 🏁 Script executed: ```shell #!/bin/bash # Verify these v6 rollback-failure tests are currently hard-disabled in both rollback suites. rg -n --type=java '@Disabled\("Table version 6 only"\)|testRollbackWithFailurePreMDT|testRollbackWithFailurePostMDT|testRollbackWithFailurePostMDTRollbackFailsPreMDT|testRollbackWithFailureInMDT' \ hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java \ hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableRollbackTableVersionSix.java ``` Repository: yihua/hudi Length of output: 4310 --- 🏁 Script executed: ```shell # Check for existing `@Tag` usage in the test files to confirm pattern compatibility rg -n '@Tag\(' hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/ | head -20 ``` Repository: yihua/hudi Length of output: 969 --- 🏁 Script executed: ```shell # Check Maven pom.xml for test profiles or tag-based test execution configuration fd pom.xml | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 5 -B 5 "tag\|profile\|maven-surefire" {} 2>/dev/null | head -30' ``` Repository: yihua/hudi Length of output: 5745 --- 🏁 Script executed: ```shell # Look for any CI 
configuration files that might show test execution strategy fd -e '.yml' -e '.yaml' -e '.sh' | grep -i 'ci\|github\|jenkins\|test' | head -10 ``` Repository: yihua/hudi Length of output: 767 --- 🏁 Script executed: ```shell # Search for Maven Surefire plugin configuration in the main pom.xml find . -name "pom.xml" -path "*/hudi-spark-datasource/*" | head -1 | xargs grep -A 20 "maven-surefire-plugin" | head -40 ``` Repository: yihua/hudi Length of output: 36 --- 🏁 Script executed: ```shell # Look for test-related profiles in pom.xml files rg -A 10 '<profile>.*test|includedGroups|groupsToInclude' --type=xml | head -50 ``` Repository: yihua/hudi Length of output: 36 --- 🏁 Script executed: ```shell # Check for GitHub Actions workflow files fd -e '.yml' -e '.yaml' .github/workflows 2>/dev/null | head -10 | xargs cat ``` Repository: yihua/hudi Length of output: 36 --- 🏁 Script executed: ```shell # Search for root pom.xml and check for test execution configuration cat pom.xml 2>/dev/null | grep -A 5 -B 5 "surefire\|maven-failsafe" || echo "No root pom.xml or no surefire config" ``` Repository: yihua/hudi Length of output: 7972 --- 🏁 Script executed: ```shell # Check project structure to understand if Maven or Gradle ls -la | grep -E 'pom\.xml|build\.gradle|gradlew' ``` Repository: yihua/hudi Length of output: 113 --- 🏁 Script executed: ```shell # Look for any CI configuration files more broadly find . 
-maxdepth 3 -type f \( -name '*.yml' -o -name '*.yaml' \) 2>/dev/null | head -20 ``` Repository: yihua/hudi Length of output: 941 --- 🏁 Script executed: ```shell # Check the GitHub Actions workflow for functional tests to see how they're executed cat .github/workflows/azure_ci_check.yml | head -100 ``` Repository: yihua/hudi Length of output: 3054 --- 🏁 Script executed: ```shell # Verify the complete Maven profile structure for tag-based filtering grep -A 25 '<profile>' pom.xml | grep -A 25 'functional-b\|functional-c\|excludedGroups\|groups>' ``` Repository: yihua/hudi Length of output: 3347 --- </details> **Replace `@Disabled` with `@Tag` to restore conditional test execution.** These four rollback failure-path tests (lines 1161, 1170, 1180, 1189) are completely skipped in all runs due to hard disabling. The codebase already uses `@Tag` + Maven profiles (e.g., `functional-tests`, `functional-tests-b`, `functional-tests-c`) to organize version-specific tests. Apply the same pattern here: <details> <summary>Required changes</summary> 1. Replace `@Disabled("Table version 6 only")` with `@Tag("table-version-6")` on all four methods in both `TestHoodieSparkMergeOnReadTableRollback.java` and `TestHoodieSparkCopyOnWriteTableRollbackTableVersionSix.java`. 2. Add a Maven profile in `pom.xml` (or extend an existing functional-tests profile) that includes the `table-version-6` tag via the Surefire plugin's `<groups>` configuration (a `<profile>` cannot carry a bare `<configuration>` element; tag filtering belongs on `maven-surefire-plugin`): ```xml <profile> <id>functional-tests-v6</id> <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-surefire-plugin</artifactId> <configuration> <groups>table-version-6</groups> </configuration> </plugin> </plugins> </build> </profile> ``` 3. Wire this profile into CI (Azure CI) to ensure these tests run in the appropriate validation suite. </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed.
In `@hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableRollback.java` at line 1161, Replace the hard-disabled annotations so the rollback failure-path tests run conditionally: in TestHoodieSparkMergeOnReadTableRollback.java and TestHoodieSparkCopyOnWriteTableRollbackTableVersionSix.java, change each method annotated with `@Disabled`("Table version 6 only") to `@Tag`("table-version-6") (the four rollback test methods around the existing markers). Then add/extend a Maven profile (e.g., functional-tests-v6) in pom.xml to include the table-version-6 tag via the surefire/failsafe groups configuration, and ensure the Azure CI pipeline triggers that profile in the appropriate validation job so the table-version-6 tests execute in CI. ``` </details> <!-- fingerprinting:phantom:poseidon:hawk:91643543-15be-490b-8f5c-8500eba38510 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895752)) (source:comment#3082895752) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala: ########## @@ -100,8 +102,11 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase { // TODO(HUDI-8820): enable MDT after supporting MDT with table version 6 ("hoodie.metadata.enable" -> "false", tableVersion.toInt == 6) ) { - withRecordType()(withTempDir { tmp => + withRecordType()({ val tableName = generateTableName + val currTime = System.currentTimeMillis() + val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime" + println(s"fdss: ${tablePath}") Review Comment: _⚠️ Potential issue_ | _🔴 Critical_ **Critical: Remove hardcoded absolute path and debug statement.** This hardcoded path (`/home/ubuntu/ws3/hudi-rs/tables/...`) will fail on any machine other than the original developer's. 
The `println(s"fdss: ${tablePath}")` appears to be a debugging artifact. Use the test framework's temp directory mechanism (e.g., `withTempDir`) instead of hardcoded paths. <details> <summary>Suggested fix pattern</summary> ```diff - val currTime = System.currentTimeMillis() - val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime" - println(s"fdss: ${tablePath}") + withTempDir { tmp => + val tablePath = tmp.getCanonicalPath ``` And restore the original `withTempDir` pattern that was likely removed. </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala` around lines 107 - 109, The test currently uses a hardcoded absolute path via tablePath built from TestMergeModeCommitTimeOrdering.cnt and currTime and prints it with println (a debug artifact); replace this by using the test framework's temporary directory helper (e.g., withTempDir) to create tablePath inside the temp dir and remove the println line; locate the usage where currTime/tablePath/Println are set in TestMergeModeCommitTimeOrdering and restore the original withTempDir pattern so the path is portable and no debug output remains. ``` </details> <!-- fingerprinting:phantom:medusa:ocelot --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895779)) (source:comment#3082895779) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala: ########## @@ -583,6 +576,9 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin // First Operation: // Producing parquet files to three default partitions. // SNAPSHOT view on MOR table with parquet files only. 
Review Comment: <a href="#"><img alt="P2" src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7" align="top"></a> **Leftover commented-out developer debug path** These two commented-out lines are clearly debugging artifacts from manual testing with hudi-rs: ```scala // val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/testPrunedFiltered" // val tablePath = s"$BASE_OUTPUT_PATH/table_log_compaction_${System.currentTimeMillis()}" ``` They should be removed entirely from the file before merging. — *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852883)) (source:comment#3082852883) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala: ########## @@ -226,6 +674,7 @@ class TestHoodieFileGroupReaderOnSpark extends TestHoodieFileGroupReaderBase[Int (10, "3", "rider-C", "driver-C", 33.9, "i"), (20, "1", "rider-Z", "driver-Z", 27.7, "i")) + @Disabled("Custom delete payload not supported") @ParameterizedTest Review Comment: _⚠️ Potential issue_ | _🟠 Major_ **Don't permanently disable custom-delete coverage.** Gluten is only enabled when `gluten.bundle.jar` is present, but this annotation removes the custom-delete path from normal Spark runs too. Skip it conditionally when native execution is active instead of disabling the test for every profile. <details> <summary>Conditional-skip example</summary> ```diff -import org.junit.jupiter.api.{AfterEach, BeforeEach, Disabled, Test} +import org.junit.jupiter.api.{AfterEach, Assumptions, BeforeEach, Test} ... 
- `@Disabled`("Custom delete payload not supported") `@ParameterizedTest` `@MethodSource`(Array("customDeleteTestParams")) def testCustomDelete(useFgReader: String, tableType: String, positionUsed: String, mergeMode: String): Unit = { + Assumptions.assumeTrue( + System.getProperty("gluten.bundle.jar") == null, + "Custom delete payload is not supported with Gluten/Velox") ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala` around lines 677 - 678, The test is permanently disabled with `@Disabled`("Custom delete payload not supported"), removing custom-delete coverage for all runs; instead remove the `@Disabled` and add a conditional skip that only disables the test when native/Gluten execution is active (e.g., use JUnit Assumptions inside the test: call Assumptions.assumeFalse(isGlutenNativeExecutionActive()) or Assumptions.assumeTrue(!isGlutenNativeExecutionActive()), where isGlutenNativeExecutionActive() checks the presence of the gluten bundle (gluten.bundle.jar) or existing project helper), so the test in TestHoodieFileGroupReaderOnSpark runs normally under standard Spark profiles but is skipped only when native execution is enabled. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895760)) (source:comment#3082895760) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala: ########## @@ -0,0 +1,823 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.spark.sql.hudi.dml.others + +import org.apache.hudi.HoodieCLIUtils +import org.apache.hudi.common.model.{FileSlice, HoodieLogFile} +import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.log.HoodieLogFileReader +import org.apache.hudi.common.table.log.block.HoodieLogBlock.{HeaderMetadataType, HoodieLogBlockType} +import org.apache.hudi.common.testutils.HoodieTestUtils +import org.apache.hudi.common.util.{Option => HOption} +import org.apache.hudi.testutils.DataSourceTestUtils + +import org.apache.hadoop.fs.Path +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase +import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.getMetaClientAndFileSystemView + +import java.util.Collections + +import scala.collection.JavaConverters._ + +/** + * Creates 4 MOR tables with specific file slice layouts, validates each layout + * via the same HoodieTableFileSystemView + getLatestMergedFileSlicesBeforeOrOn + * code path that SELECT uses, then saves gold data alongside each table. 
+ */ +class TestMORFileSliceLayouts extends HoodieSparkSqlTestBase { + + val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss" + + private def cleanPath(path: String): Unit = { + val hadoopPath = new Path(path) + val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf()) + if (fs.exists(hadoopPath)) { + fs.delete(hadoopPath, true) Review Comment: <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Hardcoded developer-local path used as base output directory** `val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss"` is a hardcoded absolute path on a specific developer's machine. All four tests in this file write their tables to sub-directories of this path. These tests will fail (or worse, silently write data to an unexpected location) on any other machine. This should be replaced with a temporary directory, consistent with how other test classes in the project handle isolation, e.g.: ```suggestion val BASE_OUTPUT_PATH = createTempDir().getAbsolutePath ``` Or use the `withTempDir` pattern already used throughout `HoodieSparkSqlTestBase`. — *Greptile* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082852745)) (source:comment#3082852745) ########## RCA_Gluten_Velox_MOR_Failures.md: ########## @@ -0,0 +1,373 @@ +# RCA: Gluten/Velox MOR Test Failures + +Three sub-tests fail when Gluten/Velox native execution is enabled: +`testMORWithMap`, `testMORWithDecimal`, `testMORWithArray`. + +Each has a distinct root cause. This document traces execution with raw code and supporting log evidence. + +--- + +## 1. 
`testMORWithMap` — `sizeInBytes (276) should be a multiple of 8` + +### Symptom + +``` +java.lang.AssertionError: sizeInBytes (276) should be a multiple of 8 + at org.apache.spark.sql.catalyst.expressions.UnsafeRow.setTotalSize(UnsafeRow.java:172) + at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.getRow(UnsafeRowWriter.java:78) + at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) + at org.apache.spark.sql.catalyst.expressions.codegen.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) + at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$...$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) + at HoodieSparkUtils$.$anonfun$createRdd$2(HoodieSparkUtils.scala:107) +``` Review Comment: _⚠️ Potential issue_ | _🟡 Minor_ **Add languages to the fenced code blocks.** `markdownlint` is already flagging these fences, so this doc will keep warning until the blocks are tagged (`text`, `scala`, `java`, `cpp`, etc.). Also applies to: 32-36, 82-85, 184-203, 206-210, 245-251, 254-264, 320-327, 347-351 <details> <summary>🧰 Tools</summary> <details> <summary>🪛 markdownlint-cli2 (0.22.0)</summary> [warning] 14-14: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> </details> <details> <summary>🤖 Prompt for AI Agents</summary> ```` Verify each finding against the current code and only fix it if needed. 
In `@RCA_Gluten_Velox_MOR_Failures.md` around lines 14 - 22, Several fenced code blocks in the document are missing language specifiers; update each triple-backtick fence to include an appropriate language token (e.g., ```java for the Java stacktrace starting with java.lang.AssertionError and entries like UnsafeRow.setTotalSize, ```scala for Scala snippets referencing HoodieSparkUtils.createRdd, or ```text/```cpp as appropriate) so markdownlint stops flagging them; apply this change to the reported ranges (including the stacktrace block and the other blocks at 32-36, 82-85, 184-203, 206-210, 245-251, 254-264, 320-327, 347-351) ensuring each fence uses the correct language tag that matches the contained code/sample. ```` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895798)) (source:comment#3082895798) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala: ########## @@ -249,8 +255,11 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase { // TODO(HUDI-8820): enable MDT after supporting MDT with table version 6 ("hoodie.metadata.enable" -> "false", tableVersion.toInt == 6) ) { - withRecordType()(withTempDir { tmp => + withRecordType()({ val tableName = generateTableName + val currTime = System.currentTimeMillis() + val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime" + println(s"fdss: ${tablePath}") Review Comment: _⚠️ Potential issue_ | _🔴 Critical_ **Critical: Same hardcoded path and debug statement in second test.** This is a duplicate of the issue in the first test - hardcoded absolute path and debug println must be removed. 
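For both occurrences, a machine-independent base directory can come straight from the JVM. A minimal sketch of the idea outside the Hudi test harness (the `hudi-mor-test-` prefix and class name are illustrative; inside the suite, `withTempDir` from `HoodieSparkSqlTestBase` achieves the same thing):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PortableTablePath {
    public static void main(String[] args) throws IOException {
        // Isolated, machine-independent base dir instead of /home/ubuntu/...
        Path base = Files.createTempDirectory("hudi-mor-test-");
        Path tablePath = base.resolve("table_" + System.currentTimeMillis());
        Files.createDirectories(tablePath);
        System.out.println(Files.isDirectory(tablePath)); // prints true
        // Clean up deepest-first so no stray directories survive the test.
        Files.delete(tablePath);
        Files.delete(base);
    }
}
```

Unlike a hardcoded path, nothing here collides between concurrent test runs or leaks data outside the build sandbox.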
<details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala` around lines 260 - 262, Remove the hardcoded absolute path and debug println: in the TestMergeModeCommitTimeOrdering test where tablePath is constructed using TestMergeModeCommitTimeOrdering.cnt and System.currentTimeMillis(), replace the absolute "/home/ubuntu/..." path with a temp or relative test directory (or use the test framework's temporary folder utility) and delete the println(s"fdss: ${tablePath}") call; ensure tablePath is built from a configurable base (or tempDir) so the test is not environment-specific. ``` </details> <!-- fingerprinting:phantom:medusa:ocelot --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895783)) (source:comment#3082895783) ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala: ########## @@ -2259,6 +2256,59 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin .mode(SaveMode.Append) .save(basePath) } + + /** + * Smoke test for Gluten/Velox native execution. + * Writes a v9 MOR table (Spark record type, parquet log blocks), reads it back, + * runs collect(), then prints the executedPlan so the caller can confirm that + * Gluten operators appear (e.g. VeloxColumnarToRow, ^(N) FileScanTransformer HudiFileGroup). 
+ * + * Run with: mvn test -Dtest=TestMORDataSource#testGlutenVeloxNativeExecution + * -Dgluten.bundle.jar=<path/to/gluten-velox-bundle-spark3.5*.jar> + */ + @Test + def testGlutenVeloxNativeExecution(): Unit = { + val (writeOpts, readOpts) = getWriterReaderOpts(HoodieRecordType.SPARK) + val v9Opts = writeOpts ++ Map( + HoodieTableConfig.VERSION.key() -> HoodieTableVersion.NINE.versionCode().toString, + HoodieWriteConfig.WRITE_TABLE_VERSION.key() -> HoodieTableVersion.NINE.versionCode().toString + ) + + // Initial insert — 10 records across partitions + val records = recordsToStrings(dataGen.generateInserts("001", 10)).asScala.toSeq + val inputDF = spark.read.json(spark.sparkContext.parallelize(records, 2)) + inputDF.write.format("hudi") + .options(v9Opts) + .option(DataSourceWriteOptions.TABLE_TYPE.key, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) + .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) + .option("hoodie.compact.inline", "false") + .mode(SaveMode.Overwrite) + .save(basePath) + + // Upsert 5 records to create delta log files (MOR characteristic) + val updates = recordsToStrings(dataGen.generateUniqueUpdates("002", 5)).asScala.toSeq + val updatesDF = spark.read.json(spark.sparkContext.parallelize(updates, 2)) + updatesDF.write.format("hudi") + .options(v9Opts) + .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) + .mode(SaveMode.Append) + .save(basePath) + + // Read back with snapshot query and project a few columns + val hudiDF = spark.read.format("hudi") + .options(readOpts) + .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) + .load(basePath) + val df = hudiDF.select("driver", "rider", "fare") + + val rows = df.collect() + println(s"[GLUTEN-TEST] Row count: ${rows.length}") + println("[GLUTEN-TEST] == Executed Physical Plan ==") + df.queryExecution.executedPlan.toString.split("\n") + .foreach(line => 
println(s"[GLUTEN-TEST] $line")) + + assertTrue(rows.length > 0) Review Comment: _⚠️ Potential issue_ | _🟠 Major_ <details> <summary>🧩 Analysis chain</summary> 🏁 Script executed: ```shell # First, let's locate the test file and verify its existence find . -name "TestMORDataSource.scala" -type f ``` Repository: yihua/hudi Length of output: 154 --- 🏁 Script executed: ```shell # Get the line count to understand file size wc -l hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala ``` Repository: yihua/hudi Length of output: 157 --- 🏁 Script executed: ```shell # Extract the specific test method around lines 2269-2310 sed -n '2260,2320p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala ``` Repository: yihua/hudi Length of output: 2841 --- 🏁 Script executed: ```shell # Search for other Gluten-related tests to understand pattern rg -i "gluten|velox" hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala -A 5 -B 2 ``` Repository: yihua/hudi Length of output: 1211 --- 🏁 Script executed: ```shell # Check for assumeTrue imports and patterns in the test file rg "import.*Assumptions|assumeTrue|skipTest|skipIf" hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala ``` Repository: yihua/hudi Length of output: 36 --- 🏁 Script executed: ```shell # Look for gluten.bundle.jar usage in the codebase rg "gluten\.bundle\.jar|gluten\.bundle\.|GLUTEN" --type scala ``` Repository: yihua/hudi Length of output: 707 --- 🏁 Script executed: ```shell # Check JUnit imports in the test file head -50 hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala | grep -i "import.*junit" ``` Repository: yihua/hudi Length of output: 122 --- 🏁 Script executed: ```shell # Get more imports from the test file head -100 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala | grep "^import" ``` Repository: yihua/hudi Length of output: 2719 --- 🏁 Script executed: ```shell # Search for System.getProperty usage in the test file rg "System\.getProperty" hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala ``` Repository: yihua/hudi Length of output: 121 --- 🏁 Script executed: ```shell # Search for System.getProperty usage patterns across test files rg "System\.getProperty" hudi-spark-datasource/hudi-spark/src/test/scala/ --type scala | head -20 ``` Repository: yihua/hudi Length of output: 1199 --- 🏁 Script executed: ```shell # Check for skipTest or assumption patterns in the project rg "assumeTrue|skipTest|skipIf" hudi-spark-datasource/hudi-spark/src/test/scala/ --type scala | head -15 ``` Repository: yihua/hudi Length of output: 358 --- </details> **Assert native execution by checking the plan contains Gluten/Velox operators.** The test currently passes with plain Spark execution since it only checks `rows.length > 0`. Add `assumeTrue` to skip when the Gluten bundle is absent, and assert the executed plan contains expected native operators. <details> <summary>Suggested assertion</summary> ```diff +import org.junit.jupiter.api.Assumptions.assumeTrue ... `@Test` def testGlutenVeloxNativeExecution(): Unit = { + assumeTrue( + System.getProperty("gluten.bundle.jar") != null, + "Gluten/Velox smoke test requires -Dgluten.bundle.jar") + val (writeOpts, readOpts) = getWriterReaderOpts(HoodieRecordType.SPARK) ... 
- val rows = df.collect() - println(s"[GLUTEN-TEST] Row count: ${rows.length}") - println("[GLUTEN-TEST] == Executed Physical Plan ==") - df.queryExecution.executedPlan.toString.split("\n") - .foreach(line => println(s"[GLUTEN-TEST] $line")) - - assertTrue(rows.length > 0) + val rows = df.collect() + val plan = df.queryExecution.executedPlan.treeString + println(s"[GLUTEN-TEST] Row count: ${rows.length}") + println(s"[GLUTEN-TEST] == Executed Physical Plan ==\n$plan") + + assertTrue(rows.nonEmpty) + assertTrue( + plan.contains("Velox") || plan.contains("Gluten"), + s"Expected Gluten/Velox operators in the executed plan, got:\n$plan") } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala` around lines 2269 - 2310, The test testGlutenVeloxNativeExecution currently only checks rows.length > 0 so it passes without Gluten/Velox; update it to first assumeTrue that the Gluten native bundle is available (skip otherwise) and then assert that df.queryExecution.executedPlan.toString contains the expected native operator identifiers (e.g., "Velox" or "Gluten" or the specific operator names used by your runtime) after collecting/printing the plan; use the existing df and df.queryExecution.executedPlan references and keep the existing plan printing but replace or augment the final assertTrue(rows.length > 0) with an assumeTrue(...) then an assertion that the plan string contains the native operator marker. ``` </details> <!-- fingerprinting:phantom:medusa:grasshopper:163599a9-90f2-46bb-9a83-b92fc3dd8804 --> <!-- This is an auto-generated comment by CodeRabbit --> — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/39#discussion_r3082895772)) (source:comment#3082895772) -- This is an automated message from the Apache Git Service. 
