MonsterChenzhuo opened a new pull request, #6404:
URL: https://github.com/apache/hive/pull/6404
### What changes were proposed in this pull request?
This PR fixes a correctness issue in the MR execution path for queries that
`LEFT JOIN` an aggregated subquery.
Changes in this PR:
1. `MapredLocalTask`:
   - Before the local hash-table build, remove internal virtual columns
     from the hash-table expressions:
     - `ROW__ID`
     - `INPUT__FILE__NAME`
     - `BLOCK__OFFSET__INSIDE__FILE`
   - Apply this cleanup in both local-task paths:
     - `executeInChildVM(...)`
     - `executeInProcess(...)`
   - This keeps the expression layout and key semantics consistent during
     the local map-join hash-table build.
2. `ExecMapper`:
   - Add bucket-version balancing across related `ReduceSinkOperator` /
     `TableScanOperator` nodes during mapper initialization.
   - This reduces the risk of mismatches caused by inconsistent
     bucketing-version propagation in complex MR pipelines.
Touched files:
- `ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java`
- `ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapper.java`
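The `MapredLocalTask` cleanup can be pictured with a minimal, self-contained sketch. It is not the code in this PR: expressions are modeled as plain column-name strings rather than Hive expression objects, and the class and method names (`VirtualColumnCleanup`, `stripVirtualColumns`) are hypothetical. Only the three column names come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the MapredLocalTask implementation.
class VirtualColumnCleanup {
    // The three column names listed in the PR description.
    static final Set<String> VIRTUAL_COLUMNS = Set.of(
            "ROW__ID", "INPUT__FILE__NAME", "BLOCK__OFFSET__INSIDE__FILE");

    /** Returns a copy of the expression list without virtual-column entries. */
    static List<String> stripVirtualColumns(List<String> columnExprs) {
        List<String> kept = new ArrayList<>();
        for (String expr : columnExprs) {
            // Compare case-insensitively (an assumption of this sketch).
            if (!VIRTUAL_COLUMNS.contains(expr.toUpperCase(Locale.ROOT))) {
                kept.add(expr);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> exprs =
                List.of("user_id", "ROW__ID", "metric", "INPUT__FILE__NAME");
        System.out.println(stripVirtualColumns(exprs)); // prints [user_id, metric]
    }
}
```

In this model, running the same filter in both local-task paths would give the hash-table build the same column layout regardless of which path executes.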
### Why are the changes needed?
The affected query pattern is:
- an outer `LEFT JOIN`
- whose right side is a `JOIN` + `GROUP BY` (`stddev`) subquery

For this pattern, single-key runs can return the expected results, while a
full-volume `INSERT OVERWRITE` may produce a large number of unexpected NULLs
in the right-side metric column.
The issue is caused by inconsistencies on the MR execution path, in the local
hash-table build and in map-side bucket-version handling, which can lead to
large-scale join probe mismatches.
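To show how a bucket-version mismatch could turn into NULLs under `LEFT JOIN`, here is a deliberately simplified model. It is an assumption-laden sketch, not a description of Hive internals: the two hash functions are stand-ins (not Hive's real bucketing hashes), and `BucketVersionMismatchDemo`, `bucketFor`, and `countNullsAfterLeftJoin` are hypothetical names. The build side is placed into buckets with one version, and the probe side looks up with another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model; the hash functions are stand-ins, not Hive's.
class BucketVersionMismatchDemo {
    static int bucketFor(int key, int version, int numBuckets) {
        int h = (version == 1) ? key : key * 31 + 7; // two differing hashes
        return Math.floorMod(h, numBuckets);
    }

    /** Counts left rows whose joined metric is NULL after a LEFT JOIN. */
    static int countNullsAfterLeftJoin(List<Integer> leftKeys,
                                       Map<Integer, Double> right,
                                       int buildVersion, int probeVersion,
                                       int numBuckets) {
        // Build side: one hash table per bucket, placed using buildVersion.
        List<Map<Integer, Double>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new HashMap<>());
        }
        for (Map.Entry<Integer, Double> e : right.entrySet()) {
            int b = bucketFor(e.getKey(), buildVersion, numBuckets);
            buckets.get(b).put(e.getKey(), e.getValue());
        }
        // Probe side: look in the bucket chosen by probeVersion.
        int nulls = 0;
        for (int key : leftKeys) {
            int b = bucketFor(key, probeVersion, numBuckets);
            if (buckets.get(b).get(key) == null) {
                nulls++; // LEFT JOIN keeps the row, metric becomes NULL
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        List<Integer> leftKeys = List.of(0, 1, 2, 3, 4, 5, 6, 7);
        Map<Integer, Double> right = new HashMap<>();
        for (int k : leftKeys) {
            right.put(k, k * 1.5); // every left key has a match
        }
        System.out.println(countNullsAfterLeftJoin(leftKeys, right, 1, 1, 4)); // prints 0
        System.out.println(countNullsAfterLeftJoin(leftKeys, right, 1, 2, 4)); // prints 8
    }
}
```

Every left key has a match on the right, so any NULL in the second call comes purely from probing the wrong bucket. With consistent versions no row is NULL.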
### Does this PR introduce _any_ user-facing change?
Yes (behavioral correctness fix).
Previous behavior:
- For specific query shapes, full-volume runs could produce large-scale
  unexpected NULLs in the joined metric column.

New behavior:
- Join result consistency is restored between single-key and full-volume
  runs.
- Output matches expected LEFT JOIN semantics for the affected query shape.
### How was this patch tested?
1. Build/compile validation:
   - `mvn -pl ql -am -DskipTests -DskipITs compile`
2. Repro query validation (full-volume):
   - Run the repro `INSERT OVERWRITE` query (LEFT JOIN + aggregated
     subquery).
   - Verify NULL ratio of the joined metric column before/after patch.
3. Spot-check consistency:
   - Compare a known single-key query result with full-volume partition
     output for the same key.
   - Confirm the joined metric is no longer unexpectedly NULL.
(If reviewers require it, I can add focused regression tests for this query
shape in a follow-up.)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]