abstractdog commented on code in PR #6408:
URL: https://github.com/apache/hive/pull/6408#discussion_r3050378205
##########
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java:
##########
@@ -317,14 +317,15 @@ protected static void writeData(ParquetWriter<Group>
writer, boolean isDictionar
g.addGroup("nsf").append("c", intVal).append("d", intVal);
g.append("e", doubleVal);
- Group some_null_g = group.addGroup("struct_field_some_null");
- if (i % 2 != 0) {
- some_null_g.append("f", intVal);
- }
- if (i % 3 != 0) {
- some_null_g.append("g", doubleVal);
+ if (i % 6 != 0) {
+ Group some_null_g = group.addGroup("struct_field_some_null");
Review Comment:
as you're already touching this part, please fix the `some_null_g` variable naming
##########
ql/src/test/queries/clientpositive/parquet_struct_with_null_vectorization.q:
##########
@@ -0,0 +1,31 @@
+SET hive.vectorized.execution.enabled=true;
+set hive.vectorized.execution.reduce.enabled=true;
+SET hive.fetch.task.conversion=none;
+
+CREATE TABLE test_parquet_struct_nulls (
+ id INT,
+ st_prim STRUCT<x:INT, y:INT>
+) STORED AS PARQUET;
+
+INSERT INTO test_parquet_struct_nulls VALUES
+ (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))),
+ (2, if(1=0, named_struct('x', 1, 'y', 1), null)),
+ (3, named_struct('x', 3, 'y', CAST(NULL AS INT))),
+ (4, named_struct('x', 4, 'y', 4));
+
+-- Test A: Full table scan to check JSON representation
+SELECT * FROM test_parquet_struct_nulls ORDER BY id;
+
+-- Test B: Verify IS NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL;
+
+-- Test C: Verify IS NOT NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL ORDER BY id;
+
+-- Test D: Verify field-level null evaluation inside a valid struct
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND
st_prim.x IS NULL;
+
+-- Validate withou vectorization
+SET hive.vectorized.execution.enabled=true;
+SET hive.vectorized.execution.reduce.enabled=false;
Review Comment:
isn't this simpler and closer to the meaning of "without vectorization":
```
SET hive.vectorized.execution.enabled=false;
```
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedStructColumnReader.java:
##########
@@ -46,14 +48,35 @@ public void readBatch(
fieldReaders.get(i)
.readBatch(total, vectors[i],
structTypeInfo.getAllStructFieldTypeInfos().get(i));
structColumnVector.isRepeating = structColumnVector.isRepeating &&
vectors[i].isRepeating;
+ }
+ int[] defLevels = null;
+ for (VectorizedColumnReader reader : fieldReaders) {
+ defLevels = reader.getDefinitionLevels();
+ if (defLevels != null) {
+ break;
+ }
+ }
- for (int j = 0; j < vectors[i].isNull.length; j++) {
- structColumnVector.isNull[j] =
- (i == 0) ? vectors[i].isNull[j] : structColumnVector.isNull[j] &&
vectors[i].isNull[j];
+ // Evaluate struct nullability using Parquet Definition Levels
+ if (defLevels != null) {
+ for (int j = 0; j < total; j++) {
+ if (defLevels[j] < structDefLevel) {
+ // The D-Level boundary crossed the struct. The whole struct is null.
Review Comment:
nit: is `D-Level` a well-known abbreviation? if so, it's fine; otherwise we
can use `Definition Level`, similarly to the comment above
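For context, here is a tiny standalone sketch (not Hive code; `structIsNull` and `structDefLevel` are names made up for illustration) of what the Definition Level comparison in the hunk above expresses:

```java
public class DefinitionLevelSketch {
    // In Parquet, each value of an optional column carries a definition level.
    // If a row's level is below the level at which the struct itself is defined,
    // the null occurred at (or above) the struct, so the whole struct is null.
    static boolean structIsNull(int defLevel, int structDefLevel) {
        return defLevel < structDefLevel;
    }

    public static void main(String[] args) {
        int structDefLevel = 1; // an optional struct at the top level
        System.out.println(structIsNull(0, structDefLevel)); // struct itself is null
        System.out.println(structIsNull(1, structDefLevel)); // struct present; a field may still be null
    }
}
```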
##########
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java:
##########
@@ -317,14 +317,15 @@ protected static void writeData(ParquetWriter<Group>
writer, boolean isDictionar
g.addGroup("nsf").append("c", intVal).append("d", intVal);
g.append("e", doubleVal);
- Group some_null_g = group.addGroup("struct_field_some_null");
- if (i % 2 != 0) {
- some_null_g.append("f", intVal);
- }
- if (i % 3 != 0) {
- some_null_g.append("g", doubleVal);
+ if (i % 6 != 0) {
Review Comment:
even though I like the applied math here, which is `i % 2 != 0 || i % 3 != 0`
with De Morgan's law applied to prevent an unnecessary `group.addGroup` call, I
feel we can afford this kind of verbosity in a unit test to keep it easier to read
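As a quick standalone sanity check of the equivalence referred to here (plain Java, independent of the test base class):

```java
public class DeMorganCheck {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            // verbose form: at least one of the two fields would be written
            boolean verbose = (i % 2 != 0) || (i % 3 != 0);
            // compact form used in the patch
            boolean compact = (i % 6 != 0);
            if (verbose != compact) {
                throw new AssertionError("mismatch at i=" + i);
            }
        }
        System.out.println("equivalent for 0..999");
    }
}
```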
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedStructColumnReader.java:
##########
@@ -46,14 +48,35 @@ public void readBatch(
fieldReaders.get(i)
.readBatch(total, vectors[i],
structTypeInfo.getAllStructFieldTypeInfos().get(i));
structColumnVector.isRepeating = structColumnVector.isRepeating &&
vectors[i].isRepeating;
+ }
+ int[] defLevels = null;
+ for (VectorizedColumnReader reader : fieldReaders) {
+ defLevels = reader.getDefinitionLevels();
+ if (defLevels != null) {
+ break;
+ }
+ }
Review Comment:
this looks like repeated logic; could it be handled inside `getDefinitionLevels`?
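One possible shape for that, sketched outside Hive with a hypothetical `firstDefinitionLevels` helper and a minimal reader interface standing in for `VectorizedColumnReader`:

```java
import java.util.Arrays;
import java.util.List;

public class DefLevelLookupSketch {
    interface FieldReader {
        int[] getDefinitionLevels(); // may return null if the reader has none
    }

    // Hypothetical helper: returns the first non-null definition-level array
    // among the field readers, or null if no reader exposes one.
    static int[] firstDefinitionLevels(List<FieldReader> readers) {
        for (FieldReader r : readers) {
            int[] levels = r.getDefinitionLevels();
            if (levels != null) {
                return levels;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FieldReader none = () -> null;
        FieldReader some = () -> new int[] {1, 0, 1};
        int[] levels = firstDefinitionLevels(List.of(none, some));
        System.out.println(Arrays.toString(levels)); // [1, 0, 1]
    }
}
```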
##########
ql/src/test/queries/clientpositive/parquet_struct_with_null_vectorization.q:
##########
@@ -0,0 +1,31 @@
+SET hive.vectorized.execution.enabled=true;
+set hive.vectorized.execution.reduce.enabled=true;
+SET hive.fetch.task.conversion=none;
+
+CREATE TABLE test_parquet_struct_nulls (
+ id INT,
+ st_prim STRUCT<x:INT, y:INT>
+) STORED AS PARQUET;
+
+INSERT INTO test_parquet_struct_nulls VALUES
+ (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))),
+ (2, if(1=0, named_struct('x', 1, 'y', 1), null)),
+ (3, named_struct('x', 3, 'y', CAST(NULL AS INT))),
+ (4, named_struct('x', 4, 'y', 4));
+
+-- Test A: Full table scan to check JSON representation
+SELECT * FROM test_parquet_struct_nulls ORDER BY id;
+
+-- Test B: Verify IS NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL;
+
+-- Test C: Verify IS NOT NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL ORDER BY id;
+
+-- Test D: Verify field-level null evaluation inside a valid struct
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND
st_prim.x IS NULL;
+
+-- Validate withou vectorization
Review Comment:
nit: `without`
##########
ql/src/test/queries/clientpositive/parquet_struct_with_null_vectorization.q:
##########
@@ -0,0 +1,31 @@
+SET hive.vectorized.execution.enabled=true;
+set hive.vectorized.execution.reduce.enabled=true;
+SET hive.fetch.task.conversion=none;
+
+CREATE TABLE test_parquet_struct_nulls (
+ id INT,
+ st_prim STRUCT<x:INT, y:INT>
+) STORED AS PARQUET;
+
+INSERT INTO test_parquet_struct_nulls VALUES
+ (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))),
+ (2, if(1=0, named_struct('x', 1, 'y', 1), null)),
+ (3, named_struct('x', 3, 'y', CAST(NULL AS INT))),
+ (4, named_struct('x', 4, 'y', 4));
+
+-- Test A: Full table scan to check JSON representation
+SELECT * FROM test_parquet_struct_nulls ORDER BY id;
+
+-- Test B: Verify IS NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL;
+
+-- Test C: Verify IS NOT NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL ORDER BY id;
+
+-- Test D: Verify field-level null evaluation inside a valid struct
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND
st_prim.x IS NULL;
+
+-- Validate withou vectorization
+SET hive.vectorized.execution.enabled=true;
+SET hive.vectorized.execution.reduce.enabled=false;
+SELECT * FROM test_parquet_struct_nulls ORDER BY id;
Review Comment:
```
-- SORT_QUERY_RESULTS
```
##########
ql/src/test/queries/clientpositive/parquet_struct_with_null_vectorization.q:
##########
@@ -0,0 +1,31 @@
+SET hive.vectorized.execution.enabled=true;
+set hive.vectorized.execution.reduce.enabled=true;
+SET hive.fetch.task.conversion=none;
+
+CREATE TABLE test_parquet_struct_nulls (
+ id INT,
+ st_prim STRUCT<x:INT, y:INT>
+) STORED AS PARQUET;
+
+INSERT INTO test_parquet_struct_nulls VALUES
+ (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))),
+ (2, if(1=0, named_struct('x', 1, 'y', 1), null)),
+ (3, named_struct('x', 3, 'y', CAST(NULL AS INT))),
+ (4, named_struct('x', 4, 'y', 4));
+
+-- Test A: Full table scan to check JSON representation
+SELECT * FROM test_parquet_struct_nulls ORDER BY id;
+
+-- Test B: Verify IS NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL;
+
+-- Test C: Verify IS NOT NULL evaluates correctly
+SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL ORDER BY id;
Review Comment:
instead of ORDER BY, what about:
```
-- SORT_QUERY_RESULTS
```
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java:
##########
@@ -64,6 +64,8 @@ public void readBatch(
int total,
ColumnVector column,
TypeInfo columnType) throws IOException {
+ this.currentDefLevels = new int[total];
+ this.defLevelIndex = 0;
Review Comment:
can you elaborate on this change? I can see that passing the definition
levels on to the struct reader solves the current issue, but this looks like
it also fixes another bug; is that the case?
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java:
##########
Review Comment:
does the same problem apply to other complex types? it would be better to have
more test cases and maybe even rename the qfile to `parquet_complex_...`, maybe
`parquet_complex_types_null_vectorization.q`, to clearly distinguish it from the
already existing `parquet_complex_types_vectorization.q`
e.g.:
LIST:
```
NULL vs. [null, null] or [1, null]
```
or MAP ("same" as struct but not fixed schema)
```
NULL vs. { "a": 1, "b": null } and { "a": 1, "b": null, "c": null }
```
or even a nested struct to validate the logic on deeper recursive call paths:
```
CREATE TABLE test_parquet_nested_struct_nulls (
id INT,
st_prim STRUCT<x:INT, y:INT>,
st_nested STRUCT<x:INT, y:STRUCT<v:INT, w:INT>>
) STORED AS PARQUET;
```
and nested data can contain NULLs at different levels, where your patch is
actually hit, I assume, e.g.:
```
NULL
{x: 1, y: NULL}
{x: 1, y: {v: 2, w: NULL}}
{x: 1, y: {v: 2, w: 3}}
```
I would appreciate it a lot if this patch could validate all of these
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]