ayushtkn commented on code in PR #6408: URL: https://github.com/apache/hive/pull/6408#discussion_r3052846212
##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java:
##########

Review Comment:
The struct value turning into `NULL` when its sub-fields are `NULL` was due to specific code in `VectorizedStructColumnReader`:
https://github.com/apache/hive/blob/13f3208c01fec2d20108302efc3fd033d1d76a19/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedStructColumnReader.java#L50-L55

It checked whether all fields were `null` and, if so, set the struct itself to `null`. I have fixed that.

I checked LIST and MAP. They have a different bug :-)

In LIST and MAP, if you insert a `NULL` element, it is read back as `0`, e.g.:

```
CREATE TABLE test_parquet_array_nulls (
  id INT,
  arr_prim ARRAY<INT>
) STORED AS PARQUET;

INSERT INTO test_parquet_array_nulls VALUES
  -- 1: Array exists, but all elements inside are NULL
  (1, array(CAST(NULL AS INT), CAST(NULL AS INT))),

  -- 2: The Array itself is strictly NULL
  (2, if(1=0, array(1, 2), null)),

  -- 3: Array exists, containing a mix of valid and NULL elements
  (3, array(3, CAST(NULL AS INT))),

  -- 4: Array exists, all elements are valid
  (4, array(4, 5));

SELECT * FROM test_parquet_array_nulls ORDER BY id;
```

It outputs:

```
+------------------------------+------------------------------------+
| test_parquet_array_nulls.id  | test_parquet_array_nulls.arr_prim  |
+------------------------------+------------------------------------+
| 1                            | [0,0]                              |
| 2                            | NULL                               |
| 3                            | [3,0]                              |
| 4                            | [4,5]                              |
+------------------------------+------------------------------------+
```

Disabling vectorization gives the correct output:

```
+------------------------------+------------------------------------+
| test_parquet_array_nulls.id  | test_parquet_array_nulls.arr_prim  |
+------------------------------+------------------------------------+
| 1                            | [null,null]                        |
| 2                            | NULL                               |
| 3                            | [3,null]                           |
| 4                            | [4,5]                              |
+------------------------------+------------------------------------+
```

This looks like a different bug, where every `NULL` element is treated as `0`.
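A minimal sketch of how a `NULL` element could surface as `0`, assuming the vectorized path keeps element values and a separate null mask, and that some code reads the value without consulting the mask. `NullMaskSketch` and its methods are hypothetical and are not Hive's actual classes; the real cause has not been debugged yet.

```java
import java.util.StringJoiner;

/**
 * Hypothetical, simplified stand-in for a vectorized integer column.
 * It is NOT Hive's column vector class; it only mirrors the general
 * "values array plus null mask" layout to illustrate the symptom.
 */
final class NullMaskSketch {
    final long[] vector;    // raw values; a NULL slot still holds the default 0
    final boolean[] isNull; // true where the element is NULL

    NullMaskSketch(long[] vector, boolean[] isNull) {
        this.vector = vector;
        this.isNull = isNull;
    }

    /** Suspected buggy behaviour: reads values only, so NULL slots show as 0. */
    String renderIgnoringMask() {
        StringJoiner out = new StringJoiner(",", "[", "]");
        for (long v : vector) {
            out.add(Long.toString(v));
        }
        return out.toString();
    }

    /** Expected behaviour: consult the null mask before reading each value. */
    String renderWithMask() {
        StringJoiner out = new StringJoiner(",", "[", "]");
        for (int i = 0; i < vector.length; i++) {
            out.add(isNull[i] ? "null" : Long.toString(vector[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Row 3 from the example above: array(3, NULL)
        NullMaskSketch row3 =
            new NullMaskSketch(new long[] {3, 0}, new boolean[] {false, true});
        System.out.println(row3.renderIgnoringMask()); // [3,0]
        System.out.println(row3.renderWithMask());     // [3,null]
    }
}
```

The two renderings reproduce the vectorized and non-vectorized outputs for row 3, which is why a skipped or inverted null check is a plausible first place to look.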
Would it be OK to chase this in a separate ticket? I believe that somewhere it is returning the default `int` value instead of `NULL`; some check is wrong, which I still have to debug.

Regarding nested types, vectorization is disabled for them, so that path doesn't kick in:
https://issues.apache.org/jira/browse/HIVE-19016

For map we already have a test:
https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_map_null_vectorization.q

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
