qzyu999 opened a new issue, #839:
URL: https://github.com/apache/arrow-go/issues/839
### Describe the bug, including details regarding any error messages, version, and platform.
# Summary:
The `valuesize()` function in `parquet/variant/utils.go` checks `(typeinfo >> 4) & 0x1` to determine the `is_large` flag for both objects and arrays. This is correct for objects but incorrect for arrays.
According to the Parquet Variant spec, the array layout places the `is_large` flag at bit position 2 of the value header, not bit 4.
# Root Cause Analysis
The specification defines different header layouts to optimize space for
objects vs. arrays:
## Object value_header (6 bits):
```text
Bit Position:  [ 5 ]   [ 4 ]     [ 3 2 ]      [ 1 0 ]
Data Stored:   Unused  is_large  field_id_sz  offset_sz
                         ▲
              (Correctly checks Bit 4)
```
## Array value_header (6 bits):
```text
Bit Position:  [ 5 4 3 ]  [ 2 ]     [ 1 0 ]
Data Stored:   Unused     is_large  offset_sz
                            ▲
                   (Should check Bit 2!)
```
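The two layouts can be compared with a minimal sketch. The helper names `objectIsLarge` and `arrayIsLarge` are hypothetical and are not part of arrow-go. Both take the 6-bit value header, meaning the first value byte already shifted right past the 2 basic-type bits.

```go
package main

import "fmt"

// Hypothetical helpers for illustration only; not arrow-go APIs.
// valueHeader is the upper 6 bits of the first value byte.

// objectIsLarge reads is_large from bit 4 of an object value_header.
func objectIsLarge(valueHeader byte) bool {
	return (valueHeader>>4)&0x1 == 1
}

// arrayIsLarge reads is_large from bit 2 of an array value_header.
func arrayIsLarge(valueHeader byte) bool {
	return (valueHeader>>2)&0x1 == 1
}

func main() {
	// Array value_header with is_large = 1 and offset_sz bits = 01.
	const largeArray byte = 0b000101

	fmt.Println(arrayIsLarge(largeArray))  // true: bit 2 is set
	fmt.Println(objectIsLarge(largeArray)) // false: bit 4 is clear
}
```

Applying the object check to an array header therefore misses the flag whenever bit 2 is set and bit 4 is clear.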
# Evidence
## The Bug (parquet/variant/utils.go):
```go
case byte(basicarray):
	var szbytes uint8 = 1
	if ((typeinfo >> 4) & 0x1) != 0 { // ❌ Error: checks bit 4 instead of bit 2
		szbytes = 4
	}
```
## The Correct Implementation (parquet/variant/variant.go):
```go
case basicarray:
	valuehdr := (v.value[0] >> basictypebits)
	fieldoffsetsz := (valuehdr & 0b11) + 1
	islarge := ((valuehdr >> 2) & 0b1) == 1 // Correct: checks bit 2
```
# Impact
This causes `valuesize()` to return an incorrect size for arrays with `is_large = true`, where the element count is stored in 4 bytes instead of 1. This leads directly to silent data corruption or panics during writes and compactions, specifically when `FinishObject()` compacts duplicate keys whose values happen to be large arrays.
# Suggested Fix
Update the `basicarray` case in `parquet/variant/utils.go` to shift by 2
instead of 4:
```go
case byte(basicarray):
	var szbytes uint8 = 1
	if ((typeinfo >> 2) & 0x1) != 0 { // Fix: shift by 2 for arrays
		szbytes = 4
	}
```
### Component(s)
Parquet