tosinva-stripe opened a new issue, #44806:
URL: https://github.com/apache/arrow/issues/44806

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   LargeBinary and LargeString use int64 offsets, whereas Binary and String use int32 offsets. This makes Binary and String susceptible to "slice bounds out of range" errors when the column/array holds more than roughly 2 GB (2^31 bytes) of data.
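
The wrap-around behind a negative slice bound can be sketched with plain Go integers. This is only an illustration of int32 offset overflow, not the library's code: `endOffset32`, `fitsInt32Offsets`, and the start/length values are hypothetical, chosen so that the wrapped result happens to equal the index shown in the panic in this report.

```go
package main

import (
	"fmt"
	"math"
)

// endOffset32 mimics recording the end offset of a value in a 32-bit offsets
// buffer. The sum is computed in int64 and then narrowed to int32, which wraps
// around once the total passes 2^31 - 1.
// Hypothetical helper for illustration; not part of the Arrow Go API.
func endOffset32(start, length int64) int32 {
	return int32(start + length)
}

// fitsInt32Offsets reports whether totalBytes of value data can still be
// addressed with int32 offsets.
func fitsInt32Offsets(totalBytes int64) bool {
	return totalBytes >= 0 && totalBytes <= math.MaxInt32
}

func main() {
	// Assumed numbers: 2^31 - 100 bytes already stored, next value is 734 bytes.
	start := int64(1)<<31 - 100
	length := int64(734)

	fmt.Println(endOffset32(start, length))       // -2147483014
	fmt.Println(fitsInt32Offsets(start + length)) // false
}
```

Using such a negative offset as a slice bound produces the same kind of "slice bounds out of range" panic; with int64 offsets the sum stays positive.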
   
   To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.
   
   A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary:
   - Explicitly store the Arrow schema during the write. See `store_schema`: https://arrow.apache.org/docs/cpp/parquet.html#roundtripping-arrow-types-and-schema
   - Explicitly use the [large_binary](https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv412large_binaryv) or large_string type when defining the schema that is used to write the Parquet files.
   
   The error looks like:
   ```go
   panic: runtime error: slice bounds out of range [:-2147483014]
   
   goroutine 95 [running]:
   github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
        /go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
   github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
        /go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
   extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
        /.../data/records.go:78 +0x582
   main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
        /.../command.go:428 +0x125
   created by main.RunValidateCmd in goroutine 1
        /.../command.go:174 +0xb90
   ```
   
   
   Version and platform:
   ```
   Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
   Platform: Linux 20.04.1-Ubuntu  x86_64 x86_64 x86_64 GNU/Linux
   ```
   
   
   
   
   ### Component(s)
   
   Go

