notfilippo opened a new issue, #41993:
URL: https://github.com/apache/arrow/issues/41993

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I ran into this bug while working with the results returned from
[datafusion](https://github.com/apache/datafusion). It seems that in certain
scenarios, multiple arrays from different record batches that represent the
same column can share a single value buffer. The problem arises when writing
those records with an `ipc.Writer`: if one of the columns backed by a shared
buffer is of a binary-like type, the written data is invalid.
   
   This happens because each array's value buffer [gets truncated to only the 
part referred to by its 
offsets](https://github.com/apache/arrow/blob/9ee6ea701e20d1b47934f977d87811624061d597/go/arrow/ipc/writer.go#L592),
but the offsets are never updated, so they can end up pointing to memory 
outside the truncated values. Reading the binary-like array back via the 
`ipc.Reader` then fails with the error `string offsets out of bounds of data 
buffer`.
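
   The mismatch can be illustrated without Arrow at all. The following is a
minimal sketch using only the standard library; `truncateValues`,
`rebaseOffsets`, and `checkOffsets` are hypothetical helpers written for this
illustration, not functions from the Arrow Go module. The sketch assumes the
writer keeps only the bytes between the first and last offset, as described
above, and shows that shifting the offsets by the first offset would make them
consistent with the truncated buffer again.

```go
package main

import (
	"errors"
	"fmt"
)

// truncateValues mimics the behaviour described above: it keeps only the
// bytes the offsets refer to and leaves the offsets themselves untouched.
// It assumes offsets holds at least one entry.
func truncateValues(values []byte, offsets []int32) []byte {
	return values[offsets[0]:offsets[len(offsets)-1]]
}

// rebaseOffsets returns a copy of offsets shifted so that the first one is 0,
// which matches a value buffer produced by truncateValues.
func rebaseOffsets(offsets []int32) []int32 {
	out := make([]int32, len(offsets))
	for i, o := range offsets {
		out[i] = o - offsets[0]
	}
	return out
}

// checkOffsets reports an error if any offset points past the end of values.
// The message is borrowed from the error quoted in this issue.
func checkOffsets(values []byte, offsets []int32) error {
	for _, o := range offsets {
		if int(o) > len(values) {
			return errors.New("string offsets out of bounds of data buffer")
		}
	}
	return nil
}

func main() {
	values := []byte("applepearbanana")
	offsets := []int32{5, 9, 15} // only "pear" and "banana"

	// After truncation the buffer holds "pearbanana" (10 bytes).
	truncated := truncateValues(values, offsets)

	// Original offsets: 15 > 10, so the check fails.
	fmt.Println(checkOffsets(truncated, offsets))

	// Rebased offsets {0, 4, 10} fit the truncated buffer.
	fmt.Println(checkOffsets(truncated, rebaseOffsets(offsets)))
}
```

   Whether the right fix is to rebase the offsets or to avoid truncating the
buffer is a design decision for the writer; the sketch only shows why the two
must agree.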
   
   How to reproduce:
   
   ```go
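   package main

   // Assumption: the v16 module path is used here; adjust the major
   // version to match the Arrow Go release in your go.mod.
   import (
        "bytes"
        "fmt"
        "log"

        "github.com/apache/arrow/go/v16/arrow"
        "github.com/apache/arrow/go/v16/arrow/array"
        "github.com/apache/arrow/go/v16/arrow/bitutil"
        "github.com/apache/arrow/go/v16/arrow/ipc"
        "github.com/apache/arrow/go/v16/arrow/memory"
   )

   func main() {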
        var buf bytes.Buffer
        buf.WriteString("apple")
        buf.WriteString("pear")
        buf.WriteString("banana")
        values := buf.Bytes()
   
        offsets := []int32{5, 9, 15} // <-- only "pear" and "banana"
        voffsets := arrow.Int32Traits.CastToBytes(offsets)
   
        validity := []byte{0}
        bitutil.SetBit(validity, 0)
        bitutil.SetBit(validity, 1)
   
        data := array.NewData(
                arrow.BinaryTypes.String,
                2, // <-- only "pear" and "banana"
                []*memory.Buffer{
                        memory.NewBufferBytes(validity),
                        memory.NewBufferBytes(voffsets),
                        memory.NewBufferBytes(values),
                },
                nil,
                0,
                0,
        )
   
        str := array.NewStringData(data)
        fmt.Println(str) // outputs: ["pear" "banana"]
   
        schema := arrow.NewSchema([]arrow.Field{
                {
                        Name:     "string",
                        Type:     arrow.BinaryTypes.String,
                        Nullable: true,
                },
        }, nil)
        record := array.NewRecord(schema, []arrow.Array{str}, 2)
   
        var output bytes.Buffer
        writer := ipc.NewWriter(&output, ipc.WithSchema(schema))
   
        err := writer.Write(record)
        if err != nil {
                log.Fatal(err)
        }
   
        err = writer.Close()
        if err != nil {
                log.Fatal(err) 
        }
   
        reader, err := ipc.NewReader(bytes.NewReader(output.Bytes()),
                ipc.WithSchema(schema))
        if err != nil {
                log.Fatal(err)
        }
   
        reader.Next()
        if reader.Err() != nil {
                // fails with: string offsets out of bounds of data buffer
                log.Fatal(reader.Err())
        }
   }
   
   ```
   
   ### Component(s)
   
   Go

