[I] "error encountered during schema visitor: arrow/array: index out of range" while reading S3 table metadata [iceberg-go]

via GitHub Sun, 24 Aug 2025 12:40:22 -0700


aslatter opened a new issue, #547:
URL: https://github.com/apache/iceberg-go/issues/547


   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   [schema.json](https://github.com/user-attachments/files/21958694/schema.json)
   
   
[parquet_data.zip](https://github.com/user-attachments/files/21958704/parquet_data.zip)
   
   I don't understand the bug, so I'll describe what I was doing to run into 
it, and I'll also show my attempt at a minimal repro.
   
   I had enabled "S3 Table Metadata", which stores metadata about an S3 bucket 
in an Iceberg table. Whenever I scanned that table using iceberg-go, I got 
"error encountered during schema visitor: arrow/array: index out of range" 
errors.
   
   These errors go away if I apply the following patch to `arrow-go`:
   
   ```patch
   diff --git a/arrow/array/struct.go b/arrow/array/struct.go
   index 6883712c..4f943ac6 100644
   --- a/arrow/array/struct.go
   +++ b/arrow/array/struct.go
   @@ -192,7 +192,7 @@ func (a *Struct) setData(data *Data) {
           a.fields = make([]arrow.Array, len(data.childData))
           for i, child := range data.childData {
                   if data.offset != 0 || child.Len() != data.length {
   -                       sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))
   +                       sub := NewSliceData(child, int64(data.offset), int64(data.offset+child.Len()))
                           a.fields[i] = MakeFromData(sub)
                           sub.Release()
                   } else {
   ```
   
   I'm just banging rocks together - I have no idea if this patch is a good 
idea or not.
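
To make the effect of the one-line change easier to see, here is a small stand-alone model of the slice bounds involved. It does not call arrow-go: `sliceBounds` is a hypothetical stand-in for the range check performed when slicing `[i, j)` out of array data, and the lengths (a parent of 1024 rows, a child of 1000 rows) are made-up numbers, not values taken from the attached files. It assumes the failure occurs when a child array is shorter than its parent struct, which is one reading of why the patch helps.

```go
package main

import (
	"errors"
	"fmt"
)

// errIndexOutOfRange reuses the message text quoted in this issue.
var errIndexOutOfRange = errors.New("arrow/array: index out of range")

// sliceBounds is a hypothetical model of a bounds check for slicing
// the half-open range [i, j) out of data holding `length` elements.
func sliceBounds(length, i, j int) error {
	if i < 0 || i > j || j > length {
		return errIndexOutOfRange
	}
	return nil
}

func main() {
	// Assumed, illustrative sizes: the child is shorter than the parent.
	parentOffset, parentLen := 0, 1024
	childLen := 1000

	// Unpatched end index: offset + parent length (1024 > 1000).
	fmt.Println(sliceBounds(childLen, parentOffset, parentOffset+parentLen))
	// prints: arrow/array: index out of range

	// Patched end index: offset + child length (1000 <= 1000).
	fmt.Println(sliceBounds(childLen, parentOffset, parentOffset+childLen))
	// prints: <nil>
}
```

Under this model the patched end index only stays in range while the parent offset is zero: with a non-zero offset, `offset+childLen` exceeds the child's length again, so the patch may hide the symptom rather than fix the cause.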
   
   In an attempt to produce a minimal repro, I grabbed the iceberg schema and 
raw parquet data from the table, and loaded them with the following go program:
   
   <details>
     <summary>Click to expand</summary>
   
   ```go
   package main
   
   import (
        "context"
        "encoding/json"
        "flag"
        "fmt"
        "os"
   
        "github.com/apache/iceberg-go"
        "github.com/apache/iceberg-go/table"
   
        "github.com/apache/arrow-go/v18/arrow/memory"
        "github.com/apache/arrow-go/v18/parquet"
        pfile "github.com/apache/arrow-go/v18/parquet/file"
        "github.com/apache/arrow-go/v18/parquet/pqarrow"
   )
   
   func main() {
        if err := mainErr(); err != nil {
                fmt.Fprintln(os.Stderr, "error:", err)
                os.Exit(1)
        }
   }
   
   func mainErr() error {
        var parquetfile string
        var icebergSchemaFile string
        flag.StringVar(&parquetfile, "parquet-file", "", "parquet file to load")
        flag.StringVar(&icebergSchemaFile, "iceberg-schema-file", "", "iceberg schema to use to interpret parquet file")
        flag.Parse()
   
        var schema iceberg.Schema
   
        schemaBytes, err := os.ReadFile(icebergSchemaFile)
        if err != nil {
                return fmt.Errorf("reading iceberg schema file: %s", err)
        }
   
        err = json.Unmarshal(schemaBytes, &schema)
        if err != nil {
                return fmt.Errorf("unmarshalling iceberg schema: %s", err)
        }
   
        parquetFile, err := os.Open(parquetfile)
        if err != nil {
                return fmt.Errorf("opening parquet file: %s", err)
        }
   
        ctx := context.Background()
   
        return dumpParquetWithSchema(ctx, parquetFile, &schema)
   }
   
   func dumpParquetWithSchema(ctx context.Context, file parquet.ReaderAtSeeker, schema *iceberg.Schema) error {
   
        pr, err := pfile.NewParquetReader(file)
        if err != nil {
                return fmt.Errorf("creating parquet reader: %s", err)
        }
        defer pr.Close()
   
        arrowReader, err := pqarrow.NewFileReader(pr, pqarrow.ArrowReadProperties{
                BatchSize: 1024,
        }, memory.DefaultAllocator)
        if err != nil {
                return fmt.Errorf("creating pqarrow file reader: %s", err)
        }
   
        ar, err := arrowReader.GetRecordReader(ctx, nil, nil)
        if err != nil {
                return fmt.Errorf("creating arrow reader: %s", err)
        }
        defer ar.Release()
   
        arrowFileSchema := ar.Schema()
        fileSchema, err := table.ArrowSchemaToIceberg(arrowFileSchema, false, nil)
        if err != nil {
                return fmt.Errorf("convert parquet file schema to iceberg schema: %s", err)
        }
   
        for ar.Next() {
                rec := ar.Record()
                newRec, err := table.ToRequestedSchema(ctx, schema, fileSchema, rec, false, false, false)
                if err != nil {
                        return fmt.Errorf("table to requested schema: %s", err)
                }
                err = json.NewEncoder(os.Stdout).Encode(newRec)
                if err != nil {
                        return fmt.Errorf("encoding arrow rec as json: %s", err)
                }
                rec.Release()
        }
   
        return nil
   }
   ```
   
   </details>
   
   It fails with the same error as the table-scan described above, and the same 
patch to arrow-go makes the error go away.
   
   However, if I do a raw dump of the parquet file with arrow-go without 
applying the Iceberg schema to it, I don't get any errors.
   
   

