daniel-adam-tfs opened a new issue, #472:
URL: https://github.com/apache/arrow-go/issues/472

   ### Describe the enhancement requested
   
   
   We’re migrating some data from an internal format to Parquet, and while 
profiling our new code I discovered a major source of allocations: the 
conversion of a `[]T` to an `interface{}`. 
   
   In our data some columns are RLE-encoded, and for our data the 
`DictionaryConverter` methods get called a lot. (I’ve tested this with a 
Parquet file with 10 000 000 rows overall; for example, the `Fill` method gets 
called 60 000+ times on one column and 100 000+ times on another.) The 
interface methods all use `interface{}` parameters, and converting a typed 
slice to an `interface{}` results in a heap allocation. Overall this causes 
significant overhead.
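   The allocation described above is easy to reproduce outside arrow-go. A minimal, self-contained sketch (the `fill`/`fillTyped` functions are hypothetical stand-ins, not arrow-go code) comparing an `interface{}`-taking method with a typed one:

   ```go
   package main

   import (
   	"fmt"
   	"testing"
   )

   // fill mimics an interface-based converter method: the []int32 argument
   // is converted to interface{} at the call site, which heap-allocates a
   // copy of the slice header on every call.
   //
   //go:noinline
   func fill(out interface{}, v int32) {
   	s := out.([]int32)
   	for i := range s {
   		s[i] = v
   	}
   }

   // fillTyped is the typed (generic-style) equivalent: no conversion,
   // no per-call allocation.
   //
   //go:noinline
   func fillTyped(out []int32, v int32) {
   	for i := range out {
   		out[i] = v
   	}
   }

   func main() {
   	buf := make([]int32, 64)
   	perCall := testing.AllocsPerRun(1000, func() { fill(buf, 7) })
   	perCallTyped := testing.AllocsPerRun(1000, func() { fillTyped(buf, 7) })
   	fmt.Printf("interface{}: %.0f allocs/call, typed: %.0f allocs/call\n",
   		perCall, perCallTyped)
   }
   ```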
   
   Using generics for the interface avoids these allocations. In my fork 
(https://github.com/daniel-adam-tfs/arrow-go) I tested some modifications:
   
   * I’ve created a generic interface:
   ```go
   type DictionaryConverter2[T int32 | int64 | float32 | float64] interface {
        Copy2([]T, []IndexType) error
        Fill2([]T, IndexType) error
        FillZero2([]T)
        IsValid(...IndexType) bool
        IsValidSingle(IndexType) bool
   }
   ```
   * I’ve added `IsValidSingle` because passing a single value to a variadic 
parameter makes the runtime allocate a slice on the heap, and this method 
again gets called a lot for our data.
   * I’ve added the methods from `DictionaryConverter2` to `dictConverter[T]`, 
then simply cast the `DictionaryConverter` to `DictionaryConverter2` before 
use and invoked the new methods instead of the old ones.
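   To make the cast-before-use step concrete, here is a minimal sketch of the approach under stated assumptions: `IndexType`, the simplified `dictConverter`, and `decodeRun` are stand-ins for illustration, not the actual arrow-go definitions.

   ```go
   package main

   import "fmt"

   // IndexType stands in for parquet's dictionary index type
   // (an assumption for this sketch).
   type IndexType = int32

   // DictionaryConverter2 mirrors the generic interface proposed above
   // (subset shown for brevity).
   type DictionaryConverter2[T int32 | int64 | float32 | float64] interface {
   	Fill2([]T, IndexType) error
   	FillZero2([]T)
   	IsValidSingle(IndexType) bool
   }

   // dictConverter is a simplified stand-in for the existing concrete type.
   type dictConverter[T int32 | int64 | float32 | float64] struct {
   	dict []T
   }

   func (d *dictConverter[T]) Fill2(out []T, idx IndexType) error {
   	if !d.IsValidSingle(idx) {
   		return fmt.Errorf("index %d out of range", idx)
   	}
   	v := d.dict[idx]
   	for i := range out {
   		out[i] = v
   	}
   	return nil
   }

   func (d *dictConverter[T]) FillZero2(out []T) {
   	var zero T
   	for i := range out {
   		out[i] = zero
   	}
   }

   func (d *dictConverter[T]) IsValidSingle(idx IndexType) bool {
   	return idx >= 0 && int(idx) < len(d.dict)
   }

   // decodeRun shows the cast at a use site: a value held behind the old
   // non-generic interface is type-asserted to the typed interface once,
   // then the typed methods run with no per-call boxing of the slice.
   func decodeRun[T int32 | int64 | float32 | float64](c any, out []T, idx IndexType) error {
   	tc, ok := c.(DictionaryConverter2[T])
   	if !ok {
   		return fmt.Errorf("converter does not implement DictionaryConverter2")
   	}
   	return tc.Fill2(out, idx)
   }

   func main() {
   	var c any = &dictConverter[int32]{dict: []int32{10, 20, 30}}
   	out := make([]int32, 4)
   	if err := decodeRun(c, out, 1); err != nil {
   		panic(err)
   	}
   	fmt.Println(out) // [20 20 20 20]
   }
   ```

   The assertion happens once per decode call rather than once per slice argument, which is what removes the bulk of the allocations.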
   
   I used PyArrow to generate a Parquet file with 10 000 000 rows, at most 
1000 unique values, and each value repeating in runs of 1–50. Here are the 
results:
   
   Implementation using DictionaryConverter:
   ```
   % benchstat testdata/BenchmarkInt32Column.txt 
   goos: darwin
   goarch: amd64
   pkg: github.com/apache/arrow-go/v18/parquet/file
   cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
                 │ testdata/BenchmarkInt32Column.txt │
                 │              sec/op               │
   Int32Column-8                         86.50m ± 6%
   
                 │ testdata/BenchmarkInt32Column.txt │
                 │               B/op                │
   Int32Column-8                        12.88Mi ± 0%
   
                 │ testdata/BenchmarkInt32Column.txt │
                 │             allocs/op             │
   Int32Column-8                         712.9k ± 0%
   ```
   
   Implementation using DictionaryConverter2:
   ```
   % benchstat testdata/BenchmarkInt32Column.new.txt  
   goos: darwin
   goarch: amd64
   pkg: github.com/apache/arrow-go/v18/parquet/file
   cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
                 │ testdata/BenchmarkInt32Column.new.txt │
                 │                sec/op                 │
   Int32Column-8                             61.27m ± 7%
   
                 │ testdata/BenchmarkInt32Column.new.txt │
                 │                 B/op                  │
   Int32Column-8                            2.932Mi ± 0%
   
                 │ testdata/BenchmarkInt32Column.new.txt │
                 │               allocs/op               │
   Int32Column-8                             1.301k ± 0%
   ```
   
   Comparison by benchstat:
   ```
    % benchstat testdata/BenchmarkInt32Column.txt testdata/BenchmarkInt32Column.new.txt
    goos: darwin
    goarch: amd64
    pkg: github.com/apache/arrow-go/v18/parquet/file
    cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
                  │ testdata/BenchmarkInt32Column.txt │ testdata/BenchmarkInt32Column.new.txt │
                  │              sec/op               │    sec/op     vs base                 │
    Int32Column-8                         86.50m ± 6%   61.27m ± 7%  -29.16% (p=0.000 n=8)

                  │ testdata/BenchmarkInt32Column.txt │ testdata/BenchmarkInt32Column.new.txt │
                  │               B/op                │     B/op      vs base                 │
    Int32Column-8                       12.878Mi ± 0%   2.932Mi ± 0%  -77.23% (p=0.000 n=8)

                  │ testdata/BenchmarkInt32Column.txt │ testdata/BenchmarkInt32Column.new.txt │
                  │             allocs/op             │  allocs/op    vs base                 │
    Int32Column-8                       712.870k ± 0%    1.301k ± 0%  -99.82% (p=0.000 n=8)
   ```
   
   So it eliminated 99%+ of the allocations for this use case, which is pretty 
good, I’d say. 
   
   I can take a stab at it, but rewriting `DictionaryConverter` to 
`DictionaryConverter[T]` might end up being a significant amount of code 
changes, so I’d like to know what the maintainers think about it.
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
