This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 6ee7f7ecdc GH-48481: [Ruby] Correctly infer types for nested integer arrays (#48699)
6ee7f7ecdc is described below

commit 6ee7f7ecdc4b79b7882cdc8f47828964e3ba6a6f
Author: hypsakata <[email protected]>
AuthorDate: Sat Jan 3 09:58:34 2026 +0900

    GH-48481: [Ruby] Correctly infer types for nested integer arrays (#48699)
    
    ### Rationale for this change
    
    When an `Arrow::Table` is built from a Ruby Hash passed to `Arrow::Table.new`, nested Integer arrays are incorrectly inferred as `list<uint8>` or `list<int8>` regardless of the values they contain. Nested integer arrays should instead be inferred as the appropriate list type (e.g., `list<int64>`, `list<uint64>`) based on their values, as flat arrays already are, unless a value falls outside the range of every integer type.
    
    ### What changes are included in this PR?
    
    This PR modifies the logic in `detect_builder_info()` to fix the inference issue. Specifically:
    
    - **Persist `sub_builder_info` across sub-array elements**: Previously, `sub_builder_info` was recreated for each sub-array element in the Array. The logic now accumulates and carries the builder information across elements so that the type is inferred from the entire list.
    - **Refactor Integer builder logic**: Following the pattern used for `BigDecimal`, the logic that determines the Integer builder has been moved to `create_builder()`. `detect_builder_info()` now calls this function.
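    
    The two points above can be illustrated with a minimal, Arrow-free sketch. It is not the code in this PR: `accumulate`, `element_type`, `infer_list_type` and the range tables are hypothetical names chosen for illustration, and it assumes a single level of nesting. It only shows the idea of carrying `min`/`max` across every sub-array before picking the narrowest type.
    
    ```ruby
    # Hypothetical sketch only; red-arrow itself is not used here.
    # Signed ranges, narrowest first.
    INT_RANGES = {
      int8:  (-2**7..2**7 - 1),
      int16: (-2**15..2**15 - 1),
      int32: (-2**31..2**31 - 1),
      int64: (-2**63..2**63 - 1),
    }
    # Unsigned upper bounds, narrowest first.
    UINT_MAXES = {
      uint8:  2**8 - 1,
      uint16: 2**16 - 1,
      uint32: 2**32 - 1,
      uint64: 2**64 - 1,
    }
    
    # Fold one value (or a nested array of values) into the running info.
    # The same info hash is threaded through all sub-arrays, which is the
    # "persist across sub-array elements" idea described above.
    def accumulate(value, info = nil)
      case value
      when Integer
        info ||= {}
        min = [info[:min] || value, value].min
        max = [info[:max] || value, value].max
        # Once any negative value is seen, the type stays signed.
        type = (info[:type] == :int || value < 0) ? :int : :uint
        {type: type, min: min, max: max}
      when Array
        value.each { |sub_value| info = accumulate(sub_value, info) }
        info
      else
        raise ArgumentError, "unsupported value: #{value.inspect}"
      end
    end
    
    # Choose the narrowest type that covers [min, max]; fall back to :string.
    def element_type(info)
      if info[:type] == :int
        found = INT_RANGES.find do |_name, range|
          range.cover?(info[:min]) && range.cover?(info[:max])
        end
      else
        found = UINT_MAXES.find { |_name, max| info[:max] <= max }
      end
      found ? found.first : :string
    end
    
    def infer_list_type(values)
      info = accumulate(values)
      info ? "list<#{element_type(info)}>" : nil
    end
    ```
    
    In this sketch, `infer_list_type([[0, 300], [-1]])` evaluates to `"list<int16>"`: the `300` in the first sub-array and the `-1` in the second both contribute to the final range, which is what recreating the info per sub-array would miss.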
    
    **Note:**
    
    - As a side effect of this refactoring, nested lists of `BigDecimal` (which were previously inferred as `string`) may now have their types inferred. However, comprehensive testing and verification of nested `BigDecimal` support will be addressed in a separate issue to keep this PR focused.
    - We stopped using `IntArrayBuilder` for inference logic to ensure correctness. This adds performance overhead (array building is approximately 2x slower in the benchmark below) because we can no longer rely on the specialized builder's detection.
    
    ```text
                                               user     system      total        real
        array_builder int32 100000         0.085867   0.000194   0.086061 (  0.086369)
    int_array_builder int32 100000         0.042163   0.001033   0.043196 (  0.043268)
        array_builder int64 100000         0.086799   0.000015   0.086814 (  0.086828)
    int_array_builder int64 100000         0.044493   0.000973   0.045466 (  0.045469)
        array_builder uint32 100000        0.085748   0.000009   0.085757 (  0.085768)
    int_array_builder uint32 100000        0.044463   0.001034   0.045497 (  0.045498)
        array_builder uint64 100000        0.084548   0.000987   0.085535 (  0.085537)
    int_array_builder uint64 100000        0.044206   0.000017   0.044223 (  0.044225)
    ```
    
    ### Are these changes tested?
    
    Yes. Run `ruby ruby/red-arrow/test/run-test.rb`.
    
    ### Are there any user-facing changes?
    
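    Yes. Nested integer arrays may now be inferred as wider list types (e.g., `list<int64>`, `list<uint64>`) instead of always `list<uint8>` or `list<int8>`.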
    
    * GitHub Issue: #48481
    
    Authored-by: hypsakata <[email protected]>
    Signed-off-by: Sutou Kouhei <[email protected]>
---
 ruby/red-arrow/lib/arrow/array-builder.rb |  53 +++-
 ruby/red-arrow/test/test-array-builder.rb | 434 +++++++++++++++++++++++++++---
 2 files changed, 443 insertions(+), 44 deletions(-)

diff --git a/ruby/red-arrow/lib/arrow/array-builder.rb b/ruby/red-arrow/lib/arrow/array-builder.rb
index 2ccf50f3c1..5bb1ee7456 100644
--- a/ruby/red-arrow/lib/arrow/array-builder.rb
+++ b/ruby/red-arrow/lib/arrow/array-builder.rb
@@ -74,14 +74,23 @@ module Arrow
             detected: true,
           }
         when Integer
-          if value < 0
+          builder_info ||= {}
+          min = builder_info[:min] || value
+          max = builder_info[:max] || value
+          min = value if value < min
+          max = value if value > max
+
+          if builder_info[:builder_type] == :int || value < 0
             {
-              builder: IntArrayBuilder.new,
-              detected: true,
+              builder_type: :int,
+              min: min,
+              max: max,
             }
           else
             {
-              builder: UIntArrayBuilder.new,
+              builder_type: :uint,
+              min: min,
+              max: max,
             }
           end
         when Time
@@ -150,18 +159,19 @@ module Arrow
             end
           end
         when ::Array
-          sub_builder_info = nil
+          sub_builder_info = builder_info && builder_info[:value_builder_info]
           value.each do |sub_value|
             sub_builder_info = detect_builder_info(sub_value, sub_builder_info)
             break if sub_builder_info and sub_builder_info[:detected]
           end
           if sub_builder_info
-            sub_builder = sub_builder_info[:builder]
-            return builder_info unless sub_builder
+            sub_builder = sub_builder_info[:builder] || create_builder(sub_builder_info)
+            return sub_builder_info unless sub_builder
             sub_value_data_type = sub_builder.value_data_type
             field = Field.new("item", sub_value_data_type)
             {
               builder: ListArrayBuilder.new(ListDataType.new(field)),
+              value_builder_info: sub_builder_info,
               detected: sub_builder_info[:detected],
             }
           else
@@ -186,6 +196,35 @@ module Arrow
           data_type = Decimal256DataType.new(builder_info[:precision],
                                              builder_info[:scale])
           Decimal256ArrayBuilder.new(data_type)
+        when :int
+          min = builder_info[:min]
+          max = builder_info[:max]
+
+          if GLib::MININT8 <= min && max <= GLib::MAXINT8
+            Int8ArrayBuilder.new
+          elsif GLib::MININT16 <= min && max <= GLib::MAXINT16
+            Int16ArrayBuilder.new
+          elsif GLib::MININT32 <= min && max <= GLib::MAXINT32
+            Int32ArrayBuilder.new
+          elsif GLib::MININT64 <= min && max <= GLib::MAXINT64
+            Int64ArrayBuilder.new
+          else
+            StringArrayBuilder.new
+          end
+        when :uint
+          max = builder_info[:max]
+
+          if max <= GLib::MAXUINT8
+            UInt8ArrayBuilder.new
+          elsif max <= GLib::MAXUINT16
+            UInt16ArrayBuilder.new
+          elsif max <= GLib::MAXUINT32
+            UInt32ArrayBuilder.new
+          elsif max <= GLib::MAXUINT64
+            UInt64ArrayBuilder.new
+          else
+            StringArrayBuilder.new
+          end
         else
           nil
         end
diff --git a/ruby/red-arrow/test/test-array-builder.rb b/ruby/red-arrow/test/test-array-builder.rb
index 7a2d42e54b..f629eec661 100644
--- a/ruby/red-arrow/test/test-array-builder.rb
+++ b/ruby/red-arrow/test/test-array-builder.rb
@@ -147,44 +147,404 @@ class ArrayBuilderTest < Test::Unit::TestCase
                      ])
       end
 
-      test("list<uint>s") do
-        values = [
-          [0, 1, 2],
-          [3, 4],
-        ]
-        array = Arrow::Array.new(values)
-        data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new)
-        assert_equal({
-                       data_type: data_type,
-                       values: [
-                         [0, 1, 2],
-                         [3, 4],
-                       ],
-                     },
-                     {
-                       data_type: array.value_data_type,
-                       values: array.to_a,
-                     })
-      end
+      sub_test_case("nested integer list") do
+        test("list<uint8>s") do
+          values = [
+            [0, 1, 2],
+            [3, 4],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint8)
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, 1, 2],
+                           [3, 4],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
 
-      test("list<int>s") do
-        values = [
-          [0, -1, 2],
-          [3, 4],
-        ]
-        array = Arrow::Array.new(values)
-        data_type = Arrow::ListDataType.new(Arrow::Int8DataType.new)
-        assert_equal({
-                       data_type: data_type,
-                       values: [
-                         [0, -1, 2],
-                         [3, 4],
-                       ],
-                     },
-                     {
-                       data_type: array.value_data_type,
-                       values: array.to_a,
-                     })
+        test("list<int8>s boundary") do
+          values = [
+            [0, GLib::MININT8],
+            [GLib::MAXINT8],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int8)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT8],
+                           [GLib::MAXINT8],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int16>s inferred from int8 underflow") do
+          values = [
+            [0, GLib::MININT8 - 1],
+            [GLib::MAXINT8],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int16)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT8 - 1],
+                           [GLib::MAXINT8],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int16>s inferred from int8 overflow") do
+          values = [
+            [0, GLib::MAXINT8 + 1],
+            [GLib::MININT8],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int16)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXINT8 + 1],
+                           [GLib::MININT8],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int16>s boundary") do
+          values = [
+            [0, GLib::MININT16],
+            [GLib::MAXINT16],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int16)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT16],
+                           [GLib::MAXINT16],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int32>s inferred from int16 underflow") do
+          values = [
+            [0, GLib::MININT16 - 1],
+            [GLib::MAXINT16],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int32)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT16 - 1],
+                           [GLib::MAXINT16],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int32>s inferred from int16 overflow") do
+          values = [
+            [0, GLib::MAXINT16 + 1],
+            [GLib::MININT16],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int32)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXINT16 + 1],
+                           [GLib::MININT16],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int32>s boundary") do
+          values = [
+            [0, GLib::MININT32],
+            [GLib::MAXINT32],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int32)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT32],
+                           [GLib::MAXINT32],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int64>s inferred from int32 underflow") do
+          values = [
+            [0, GLib::MININT32 - 1],
+            [GLib::MAXINT32],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int64)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MININT32 - 1],
+                           [GLib::MAXINT32],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<int64>s inferred from int32 overflow") do
+          values = [
+            [0, GLib::MAXINT32 + 1],
+            [GLib::MININT32],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:int64)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXINT32 + 1],
+                           [GLib::MININT32],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("string fallback from nested int64 array overflow") do
+          values = [
+            [0, GLib::MAXINT64 + 1],
+            [GLib::MININT64],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:string)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           ["0", "#{GLib::MAXINT64 + 1}"],
+                           ["#{GLib::MININT64}"],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("string fallback from nested int64 array underflow") do
+          values = [
+            [0, GLib::MININT64 - 1],
+            [GLib::MAXINT64],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:string)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           ["0", "#{GLib::MININT64 - 1}"],
+                           ["#{GLib::MAXINT64}"],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint8>s boundary") do
+          values = [
+            [0, GLib::MAXUINT8],
+            [GLib::MAXUINT8],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint8)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT8],
+                           [GLib::MAXUINT8],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint16>s") do
+          values = [
+            [0, GLib::MAXUINT8 + 1],
+            [GLib::MAXUINT8],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint16)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT8 + 1],
+                           [GLib::MAXUINT8],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint16>s boundary") do
+          values = [
+            [0, GLib::MAXUINT16],
+            [GLib::MAXUINT16],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint16)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT16],
+                           [GLib::MAXUINT16],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint32>s") do
+          values = [
+            [0, GLib::MAXUINT16 + 1],
+            [GLib::MAXUINT16],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint32)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT16 + 1],
+                           [GLib::MAXUINT16],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint32>s boundary") do
+          values = [
+            [0, GLib::MAXUINT32],
+            [GLib::MAXUINT32],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint32)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT32],
+                           [GLib::MAXUINT32],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("list<uint64>s") do
+          values = [
+            [0, GLib::MAXUINT32 + 1],
+            [GLib::MAXUINT32],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:uint64)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           [0, GLib::MAXUINT32 + 1],
+                           [GLib::MAXUINT32],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
+
+        test("string fallback from nested uint64 array overflow") do
+          values = [
+            [0, GLib::MAXUINT64 + 1],
+            [GLib::MAXUINT64],
+          ]
+          array = Arrow::Array.new(values)
+          data_type = Arrow::ListDataType.new(:string)
+
+          assert_equal({
+                         data_type: data_type,
+                         values: [
+                           ["0", "#{GLib::MAXUINT64 + 1}"],
+                           ["#{GLib::MAXUINT64}"],
+                         ],
+                       },
+                       {
+                         data_type: array.value_data_type,
+                         values: array.to_a,
+                       })
+        end
       end
     end
 
