[I] `RunEndEncodeTableColumns` doesn't change the table's schema type to reflect that the column is run-end-encoded [arrow]

via GitHub Thu, 13 Feb 2025 14:47:03 -0800


lesterfan opened a new issue, #45534:
URL: https://github.com/apache/arrow/issues/45534


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This is half a bug report regarding the 
[RunEndEncodeTableColumns](https://github.com/apache/arrow/blob/6a47e4d28cdc4592fe6a458dbe5efe3b17a090e5/cpp/src/arrow/testing/gtest_util.cc#L476-L492)
 gtest util and half a usage question.
   
   If a string column in an `arrow::Table` is run-end encoded, should the 
corresponding schema type be `arrow::utf8()` or 
`arrow::run_end_encoded(arrow::int32(), arrow::utf8())`? The 
[RunEndEncodeTableColumns](https://github.com/apache/arrow/blob/6a47e4d28cdc4592fe6a458dbe5efe3b17a090e5/cpp/src/arrow/testing/gtest_util.cc#L476-L492)
 gtest util currently returns a table like 
   ```
   ree_table = col: string
   ----
   col:
     [
   
       -- run_ends:
         [
           1,
           2,
           3,
           4
         ]
       -- values:
         [
           "a",
           "b",
           "c",
           "d"
         ]
     ]
   ```
   whereas I would have expected a table like
   ```
   ree_table = col: run_end_encoded<run_ends: int32, values: string>
     child 0, run_ends: int32 not null
     child 1, values: string
   ----
   col:
     [
   
       -- run_ends:
         [
           1,
           2,
           3,
           4
         ]
       -- values:
         [
           "a",
           "b",
           "c",
           "d"
         ]
     ]
   ```
   I'm not sure which is more correct here. My instinct is that the second is 
more correct, since I see in the codebase that certain features are disabled for 
run-end-encoded types 
([example](https://github.com/apache/arrow/blob/6a47e4d28cdc4592fe6a458dbe5efe3b17a090e5/python/pyarrow/src/arrow/python/arrow_to_pandas.cc#L1383)),
 so the schema should report the column's actual type and, with it, what the 
library currently supports for that column. I definitely don't have a lot of 
context here though, so I may be missing something 🙂 
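   
   For readers less familiar with the layout printed in both outputs, here is a minimal sketch of how `run_ends` and `values` map back to the logical column. It is plain C++ and does not use Arrow; `ReeStringColumn` and `Decode` are hypothetical names made up for illustration, and it assumes each run end is the exclusive logical end index of its run.

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <stdexcept>
   #include <string>
   #include <vector>

   // Hypothetical stand-in for a run-end-encoded string column (not an Arrow type).
   struct ReeStringColumn {
     std::vector<int32_t> run_ends;    // exclusive logical end index of each run
     std::vector<std::string> values;  // one value per run
   };

   // Expand the runs into the logical sequence of strings they represent.
   std::vector<std::string> Decode(const ReeStringColumn& col) {
     if (col.run_ends.size() != col.values.size()) {
       throw std::invalid_argument("run_ends and values must have equal length");
     }
     std::vector<std::string> out;
     int32_t start = 0;  // logical index where the current run begins
     for (std::size_t i = 0; i < col.run_ends.size(); ++i) {
       const int32_t end = col.run_ends[i];
       if (end <= start) {
         throw std::invalid_argument("run_ends must be positive and strictly increasing");
       }
       // Repeat values[i] once for every logical position in [start, end).
       out.insert(out.end(), static_cast<std::size_t>(end - start), col.values[i]);
       start = end;
     }
     return out;
   }
   ```

   In the example above every run has length 1, so run ends `1, 2, 3, 4` with values `"a", "b", "c", "d"` decode to the same four strings. The logical contents are identical under either schema; only the declared column type differs.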
   
   ### Component(s)
   
   C++


