rudolfbyker opened a new issue, #44397:
URL: https://github.com/apache/arrow/issues/44397

   ### Describe the enhancement requested
   
   It would be nice to have a step beyond `promote_options="permissive"`, e.g. `promote_options="union"`, which would fall back to dense unions when a column's type differs across schemas. For example:
   
   ```py
   from pyarrow import table, concat_tables
   
   t1 = table({"a": [1, 2, 3]})
   t2 = table({"a": ["a", "b", "c"]})
   
   concat_tables(tables=[t1, t2], promote_options="permissive")  # Currently raises `ArrowTypeError`.
   concat_tables(tables=[t1, t2], promote_options="union")  # Does not exist at the moment.
   ```
   
   The latter should use a dense union for column "a".
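
To make the requested layout concrete: a dense union stores one child array per type, a type-id buffer saying which child each row comes from, and an offsets buffer giving the row's position inside that child. Below is a minimal pure-Python sketch of those two buffers for the example above. `dense_union_layout` is a hypothetical helper introduced for illustration, not part of `pyarrow`, and it assumes one child per input table.

```python
from itertools import chain
from typing import List, Sequence, Tuple


def dense_union_layout(row_counts: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: type ids and offsets for one union child per table."""
    # Every row of table i selects child i.
    type_ids = list(chain.from_iterable([i] * n for i, n in enumerate(row_counts)))
    # Each row's offset is its position within its own child.
    offsets = list(chain.from_iterable(range(n) for n in row_counts))
    return type_ids, offsets


# Column "a" of t1 and t2 from the example: three rows each.
children = [[1, 2, 3], ["a", "b", "c"]]
type_ids, offsets = dense_union_layout([len(c) for c in children])

# Resolve each logical row by following its type id and offset.
values = [children[t][o] for t, o in zip(type_ids, offsets)]
print(type_ids)  # [0, 0, 0, 1, 1, 1]
print(offsets)   # [0, 1, 2, 0, 1, 2]
print(values)    # [1, 2, 3, 'a', 'b', 'c']
```

In `pyarrow`, these two lists would be passed to `UnionArray.from_dense` as `int8` and `int32` arrays, with the per-table columns as the children.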
   
   I've implemented this myself, but it's hard to do: there is no `is_mergeable` function exposing the logic that `concat_tables(tables=…, promote_options="permissive")` uses, so I have to re-implement it, either by guesswork or with lots of `try`/`except` blocks. Here is a rough attempt, which works for some cases but not all. It does not preserve metadata and does not support missing columns:
   
   ```py
   from itertools import chain
   from logging import getLogger
   from typing import Sequence
   
   from pyarrow import (
       Table,
       concat_tables,
       ArrowTypeError,
       table,
       chunked_array,
       ArrowInvalid,
       UnionArray,
       array,
       int8,
       int32,
   )
   
   logger = getLogger(__name__)
   
   
   def concat_tables_heterogeneous(tables: Sequence[Table]) -> Table:
       """
       Concatenate multiple tables vertically.
       This is similar to `pyarrow.concat_tables`, but it allows for
       heterogeneous schemas by using dense unions.
       """
       try:
           return concat_tables(tables=tables, promote_options="permissive")
       except ArrowTypeError:
           logger.warning(
               "Heterogeneous table schemas detected. "
               "Some columns will be represented as dense unions, which are slower."
           )
   
       # TODO: Ask the `pyarrow` maintainers to give us an `is_mergeable` function
       #   that we can use to check which columns are mergeable without using
       #   dense unions, instead of maintaining our own heuristics here.
       it = iter(tables)
       column_names = next(it).column_names
       for t in it:
           if t.column_names != column_names:
               raise NotImplementedError(
                   "The tables don't all have the same column names."
               )
   
       result = {}
       for column_name in column_names:
           try:
               result[column_name] = chunked_array([t[column_name] for t in tables])
           except ArrowInvalid:
               # These can't be concatenated into a normal `ChunkedArray`.
               # Use a dense union with one child per table.
               result[column_name] = UnionArray.from_dense(
                   # Type ids: every row of table `i` points at child `i`.
                   array(
                       list(chain(*([i] * t.num_rows for i, t in enumerate(tables)))),
                       type=int8(),
                   ),
                   # Offsets: each row's position within its own child.
                   array(
                       list(chain(*(range(t.num_rows) for t in tables))),
                       type=int32(),
                   ),
                   [array(t[column_name]) for t in tables],
               )
   
       return table(data=result)
   ```
   
   ### Component(s)
   
   Python

