rudolfbyker opened a new issue, #44397: URL: https://github.com/apache/arrow/issues/44397
### Describe the enhancement requested

It would be nice to have another step up from `promote_options="permissive"`, e.g., `promote_options="union"`, which uses dense unions when columns are heterogeneous across schemas. For example:

```py
from pyarrow import table, concat_tables

t1 = table({"a": [1, 2, 3]})
t2 = table({"a": ["a", "b", "c"]})

concat_tables(tables=[t1, t2], promote_options="permissive")  # Currently raises `ArrowTypeError`.
concat_tables(tables=[t1, t2], promote_options="union")  # Does not exist at the moment.
```

The latter should use a dense union for column "a".

I've implemented this myself, but it's hard to do, because there is no `is_mergeable` function that exposes the logic used by `concat_tables(tables=…, promote_options="permissive")`. That forces me to re-implement it, either by guesswork or with lots of `try-except`s.

Here is a rough attempt, which works for some cases, but not all. It also does not preserve metadata, nor support missing columns:

```py
from itertools import chain
from logging import getLogger
from typing import Sequence

from pyarrow import (
    Table,
    concat_tables,
    ArrowTypeError,
    table,
    chunked_array,
    ArrowInvalid,
    UnionArray,
    array,
    int8,
    int32,
)

logger = getLogger(__name__)


def concat_tables_heterogeneous(tables: Sequence[Table]) -> Table:
    """
    Concatenate multiple tables vertically.

    This is similar to `pyarrow.concat_tables`, but it allows for heterogeneous schemas by using dense unions.
    """
    try:
        return concat_tables(tables=tables, promote_options="permissive")
    except ArrowTypeError:
        logger.warning(
            "Heterogeneous table schemas detected. "
            "Some columns will be represented as dense unions, which are slower."
        )

    # TODO: Ask the `pyarrow` maintainers to give us a `is_mergeable` function that we can use to check which columns
    #  are mergeable without using dense unions, instead of maintaining our own heuristics here.

    it = iter(tables)
    column_names = next(it).column_names
    for t in it:
        if t.column_names != column_names:
            raise NotImplementedError(
                "The tables don't all have the same column names."
            )

    result = {}
    for column_name in column_names:
        try:
            result[column_name] = chunked_array([t[column_name] for t in tables])
        except ArrowInvalid:
            # These can't be concatenated into a normal `ChunkedArray`. Use a dense union.
            result[column_name] = UnionArray.from_dense(
                array(
                    list(chain(*([i] * t.num_rows for i, t in enumerate(tables)))),
                    type=int8(),
                ),
                array(
                    list(chain(*(range(t.num_rows) for t in tables))),
                    type=int32(),
                ),
                [array(t[column_name]) for t in tables],
            )

    return table(data=result)
```

### Component(s)

Python
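
For readers unfamiliar with the dense union layout, the two index arrays that the rough attempt passes to `UnionArray.from_dense` can be illustrated without `pyarrow` at all. The sketch below is only an illustration. The helper name `dense_union_buffers` is hypothetical, and it assumes one child array per input table, as in the attempt above.

```python
from itertools import chain
from typing import List, Sequence, Tuple


def dense_union_buffers(row_counts: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (type_ids, offsets) for a dense union with one child per table.

    `row_counts[i]` is the number of rows contributed by table `i`.
    This is a hypothetical helper for illustration, not a pyarrow API.
    """
    # type_ids[k] is the index of the child array that holds row k.
    type_ids = list(chain.from_iterable([i] * n for i, n in enumerate(row_counts)))
    # offsets[k] is the position of row k inside that child array.
    offsets = list(chain.from_iterable(range(n) for n in row_counts))
    return type_ids, offsets


# Two tables with 3 and 2 rows respectively.
type_ids, offsets = dense_union_buffers([3, 2])
print(type_ids)  # [0, 0, 0, 1, 1]
print(offsets)   # [0, 1, 2, 0, 1]
```

Because the type ids are stored as `int8`, this one-child-per-table layout only works while the number of tables fits in that range. Grouping tables by column type, so that each distinct type gets a single child, would avoid that limit, but it is not part of the attempt above.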