rudolfbyker opened a new issue, #44397:
URL: https://github.com/apache/arrow/issues/44397
### Describe the enhancement requested
It would be nice to have another step up from
`promote_options="permissive"`, e.g., `promote_options="union"` which uses
dense unions when columns are heterogeneous across schemas. For example:
```py
from pyarrow import table, concat_tables
t1 = table({"a": [1, 2, 3]})
t2 = table({"a": ["a", "b", "c"]})
concat_tables(tables=[t1, t2], promote_options="permissive")  # Currently raises `ArrowTypeError`.
concat_tables(tables=[t1, t2], promote_options="union")  # Does not exist at the moment.
```
The latter should use a dense union for column "a".
I've implemented this myself, but it's hard to do because there is no
`is_mergeable` function that exposes the logic used by
`concat_tables(tables=…, promote_options="permissive")`. I have to re-implement
that logic, either by guesswork or with lots of `try-except`s. Here is a rough
attempt, which works for some cases but not all. It also neither preserves
metadata nor supports missing columns:
```py
from itertools import chain
from logging import getLogger
from typing import Sequence

from pyarrow import (
    Table,
    concat_tables,
    ArrowTypeError,
    table,
    chunked_array,
    ArrowInvalid,
    UnionArray,
    array,
    int8,
    int32,
)

logger = getLogger(__name__)


def concat_tables_heterogeneous(tables: Sequence[Table]) -> Table:
    """
    Concatenate multiple tables vertically.

    This is similar to `pyarrow.concat_tables`, but it allows for
    heterogeneous schemas by using dense unions.
    """
    try:
        return concat_tables(tables=tables, promote_options="permissive")
    except ArrowTypeError:
        logger.warning(
            "Heterogeneous table schemas detected. "
            "Some columns will be represented as dense unions, which are slower."
        )

    # TODO: Ask the `pyarrow` maintainers to give us an `is_mergeable` function
    # that we can use to check which columns are mergeable without using dense
    # unions, instead of maintaining our own heuristics here.
    it = iter(tables)
    column_names = next(it).column_names
    for t in it:
        if t.column_names != column_names:
            raise NotImplementedError(
                "The tables don't all have the same column names."
            )

    result = {}
    for column_name in column_names:
        try:
            result[column_name] = chunked_array([t[column_name] for t in tables])
        except ArrowInvalid:
            # These can't be concatenated into a normal `ChunkedArray`. Use a
            # dense union.
            result[column_name] = UnionArray.from_dense(
                # Type IDs: one child per input table, repeated once per row.
                array(
                    list(chain(*([i] * t.num_rows for i, t in enumerate(tables)))),
                    type=int8(),
                ),
                # Offsets: each row's position within its own child array.
                array(
                    list(chain(*(range(t.num_rows) for t in tables))),
                    type=int32(),
                ),
                [array(t[column_name]) for t in tables],
            )

    return table(data=result)
```
### Component(s)
Python