Aryik opened a new issue, #44164:
URL: https://github.com/apache/arrow/issues/44164
### Describe the bug, including details regarding any error messages, version, and platform.
I was getting the following error when trying to build a large polars
DataFrame from a pyarrow table:
```
File "/app/decision_engine/loader.py", line 897, in dataframe_from_dicts
pl.from_arrow(arrow_chunks, schema=schema),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/convert/general.py",
line 462, in from_arrow
arrow_to_pydf(
File
"/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py",
line 1195, in arrow_to_pydf
ps = plc.arrow_to_pyseries(name, column, rechunk=rechunk)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/series.py",
line 421, in arrow_to_pyseries
pys = PySeries.from_arrow(name, array.combine_chunks())
^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 754, in
pyarrow.lib.ChunkedArray.combine_chunks
File "pyarrow/array.pxi", line 4579, in pyarrow.lib.concat_arrays
File "pyarrow/error.pxi", line 155, in
pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```
After some digging, I discovered that this error only occurs when calling `combine_chunks` on a chunked array whose type is a struct of strings, and only once the total size of the array exceeds some threshold. In my testing, the error started appearing at around 3,668,663,880 bytes.
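As a rough sanity check (my arithmetic, assuming each row of the `nested_string` child contributes 10 bytes of string data plus a 4-byte `int32` offset), the raw string data alone crosses the `int32` limit well before `nbytes` reaches the figure above:
```python
# Back-of-the-envelope check; the per-row byte breakdown is an assumption.
INT32_MAX = 2**31 - 1                        # 2,147,483,647
observed_nbytes = 3_668_663_880              # total size at which the error appeared
approx_rows = observed_nbytes // 14          # ~262M rows (10 data + 4 offset bytes each)
string_data_bytes = approx_rows * 10         # ~2.62 GB of raw string data
print(string_data_bytes > INT32_MAX)         # True: past the 32-bit offset limit
```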
Previous bug reports about this offset overflow mostly involve very large individual strings. In this case, no single string is anywhere near 2GB; instead, the error appears once the total size crosses the threshold. Here is a minimal reproduction:
```python
from typing import Any, Dict, List, Tuple

import pyarrow as pa


def get_arrow_table_and_chunks(
    dicts: List[Dict[str, Any]],
    schema_keys: List[str],
) -> Tuple[pa.Table, List[pa.RecordBatch]]:
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]
    # Convert to Arrow Table
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )
    print(f"Arrow Table schema: {arrow_table.schema}")
    # Provide chunks of the Arrow Table to polars. If we have too much data
    # in a single chunk, we get these strange errors:
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)
    return arrow_table, arrow_chunks


placeholder_string = "1" * 10
# Careful - this uses ~150GB of memory and takes a long time.
# n_rows must be set large enough to push the struct column past the
# size threshold noted above.
data = [
    {"i": placeholder_string, "nested": {"nested_string": placeholder_string}}
    for _ in range(n_rows)
]
table, chunks = get_arrow_table_and_chunks(data, ["i", "nested"])
# Output:
# Arrow Table schema: i: string
# nested: struct<nested_string: string>
# child 0, nested_string: string
table["nested"].nbytes
# Output: 3,668,663,880
table["nested"].combine_chunks()
# ---------------------------------------------------------------------------
# ArrowInvalid Traceback (most recent call last)
# Cell In[12], line 1
# ----> 1 table["nested"].combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/table.pxi:754, in pyarrow.lib.ChunkedArray.combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/array.pxi:4579, in pyarrow.lib.concat_arrays()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
# ArrowInvalid: offset overflow while concatenating arrays
```
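The reproduction above needs ~150GB of memory. A smaller sketch along the same lines should trigger the same overflow with only a few GB by packing more bytes into each string (the 1 MiB string size and chunk length are hypothetical choices, not from my runs above):
```python
import pyarrow as pa

# Two chunks of struct<s: string>, each holding ~1.1 GiB of string data.
# Each chunk individually fits int32 offsets, but the combined child data
# (~2.2 GiB) exceeds 2**31 - 1 bytes, so combine_chunks should hit the
# same offset overflow.
big = "x" * (1 << 20)  # 1 MiB per string
chunk = pa.array([{"s": big}] * 1100, type=pa.struct([("s", pa.string())]))
chunked = pa.chunked_array([chunk, chunk])
chunked.combine_chunks()
# Expected: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```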
When I called `combine_chunks` on the same amount of data without the struct nesting, I did not get the error. I was able to reproduce this on macOS Sonoma 14.6.1 as well as on the `python:3.12.3-slim` Docker image, which is based on Debian 12.
Python version: 3.12.2
PyArrow version: 17.0.0
Polars version: 1.7.1
The error appears to be thrown from `PutOffsets`, called via `Concatenate` in `concatenate.cc`:
https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/array/concatenate.cc#L166
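As a possible workaround (my own sketch, untested against the C++ internals), casting the struct's string child to `large_string`, which uses 64-bit offsets, should avoid the overflow before combining:
```python
import pyarrow as pa

# Workaround sketch: widen the string offsets to int64 before combining.
# `table` is the Arrow table from the reproduction above.
large_type = pa.struct([("nested_string", pa.large_string())])
combined = table["nested"].cast(large_type).combine_chunks()
```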
### Component(s)
C++, Python