el-hult opened a new issue, #45087:
URL: https://github.com/apache/arrow/issues/45087
### Describe the bug, including details regarding any error messages, version, and platform.
The R arrow library cannot load a file with schema
```
schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
child 0, element: dictionary<values=string, indices=int32, ordered=0>
```
if the table is chunked. To reproduce, run the Python script below in an
environment that also has R with the arrow package installed:
```python
import subprocess

import pyarrow as pa
import pyarrow.parquet as pq


def test_load_parquet(table, label):
    # Force one row per row group so the file contains multiple row groups.
    pq.write_table(table, "t.parquet", row_group_size=1)
    res = subprocess.run(
        ["Rscript", "-e", 'library(arrow);t=arrow::read_parquet("t.parquet");'],
        capture_output=True,
    )
    print(f"{label}\n#####")
    if res.returncode != 0:
        stdErr = res.stderr.decode()
        assert (
            "NotImplemented: Nested data conversions not implemented for "
            "chunked array outputs" in stdErr
        )
        print("R failed")
    else:
        print("R ok")
    pq.read_table("t.parquet")  # no error!
    print("python ok")
    print("schema:", pq.read_schema("t.parquet"))


codes = [["a"], ["a"]]
# t1: plain list<string>
t1 = pa.table({"codes": codes})
# t2: large_list of dictionary-encoded strings
t2 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.large_list(pa.dictionary(pa.int32(), pa.string()))})
)
# t3: list of dictionary-encoded strings
t3 = pa.table({"codes": codes}).cast(
    pa.schema({"codes": pa.list_(pa.dictionary(pa.int32(), pa.string()))})
)
test_load_parquet(t1, "t1")
test_load_parquet(t2, "t2")
test_load_parquet(t3, "t3")
```
This produces the following output:
```
t1
#####
R ok
python ok
schema: codes: list<element: string>
child 0, element: string
t2
#####
R failed
python ok
schema: codes: large_list<element: dictionary<values=string, indices=int32, ordered=0>>
child 0, element: dictionary<values=string, indices=int32, ordered=0>
t3
#####
R failed
python ok
schema: codes: list<element: dictionary<values=string, indices=int32, ordered=0>>
child 0, element: dictionary<values=string, indices=int32, ordered=0>
```
I have verified that this occurs with R arrow versions 13.0.0.0 and 18.1.0.
Both list_ and large_list fail.
The error reported by the R library is discussed in #32723, but since the
same file loads fine in pyarrow, I guess this is a separate issue from the
C++ issue.
### Component(s)
Parquet, R