cdl-altium opened a new issue, #49158:
URL: https://github.com/apache/arrow/issues/49158
### Describe the bug, including details regarding any error messages, version, and platform.
# Summary
PyArrow cannot read a newline-delimited JSON file with inconsistent
column types, even when `parse_options` specifies an explicit schema.
This happens on PyArrow 23.0.0 (current release) with Python 3.10.
# Details
Consider a newline-delimited JSON file consisting of the following lines:
```
{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}
{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}
```
Note that the value of `"_id"` is a quoted string in the first row and an
unquoted number in the second.
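As a quick illustration of the mismatch, here is a minimal sketch that parses the two rows with Python's standard `json` module (not PyArrow) and reports the type each row yields for `"_id"`. The `lines` and `id_types` names are only illustrative.

```python
import json

# The two records from the example file, one JSON document per line.
lines = [
    '{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}',
    '{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}',
]

# Collect the name of the Python type each row produces for "_id".
id_types = [type(json.loads(line)["_id"]).__name__ for line in lines]
print(id_types)  # ['str', 'int']
```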
Reading that file with this code:
```
import pyarrow
from pyarrow import json as pjson

# Declare every expected column as a string, including "_id".
names = ["_type", "_id", "_op_type", "_source", "_index"]
src_schema = pyarrow.schema([(x, pyarrow.string()) for x in names])

parse_options = pjson.ParseOptions(
    explicit_schema=src_schema,
    newlines_in_values=False,
    # Accepted values are lowercase: "ignore", "error" or "infer".
    unexpected_field_behavior="ignore",
)
blob = pjson.read_json("failure_poc.json", parse_options=parse_options)
```
generates an error:
```
pyarrow.lib.ArrowInvalid: JSON parse error: Column(/_id) changed from string to number in row 1
```
Expected behaviour would be for PyArrow to read the entire file, casting
`"_id"` to string as required by the explicit schema.
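Until that works, one possible workaround is to normalize the offending fields before PyArrow sees the data. The sketch below is not PyArrow behaviour: `coerce_fields_to_str` is a hypothetical helper built on the standard `json` module, and it assumes the file is small enough to hold in memory. It rewrites the named fields as JSON strings so every row agrees with a string schema.

```python
import io
import json


def coerce_fields_to_str(ndjson_text, fields):
    """Return NDJSON text in which the named fields are JSON strings.

    Hypothetical helper for illustration; not part of PyArrow.
    """
    out_lines = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name in fields:
            value = record.get(name)
            # Leave missing/null values and existing strings untouched;
            # spell everything else the way it appeared in JSON.
            if value is not None and not isinstance(value, str):
                record[name] = json.dumps(value)
        out_lines.append(json.dumps(record))
    return "\n".join(out_lines) + "\n"


sample = (
    '{"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}\n'
    '{"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}\n'
)
normalized = coerce_fields_to_str(sample, ["_id"])

# A file-like object that could then be handed to PyArrow, for example:
#   pjson.read_json(buffer, parse_options=parse_options)
buffer = io.BytesIO(normalized.encode("utf-8"))
```

The trade-off is an extra pass over the data in pure Python, which may be slow for large files.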
### Component(s)
Python