cdl-altium opened a new issue, #49158:
URL: https://github.com/apache/arrow/issues/49158

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   # Summary
   
   PyArrow cannot read a newline-delimited JSON file whose column types are 
inconsistent between rows, even when `parse_options` specifies an explicit schema.
   
   This happens on PyArrow version 23.0.0 (current release) on Python 3.10.
   
   
   # Details
   
   Consider a newline-delimited JSON file consisting of the following lines:
   
   ```
   {"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"}
   {"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"}
   ```
   
   Note how `"_id"` is inconsistently quoted.
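
For convenience, the input file can be generated with a short script. This is a minimal sketch using only the standard library; the file name `failure_poc.json` is taken from the reproduction code, and the `records` and `path` names are illustrative.

```python
import json
from pathlib import Path

# Two records that differ only in whether "_id" is a JSON string or a number.
records = [
    {"_type": "part", "_id": "152934", "_op_type": "delete", "_index": "my_index"},
    {"_type": "part", "_id": 152934, "_op_type": "delete", "_index": "my_index"},
]

# Newline-delimited JSON: exactly one object per line.
path = Path("failure_poc.json")
path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
```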
   
   Then this code:
   
   ```
   import pyarrow
   from pyarrow import json as pjson
   
   # Declare every column as a string, including "_id"
   names = ["_type", "_id", "_op_type", "_source", "_index"]
   src_schema = pyarrow.schema([(x, pyarrow.string()) for x in names])
   parse_options = pjson.ParseOptions(
       explicit_schema=src_schema,
       newlines_in_values=False,
       # Documented values are lowercase: "ignore", "error", "infer"
       unexpected_field_behavior="ignore",
   )
   
   # Raises ArrowInvalid on the second record, where "_id" is a number
   blob = pjson.read_json("failure_poc.json", parse_options=parse_options)
   ```
   
   generates an error:
   
   ```
   pyarrow.lib.ArrowInvalid: JSON parse error: Column(/_id) changed from string to number in row 1
   ```
   
   Expected behaviour would be for PyArrow to read the entire file, casting 
`"_id"` to string as required.
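
Until the reader does this itself, one possible workaround is to normalize the file before handing it to PyArrow. The following is a minimal sketch and not a PyArrow API: the helper names `coerce_fields_to_str` and `normalize_ndjson` are hypothetical, only the standard library is used, and it assumes each line holds one JSON object.

```python
import json


def coerce_fields_to_str(line, fields):
    """Return one NDJSON line with numeric values in `fields` rewritten as strings."""
    record = json.loads(line)
    for name in fields:
        value = record.get(name)
        # Only numbers are converted; strings, nulls and missing keys are left alone.
        # bool is excluded explicitly because it is a subclass of int in Python.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            record[name] = json.dumps(value)
    return json.dumps(record)


def normalize_ndjson(src_path, dst_path, fields):
    """Copy src_path to dst_path, coercing `fields` to strings on every non-blank line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(coerce_fields_to_str(line, fields) + "\n")
```

After something like `normalize_ndjson("failure_poc.json", "normalized.json", ["_id"])`, every `"_id"` value is a JSON string, so the normalized file should be readable with the same `parse_options`. Note that numbers are re-serialized by Python, so a value written as `1e3` in the source would come out as `"1000.0"`.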
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

