edmcman opened a new issue, #46040: URL: https://github.com/apache/arrow/issues/46040
### Describe the enhancement requested

JSON is extremely flexible, and Arrow's [automatic type inference rules](https://arrow.apache.org/docs/python/json.html#automatic-type-inference) often fail to find a schema. Here is a simple example on which `pyarrow.json.read_json` raises `ArrowInvalid: JSON parse error: Column(/lol/[]) changed from number to array in row 0`:

```json
{ "lol": [42, []]}
```

I'm attempting to help solve [this issue](https://github.com/huggingface/datasets/issues/5950), which is basically about coping with such "unstructured" JSON formats. AFAICT Arrow doesn't make this easy, because parsing/conversion simply fails without an obvious way to move forward.

It would be nice if there were a way to proceed even when type inference fails. Here are a few ideas:

* If type inference fails, optionally use a general container type (`JsonType`?) instead. This could perhaps be a ParseOption passed to `read_json`.
* Allow the user to augment the type inference process (e.g., with a callback), especially when it would otherwise fail. Again, this could be a ParseOption.
* Expose the type inference algorithm as a separate function, with a way to explicitly indicate failures in the schema. In other words, don't throw an exception, but rather mark that part of the schema in some way as needing additional input from the caller.

If I missed an obvious way of coping with these unstructured JSON files, please let me know!

### Component(s)

Python, Other
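To make the third idea a bit more concrete, here is a rough pure-Python sketch of inference that marks a conflicting part of the schema instead of raising. It is not Arrow's inference algorithm and not an existing pyarrow API: the names `infer`, `unify`, `infer_schema`, and `CONFLICT`, as well as the dict-based schema representation, are all hypothetical and only meant to illustrate the shape of the result a caller could work with.

```python
import json

# Hypothetical marker for "inference failed here; the caller must decide".
CONFLICT = "<conflict>"


def unify(a, b):
    """Merge two inferred types; return CONFLICT where they disagree."""
    if a is None:          # nothing seen yet
        return b
    if a == b:
        return a
    if a == "null":        # null is compatible with any other type
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "list" in a and "list" in b:
            return {"list": unify(a["list"], b["list"])}
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, typ in b["struct"].items():
                fields[name] = unify(fields.get(name), typ)
            return {"struct": fields}
    return CONFLICT        # e.g. number vs. list, string vs. struct


def infer(value):
    """Describe the type of one decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):   # check bool before int: bool is an int subclass
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item = None
        for element in value:
            item = unify(item, infer(element))
        return {"list": item if item is not None else "null"}
    return {"struct": {k: infer(v) for k, v in value.items()}}


def infer_schema(ndjson_text):
    """Infer one schema for newline-delimited JSON, never raising on type conflicts."""
    schema = None
    for line in ndjson_text.splitlines():
        if line.strip():
            schema = unify(schema, infer(json.loads(line)))
    return schema


print(infer_schema('{ "lol": [42, []]}'))
```

On the example above this prints `{'struct': {'lol': {'list': '<conflict>'}}}`, so the caller can see that only the list items under `lol` need a manual decision (for instance, falling back to a string or a generic JSON container) while the rest of the schema is usable as inferred.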