[I] Flexible handling of conflicts during JSON type inference [arrow]

via GitHub Mon, 07 Apr 2025 07:05:16 -0700


edmcman opened a new issue, #46040:
URL: https://github.com/apache/arrow/issues/46040


   ### Describe the enhancement requested
   
   JSON is extremely flexible, and Arrow's [automatic type inference 
rules](https://arrow.apache.org/docs/python/json.html#automatic-type-inference) 
often fail to find a schema.
   
   Here is a simple example on which `pyarrow.json.read_json` raises 
`ArrowInvalid: JSON parse error: Column(/lol/[]) changed from number to array 
in row 0`
   ```json
   { "lol": [42, []]}
   ```
   
   I'm attempting to help solve [this 
issue](https://github.com/huggingface/datasets/issues/5950), which is basically 
to cope with such "unstructured" JSON formats. AFAICT Arrow doesn't make this 
easy, because parsing/conversion simply fails without an obvious way to move 
forward. It would be nice if there were a way to proceed even if type inference 
fails. Here are a few ideas:
   
   * If type inference fails, optionally use a general container type 
(`JsonType`?) instead. This could perhaps be a `ParseOptions` setting passed to 
`read_json`.
   * Allow the user to augment the type inference process (e.g., via a callback), 
especially when it would otherwise fail. Again, this could be a `ParseOptions` setting.
   * Expose the type inference algorithm as a separate function, with a way to 
explicitly indicate failures in the schema.  In other words, don't throw an 
exception, but rather mark that part of the schema in some way as needing 
additional input from the caller.
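
To make the third idea concrete, here is a minimal pure-Python sketch of an inference pass that records a conflict marker instead of raising. It is not Arrow's algorithm or API: the names `infer`, `unify`, `conflict_paths`, and the `"conflict"` marker are hypothetical, and the type descriptors are simplified to plain strings and tuples.

```python
import json

CONFLICT = "conflict"  # hypothetical marker: "this part of the schema needs caller input"


def infer(value):
    """Infer a simplified type descriptor for one decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elem = "null"  # an empty list has an unknown ("null") element type
        for item in value:
            elem = unify(elem, infer(item))
        return ("list", elem)
    if isinstance(value, dict):
        return ("struct", {name: infer(v) for name, v in value.items()})
    raise TypeError(f"not a JSON value: {value!r}")


def unify(a, b):
    """Merge two descriptors; return CONFLICT on a mismatch instead of raising."""
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, str) or isinstance(b, str):
        # Primitives must match exactly; CONFLICT absorbs everything else.
        return a if a == b else CONFLICT
    if a[0] != b[0]:  # list vs struct
        return CONFLICT
    if a[0] == "list":
        return ("list", unify(a[1], b[1]))
    fields = dict(a[1])  # struct: merge field by field
    for name, t in b[1].items():
        fields[name] = unify(fields.get(name, "null"), t)
    return ("struct", fields)


def conflict_paths(t, path=""):
    """List the paths whose type could not be inferred."""
    if t == CONFLICT:
        return [path or "/"]
    if isinstance(t, str):
        return []
    if t[0] == "list":
        return conflict_paths(t[1], path + "/[]")
    paths = []
    for name, field_type in t[1].items():
        paths += conflict_paths(field_type, f"{path}/{name}")
    return paths


schema = infer(json.loads('{ "lol": [42, []]}'))
print(schema)
print(conflict_paths(schema))
```

For the example above, the caller would get a schema with the list's element type marked as a conflict, and could then decide what to do with that column (for instance, fall back to a generic container type as in the first idea).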
   
   If I missed an obvious way of coping with these unstructured JSON files, 
please let me know!
   
   ### Component(s)
   
   Python, Other


