kyrre opened a new issue, #45574:
URL: https://github.com/apache/arrow/issues/45574
### Describe the usage question you have. Please include as many useful details as possible.
We want to use PyArrow for ETL jobs in which JSON files are periodically read
from Azure Blob Storage and inserted into Delta Lake tables. While the schemas
are available, some of the columns have a "dynamic" type. For example, we could
have two rows in which the ActivityObjects column has these values:
```
ActivityObjects -> [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}, ..., ]
ActivityObjects -> [{"MachineId": "05-10-15"}, ..., ]
```
The way we have dealt with this in Spark is simply to treat ActivityObjects as
`array<string>` (or `string`) and do any additional parsing at query time.
However, if we try to do the same with PyArrow:
```python
import ibis
import pyarrow.json as pj

# `schema` and `jsonl_stream` are defined elsewhere; `schema` declares
# ActivityObjects as a string column.
parse_options = pj.ParseOptions(explicit_schema=schema)
events = ibis.memtable(
    pj.read_json(jsonl_stream, parse_options=parse_options)
)
```
it throws an exception complaining that it encountered a list instead of a
string.
Is there a way to force this behaviour? As I understand it, this will
eventually be solved by the introduction of VariantType.
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]