kyrre opened a new issue, #45574:
URL: https://github.com/apache/arrow/issues/45574
### Describe the usage question you have. Please include as many useful details as possible.
We want to use PyArrow for ETL jobs in which JSON files are periodically read
from Azure Blob Storage and inserted into Delta Lake tables. While the schemas
are available, some of the columns have a "dynamic" type. For example, we could
have two rows in which the ActivityObjects column has these values:
```
ActivityObjects -> [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}, ..., ]
ActivityObjects -> [{"MachineId": "05-10-15"}, ..., ]
```
The way we have dealt with this in Spark is simply to treat ActivityObjects as
`array<string>` (or `string`) and do any additional parsing at query time.
However, if we try to do the same with PyArrow:
```python
import ibis
import pyarrow.json as pj

# `schema` and `jsonl_stream` are defined elsewhere; `schema` declares
# ActivityObjects as a string column.
parse_options = pj.ParseOptions(explicit_schema=schema)
events = ibis.memtable(
    pj.read_json(jsonl_stream, parse_options=parse_options)
)
```
it throws an exception complaining that it encountered a list instead of a
string.
Is there a way to force this behaviour? As I understand it, this will
eventually be solved by the introduction of VariantType.
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]