[ 
https://issues.apache.org/jira/browse/NIFI-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard updated NIFI-15745:
----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Schema Inference is very inefficient when complex inner fields have many 
> nullable values
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-15745
>                 URL: https://issues.apache.org/jira/browse/NIFI-15745
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 2.9.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When records contain inner records (objects) and we infer a schema over 
> many records, some of the inner fields are often nullable and therefore not 
> present (especially common in JSON). In that case, our inference creates a 
> {{UNION}} of record types. For example, given:
> {code:java}
> [{
>   "name": "Mark",
>   "project": {
>     "name": "nifi",
>     "org": "The Apache Software Foundation",
>     "yearEstablished": 2014
>   }
> },
> {
>   "name": "John",
>   "project": {
>     "name": "nifi",
>     "language": "Java",
>     "jiraProject": "NIFI"
>   },
>   "language": {
>     "name": "Java"
>   }
> }] {code}
> Each of these records has an inner record with nullable fields, so the 
> inferred schema would define {{project}} as a {{UNION}} of two Record types.
> This works fine for a simple example like this one. But consider a FlowFile 
> with thousands or tens of thousands of Records, where inner objects can be 
> very complex. The {{UNION}} becomes massive, and it takes an inordinate 
> amount of time to infer the schema.
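
To make the growth described above concrete, here is a minimal sketch. It is not NiFi's inference code: the class `UnionGrowthSketch`, its methods, and the `optionalN` field names are hypothetical, and an inferred record type is modeled simply as the set of field names that were present. Under that assumption, each distinct combination of present fields contributes one member to the UNION.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this does not use or mirror NiFi's API.
public class UnionGrowthSketch {

    // Assumption: a record type is identified by the set of field names present.
    // Returns one entry per distinct field set, i.e. one per UNION member.
    public static Set<Set<String>> inferUnionMembers(List<Set<String>> innerRecords) {
        Set<Set<String>> members = new LinkedHashSet<>();
        for (Set<String> fields : innerRecords) {
            members.add(new TreeSet<>(fields));
        }
        return members;
    }

    // Builds every present/absent combination of the given number of optional
    // fields (named optional0, optional1, ...), always including "name".
    public static List<Set<String>> allShapes(int optionalFieldCount) {
        List<Set<String>> shapes = new ArrayList<>();
        int combinations = 1 << optionalFieldCount;
        for (int mask = 0; mask < combinations; mask++) {
            Set<String> shape = new TreeSet<>();
            shape.add("name");
            for (int i = 0; i < optionalFieldCount; i++) {
                if ((mask & (1 << i)) != 0) {
                    shape.add("optional" + i);
                }
            }
            shapes.add(shape);
        }
        return shapes;
    }

    public static void main(String[] args) {
        // The two "project" objects from the example in the description.
        List<Set<String>> example = List.of(
                Set.of("name", "org", "yearEstablished"),
                Set.of("name", "language", "jiraProject"));
        System.out.println(inferUnionMembers(example).size()); // prints 2

        // Ten optional inner fields allow 2^10 distinct shapes.
        System.out.println(inferUnionMembers(allShapes(10)).size()); // prints 1024
    }
}
```

In this model, ten optional inner fields are already enough for up to 1024 UNION members, which is consistent with the slowdown the description reports for large FlowFiles with complex inner objects.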



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
