[ https://issues.apache.org/jira/browse/NIFI-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068234#comment-18068234 ]
ASF subversion and git services commented on NIFI-15745:
--------------------------------------------------------
Commit a681c97e2286412dece351f45528fd946889cb08 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=a681c97e228 ]
NIFI-15745: Instead of inferring a UNION of many RECORD types, instead infer a
single RECORD type that is widened with all potential fields (#11039)
> Schema Inference is very inefficient when complex inner fields have many
> nullable values
> ----------------------------------------------------------------------------------------
>
> Key: NIFI-15745
> URL: https://issues.apache.org/jira/browse/NIFI-15745
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 2.9.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When records contain inner records (objects) and we infer a schema over many
> records, some of the inner fields may be nullable and therefore absent from a
> given record (especially common in JSON). In that case, our inference creates
> a {{UNION}} of RECORD types. For example, if we had:
> {code:java}
> [{
>   "name": "Mark",
>   "project": {
>     "name": "nifi",
>     "org": "The Apache Software Foundation",
>     "yearEstablished": 2014
>   }
> },
> {
>   "name": "John",
>   "project": {
>     "name": "nifi",
>     "language": "Java",
>     "jiraProject": "NIFI"
>   },
>   "language": {
>     "name": "Java"
>   }
> }]
> {code}
> Each of these records has an inner record with nullable fields, so the schema
> would define {{project}} as a {{UNION}} of two RECORD types.
> This works okay for a simple example like this. But consider a FlowFile with
> thousands or tens of thousands of Records, where inner objects can be very
> complex. The UNION becomes massive, and it takes an inordinate amount of time
> to infer the schema.
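
The difference between the two strategies can be illustrated with a minimal sketch. This is not NiFi's implementation: the function names (scalar_type, union_of_records, widened_record) and the simplified type names are hypothetical, and the sketch assumes flat inner objects with non-null scalar values. It only contrasts "one RECORD alternative per distinct shape" with "one RECORD widened with every field seen".

```python
def scalar_type(value):
    """Map a parsed JSON scalar to a simplified type name (illustrative only)."""
    # bool must be checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def union_of_records(inner_objects):
    """Old behaviour, sketched: keep one RECORD alternative per distinct shape."""
    shapes = []
    for obj in inner_objects:
        shape = {name: scalar_type(value) for name, value in obj.items()}
        if shape not in shapes:  # the list grows with every new field combination
            shapes.append(shape)
    return shapes


def widened_record(inner_objects):
    """New behaviour, sketched: one RECORD holding every field seen so far."""
    field_types = {}  # field name -> set of observed type names
    seen_count = {}   # field name -> number of objects that contain the field
    total = 0
    for obj in inner_objects:
        total += 1
        for name, value in obj.items():
            field_types.setdefault(name, set()).add(scalar_type(value))
            seen_count[name] = seen_count.get(name, 0) + 1
    # A field missing from at least one object is treated as nullable.
    return {
        name: {"types": sorted(types), "nullable": seen_count[name] < total}
        for name, types in field_types.items()
    }


# The two "project" objects from the example in the issue description.
projects = [
    {"name": "nifi", "org": "The Apache Software Foundation", "yearEstablished": 2014},
    {"name": "nifi", "language": "Java", "jiraProject": "NIFI"},
]

union = union_of_records(projects)
widened = widened_record(projects)
```

For the two project objects above, the union holds two RECORD alternatives, while the widened result is a single RECORD with five fields in which only name is non-nullable. With many distinct field combinations, the union grows with the number of combinations, whereas the widened RECORD grows only with the number of distinct field names.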
--
This message was sent by Atlassian Jira
(v8.20.10#820010)