kirkrodrigues opened a new pull request, #11210:
URL: https://github.com/apache/pinot/pull/11210

   tags: feature, release-notes
   
   This adds a 
[RecordTransformer](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/RecordTransformer.java)
 to transform semi-structured (e.g., JSON) log events to fit a table's schema 
without dropping fields.
   
   JSON log events typically have a user-defined schema, so it is impractical 
to store each field in its own table column. At the same time, most (if not 
all) fields are important to the user, so we should not drop any field 
unnecessarily. Thus, this transformer takes record fields that don't exist in 
the schema and stores them in a catchall field.
   
   For example, consider this log event:
   ```
    {
      "timestamp": 1687786535928,
      "hostname": "host1",
      "level": "INFO",
      "message": "Started processing job1",
      "tags": {
        "platform": "data",
        "service": "serializer",
        "params": {
          "queueLength": 5,
          "timeout": 299,
          "userData_noIndex": {
            "nth": 99
          }
        }
      }
    }
   ```
    Suppose the table's schema contains these fields:
   * timestamp
   * hostname
   * level
   * message
   * tags.platform
   * tags.service
   * indexableExtras
   * unindexableExtras
   
    Without this transformer, the entire `tags` field would be dropped when 
storing the record in the table. However,
    with this transformer, the record would be transformed into the following:
   ```
    {
      "timestamp": 1687786535928,
      "hostname": "host1",
      "level": "INFO",
      "message": "Started processing job1",
      "tags.platform": "data",
      "tags.service": "serializer",
      "indexableExtras": {
        "tags": {
          "params": {
            "queueLength": 5,
            "timeout": 299
          }
        }
      },
      "unindexableExtras": {
        "tags": {
          "userData_noIndex": {
            "nth": 99
          }
        }
      }
    }
   ```
   
   Notice that the transformer:
   * Flattens nested fields that exist in the schema, like `tags.platform`
   * Moves fields that don't exist in the schema, and don't have the suffix 
"_noIndex", into the `indexableExtras` field
   * Moves fields that don't exist in the schema and have the suffix 
"_noIndex" into the `unindexableExtras` field
   
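The three rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the transformer's actual Java implementation: the function names `transform` and `_set_nested` are hypothetical, and keeping each extra field under its full original path is an assumption of this sketch.

```python
from typing import Any

UNINDEXABLE_SUFFIX = "_noIndex"  # suffix taken from the example above


def _set_nested(target: dict, path: list, value: Any) -> None:
    """Store value in target under the nested keys in path (hypothetical helper)."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def transform(record: dict, schema_fields: set) -> dict:
    """Illustrative only: fit a nested record to a flat schema without dropping fields."""
    out: dict = {}
    indexable: dict = {}
    unindexable: dict = {}

    def walk(node: dict, path: list) -> None:
        for key, value in node.items():
            child_path = path + [key]
            flat_name = ".".join(child_path)
            if flat_name in schema_fields:
                # Rule 1: a field that exists in the schema is stored under its flattened name.
                out[flat_name] = value
            elif key.endswith(UNINDEXABLE_SUFFIX):
                # Rule 3: not in the schema and suffixed, so the whole subtree is unindexable.
                _set_nested(unindexable, child_path, value)
            elif isinstance(value, dict) and value:
                # A nested object may still contain schema fields, so descend into it.
                walk(value, child_path)
            else:
                # Rule 2: not in the schema and not suffixed, so it is an indexable extra.
                _set_nested(indexable, child_path, value)

    walk(record, [])
    if indexable:
        out["indexableExtras"] = indexable
    if unindexable:
        out["unindexableExtras"] = unindexable
    return out


record = {
    "timestamp": 1687786535928,
    "hostname": "host1",
    "level": "INFO",
    "message": "Started processing job1",
    "tags": {
        "platform": "data",
        "service": "serializer",
        "params": {
            "queueLength": 5,
            "timeout": 299,
            "userData_noIndex": {"nth": 99},
        },
    },
}
schema_fields = {
    "timestamp", "hostname", "level", "message", "tags.platform", "tags.service",
}
result = transform(record, schema_fields)
```

With this input, `result` contains the flattened `tags.platform` and `tags.service` keys and no top-level `tags` key. Because the sketch preserves full paths, `userData_noIndex` ends up under `tags.params` inside `unindexableExtras`.
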
   The `unindexableExtras` field allows the transformer to separate fields 
which don't need indexing (because they are
    only retrieved, not searched) from those that do. The transformer also has 
other configuration options specified in `JsonLogTransformerConfig`.
   
   This is part of the change requested in #9819 and described in this [design 
doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit#heading=h.itv87iq05rqh).
   
   # Testing performed
   * Added new unit tests.
   * Validated that JSON log events with dynamic schemas could be ingested 
into a table without dropping fields (unless configured to).
   



