kirkrodrigues opened a new pull request, #11210: URL: https://github.com/apache/pinot/pull/11210
tags: feature, release-notes

This adds a [RecordTransformer](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/RecordTransformer.java) to transform semi-structured (e.g., JSON) log events to fit a table's schema without dropping fields.

JSON log events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus, this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall field.

For example, consider this log event:

```
{
  "timestamp": 1687786535928,
  "hostname": "host1",
  "level": "INFO",
  "message": "Started processing job1",
  "tags": {
    "platform": "data",
    "service": "serializer",
    "params": {
      "queueLength": 5,
      "timeout": 299,
      "userData_noIndex": {
        "nth": 99
      }
    }
  }
}
```

And let's say the table's schema contains these fields:

* timestamp
* hostname
* level
* message
* tags.platform
* tags.service
* indexableExtras
* unindexableExtras

Without this transformer, the entire `tags` field would be dropped when storing the record in the table.
However, with this transformer, the record would be transformed into the following:

```
{
  "timestamp": 1687786535928,
  "hostname": "host1",
  "level": "INFO",
  "message": "Started processing job1",
  "tags.platform": "data",
  "tags.service": "serializer",
  "indexableExtras": {
    "tags": {
      "params": {
        "queueLength": 5,
        "timeout": 299
      }
    }
  },
  "unindexableExtras": {
    "tags": {
      "userData_noIndex": {
        "nth": 99
      }
    }
  }
}
```

Notice that the transformer:

* Flattens nested fields which exist in the schema, like `tags.platform`
* Moves fields which don't exist in the schema into the `indexableExtras` field
* Moves fields which don't exist in the schema and have the suffix "_noIndex" into the `unindexableExtras` field

The `unindexableExtras` field allows the transformer to separate fields which don't need indexing (because they are only retrieved, not searched) from those that do.

The transformer also has other configuration options specified in `JsonLogTransformerConfig`.

This is part of the change requested in #9819 and described in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit#heading=h.itv87iq05rqh).

# Testing performed

* Added new unit tests.
* Validated JSON log events with dynamic schemas could be ingested into a table without dropping fields (unless configured to).
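
For readers who want to experiment with the three rules listed under "Notice that the transformer", here is a minimal sketch in plain Java (9 or later). It is not the code in this PR: the class `JsonLogTransformSketch` and its methods `transform`, `walk`, and `putNested` are hypothetical names, the extras field names and the "_noIndex" suffix are hard-coded from the example, and none of the `JsonLogTransformerConfig` options are modeled. The sketch also assumes that every extra keeps its full parent path, so `userData_noIndex` lands under `tags.params` here, which may differ from the layout the real transformer produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not Pinot's implementation.
class JsonLogTransformSketch {
  // Assumed names, taken from the example in the PR description.
  static final String INDEXABLE_EXTRAS = "indexableExtras";
  static final String UNINDEXABLE_EXTRAS = "unindexableExtras";
  static final String UNINDEXABLE_SUFFIX = "_noIndex";

  /** Fits a nested record to a flat set of schema column names (dotted paths). */
  static Map<String, Object> transform(Map<String, Object> record, Set<String> schemaFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    Map<String, Object> indexable = new LinkedHashMap<>();
    Map<String, Object> unindexable = new LinkedHashMap<>();
    walk(record, new ArrayList<>(), schemaFields, out, indexable, unindexable);
    // Only emit the catchall fields when something was routed into them.
    if (!indexable.isEmpty()) {
      out.put(INDEXABLE_EXTRAS, indexable);
    }
    if (!unindexable.isEmpty()) {
      out.put(UNINDEXABLE_EXTRAS, unindexable);
    }
    return out;
  }

  @SuppressWarnings("unchecked")
  private static void walk(Map<String, Object> node, List<String> parents, Set<String> schemaFields,
      Map<String, Object> out, Map<String, Object> indexable, Map<String, Object> unindexable) {
    for (Map.Entry<String, Object> entry : node.entrySet()) {
      List<String> path = new ArrayList<>(parents);
      path.add(entry.getKey());
      String column = String.join(".", path);
      Object value = entry.getValue();
      if (schemaFields.contains(column)) {
        // Rule 1: a field that exists in the schema is stored under its flattened name.
        out.put(column, value);
      } else if (entry.getKey().endsWith(UNINDEXABLE_SUFFIX)) {
        // Rule 3: a non-schema field with the suffix goes to unindexableExtras, subtree intact.
        putNested(unindexable, path, value);
      } else if (value instanceof Map) {
        // Descend, since children may still match schema columns or carry the suffix.
        walk((Map<String, Object>) value, path, schemaFields, out, indexable, unindexable);
      } else {
        // Rule 2: any other non-schema leaf goes to indexableExtras.
        putNested(indexable, path, value);
      }
    }
  }

  /** Stores value under root, creating intermediate maps along path as needed. */
  @SuppressWarnings("unchecked")
  private static void putNested(Map<String, Object> root, List<String> path, Object value) {
    Map<String, Object> current = root;
    for (String key : path.subList(0, path.size() - 1)) {
      current = (Map<String, Object>) current.computeIfAbsent(key, k -> new LinkedHashMap<String, Object>());
    }
    current.put(path.get(path.size() - 1), value);
  }

  public static void main(String[] args) {
    Map<String, Object> params =
        Map.of("queueLength", 5, "timeout", 299, "userData_noIndex", Map.of("nth", 99));
    Map<String, Object> tags = Map.of("platform", "data", "service", "serializer", "params", params);
    Map<String, Object> record = Map.of("timestamp", 1687786535928L, "hostname", "host1",
        "level", "INFO", "message", "Started processing job1", "tags", tags);
    Set<String> schema = Set.of("timestamp", "hostname", "level", "message",
        "tags.platform", "tags.service", INDEXABLE_EXTRAS, UNINDEXABLE_EXTRAS);
    System.out.println(transform(record, schema));
  }
}
```

Checking the schema before the suffix mirrors the wording of the third rule, which only routes fields that "don't exist in the schema" to `unindexableExtras`. An empty nested object that matches no rule is silently skipped in this sketch; whether that is desirable is a design decision the sketch does not try to settle.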