npawar commented on PR #9224:
URL: https://github.com/apache/pinot/pull/9224#issuecomment-1247456006

   > Design doc: 
https://docs.google.com/document/d/1kTUfBud1SBSh_703mvu6ybkbIwiKKH9eXpdcpEmhC2E/edit
   > 
   > This is an extension of PR #9096
   > 
   > # Motivation
   > Most stream systems provide a message envelope, which encapsulates the 
record payload, along with record headers, keys and other system-specific 
metadata. For example:
   > 
   > 1. Kafka allows keyed records and additionally, provides headers
   > 2. Kinesis requires keyed records and includes some additional metadata 
such as sequenceId etc
   > 3. Pulsar also supports keyed records and allows including arbitrary 
properties.
   > 4. Pubsub supports keyed messages, along with user-defined attributes and 
message metadata.
   > 
   > Today, Pinot drops everything from the message envelope other than the 
record value itself. Hence, there needs to be a way to extract these values and 
present them in the Pinot table as regular columns (of course, they have to be 
defined in the Pinot schema).
   > 
   > This can be very useful for the Pinot user as they don't have to 
"pre-process" the stream to make the record metadata available in the data 
payload. It also prevents custom solutions (such as 
[this](https://github.com/startreedata/startree-pinot/pull/484/files)).
   > 
   > # Context
   > Want to clarify the terminology here. Typically, in most streaming 
systems, a record is composed of the following:
   > 
   > 1. Record key - usually a string, although Kafka allows any type (today, 
the `pinot-kafka` connector assumes the key to always be a string).
   > 2. Record value - the actual data payload. Pinot extracts only this value 
and decodes it.
   > 3. Record headers - these are user-defined record headers that can be 
specific to the publishing application. Typically, headers are meant to be 
efficient and small. For example, Kafka allows `<String, byte[]>` headers. 
Technically, `byte[]` can be anything, and we can make a call on whether to 
support arbitrary header value types or not.
   > 4. Record metadata - this may or may not be included in the record 
payload, and it is system-defined. For example, for message identifiers, Kinesis 
has `sequenceId`, Kafka has `offset`, Pubsub has `messageId`, etc. While these 
may not be useful for the user-facing application, they come in handy for 
debugging.
   > 
   > # What does this PR do?
   > This PR attempts to extract the key, headers and other metadata from any 
supported streaming connector. This feature is opt-in, meaning it can be 
enabled by setting `stream.$streamType.metadata.populate` to `true`.
   > 
   > Please note:
   > 
   > 1. I am in the process of adding some unit tests. I have tested with a 
pinot realtime quickstart. Need to do some more cleanup.
   > 2. For whatever reason, the integration tests fail in the CI pipeline 
here, whereas they run fine on my laptop. Still fixing forward.
   > 3. Documentation for this feature will follow after this PR is merged.
   > 
   > For Reviewers, things to discuss:
   > 
   > 1. In the current patch, the record key (when available) is extracted as 
the `__key` column, whereas headers are extracted as `header$<HEADER_KEY_NAME>`. 
Does this sound like a good convention to follow for all stream connectors? 
That is, header columns will always be prefixed with `header$`, and any other 
metadata such as key or offset will be prefixed with `__`.
   > 2. In `MessageBatch`, I have marked one of the methods as `@Deprecated` as 
I am hoping to eventually eliminate the need for the typed interface there. The 
current changes are backwards compatible. Let me know if there is a better way.
   
   Would prefer if we're able to keep it all consistent in terms of the prefix 
(if going with __, then __key, __header$headerName, __metadata)
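
As a rough illustration of the consistent prefix suggested above, here is a minimal sketch of how key, header and metadata values could be mapped to column names. This is not code from the PR: the class `MetadataColumnNames`, its methods, and the `__metadata$<name>` form for metadata columns are all hypothetical and only show the naming idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pinot: shows one way to apply a single
// "__" prefix to key, header and metadata columns.
public class MetadataColumnNames {
    public static final String PREFIX = "__";
    public static final String HEADER_PREFIX = PREFIX + "header$";
    // Assumption: metadata fields get their own "__metadata$" namespace.
    public static final String METADATA_PREFIX = PREFIX + "metadata$";

    public static String keyColumn() {
        return PREFIX + "key";
    }

    public static String headerColumn(String headerName) {
        return HEADER_PREFIX + headerName;
    }

    public static String metadataColumn(String fieldName) {
        return METADATA_PREFIX + fieldName;
    }

    // Flattens a record's key, headers and metadata into column name -> value pairs.
    public static Map<String, String> toColumns(String key,
                                                Map<String, String> headers,
                                                Map<String, String> metadata) {
        Map<String, String> columns = new LinkedHashMap<>();
        if (key != null) {
            columns.put(keyColumn(), key);
        }
        for (Map.Entry<String, String> e : headers.entrySet()) {
            columns.put(headerColumn(e.getKey()), e.getValue());
        }
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            columns.put(metadataColumn(e.getKey()), e.getValue());
        }
        return columns;
    }
}
```

With this scheme, every extracted column shares the `__` prefix, so it cannot collide with a field of the same name in the record value unless that field also starts with `__`.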


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

