[I] Provide pinot schema when initializing StreamMessageDecoder [pinot]

via GitHub Wed, 28 Feb 2024 23:14:17 -0800


rseetham opened a new issue, #12521:
URL: https://github.com/apache/pinot/issues/12521


   
   [StreamMessageDecoder's](https://github.com/apache/pinot/blob/ac13a191b945a80084f0a2794391e4be2f463252/pinot-spi/src/main/java/org/apache/pinot/spi/stream/StreamMessageDecoder.java#L49) init is
   `void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)`
   
   It would be great if the decoder had access to the Pinot schema as well. At Uber, we have our own internal decoder for Avro messages. It uses the AvroRecordExtractor at the end, but we need access to the Pinot schema to do some custom things.
   Initially, this class had access to the Pinot schema, but that was [removed in 2020](https://github.com/apache/pinot/pull/5309).
   This was done because:
   
   > RecordReader and StreamMessageDecoder is the entry point for batch and streaming data ingestion. They are expected to be implemented and plugged to provide customized format support.
   > To make the abstraction more crispy and easier to understand, remove the Schema and replace it with fields to read so that users do not need to worry about extracting fields from the Pinot schema when adding a new format.
   
   fieldsToRead is generated [here](https://github.com/apache/pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java#L1477) using
   `Set<String> fieldsToRead = IngestionUtils.getFieldsForRecordExtractor(_tableConfig.getIngestionConfig(), _schema);`
   In the [implementation](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/IngestionUtils.java#L310), if SchemaConformingTransformerConfig is present, an empty fieldsToRead is returned. If fieldsToRead is empty, other parts of the decoder code assume that all fields in the input schema have to be extracted anyway. [Example](https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java#L52).
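   To illustrate the "empty fieldsToRead means extract everything" convention, here is a minimal sketch. It is a simplified stand-in, not the actual AvroRecordExtractor code: the class name `FieldSelectingExtractor` is hypothetical, and a plain `Map` is assumed in place of an Avro record.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.Set;

   // Hypothetical, simplified stand-in for a record extractor (not a Pinot class).
   final class FieldSelectingExtractor {
       private final Set<String> fields;
       private final boolean extractAll;

       FieldSelectingExtractor(Set<String> fieldsToRead) {
           // A null or empty set is interpreted as "no restriction": extract all fields.
           this.extractAll = fieldsToRead == null || fieldsToRead.isEmpty();
           this.fields = extractAll ? Set.of() : Set.copyOf(fieldsToRead);
       }

       // The input map stands in for a decoded message (assumption for this sketch).
       Map<String, Object> extract(Map<String, Object> input) {
           Map<String, Object> out = new LinkedHashMap<>();
           if (extractAll) {
               // No explicit field list: copy every field present in the input.
               out.putAll(input);
           } else {
               // Explicit field list: copy only the requested fields (null if absent).
               for (String field : fields) {
                   out.put(field, input.get(field));
               }
           }
           return out;
       }
   }
   ```

   With this convention, a decoder that receives an empty fieldsToRead cannot tell from the set alone which columns the table actually defines, which is one reason the schema itself would be useful.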
   
   The request here is to add the schema to the initializer of StreamMessageDecoder. fieldsToRead would still be there and used for its existing purposes, but the schema is a nice-to-have in the decoder (in our case, we want to know what the time column is). More generally, if a decoder wants to do something specific based on the Pinot schema, it would be nice to have access to it.
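   One possible shape for this, sketched under assumptions, is an additional `init` overload with a default implementation that delegates to the existing three-argument `init`, so that current decoders keep working unchanged. Everything below is hypothetical: `SketchSchema` stands in for the Pinot schema class (its `getTimeColumn()` accessor is invented for illustration), `SketchDecoder` stands in for StreamMessageDecoder, and the exact signature is not something the project has agreed on.

   ```java
   import java.util.Map;
   import java.util.Set;

   // Hypothetical stand-in for the Pinot schema; getTimeColumn() is invented for this sketch.
   final class SketchSchema {
       private final String timeColumn;

       SketchSchema(String timeColumn) {
           this.timeColumn = timeColumn;
       }

       String getTimeColumn() {
           return timeColumn;
       }
   }

   // Hypothetical decoder interface modeled on the init signature quoted in this issue.
   interface SketchDecoder {
       void init(Map<String, String> props, Set<String> fieldsToRead, String topicName);

       // Proposed overload (assumption): by default it ignores the schema and
       // delegates to the existing init, so existing implementations still work.
       default void init(Map<String, String> props, Set<String> fieldsToRead,
                         String topicName, SketchSchema schema) {
           init(props, fieldsToRead, topicName);
       }
   }

   // An existing decoder that only implements the three-argument init.
   final class LegacyDecoder implements SketchDecoder {
       String topic;

       @Override
       public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName) {
           this.topic = topicName;
       }
   }

   // A decoder that opts in to the schema, e.g. to learn the time column.
   final class TimeAwareDecoder implements SketchDecoder {
       String topic;
       String timeColumn;

       @Override
       public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName) {
           this.topic = topicName;
       }

       @Override
       public void init(Map<String, String> props, Set<String> fieldsToRead,
                        String topicName, SketchSchema schema) {
           init(props, fieldsToRead, topicName);
           this.timeColumn = schema.getTimeColumn();
       }
   }
   ```

   The caller would always invoke the four-argument `init`; decoders that do not care about the schema fall through to their existing initializer.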
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

