udaysagar2177 commented on issue #17331:
URL: https://github.com/apache/pinot/issues/17331#issuecomment-3669720038

   Thanks for bringing up WarpStream. That’s a helpful comparison, and I agree 
there’s some conceptual overlap. I haven’t done a deep evaluation of 
WarpStream, but a few considerations lead me to keep exploring this approach 
within the existing Pinot/Kafka abstractions.
   
   1. Ecosystem and experimentation considerations
   WarpStream seems valuable for reducing Kafka-related costs, but being closed 
source makes prototyping and adoption a bit harder. Depending on usage and 
pricing, introducing WarpStream could add incremental costs. There is also an 
open-source alternative called AutoMQ, though adopting it without paid support 
would require operating an additional system.
   
   2. Kafka as a lightweight coordination layer
   Kafka-based pipelines typically already exist for other use cases in a 
system architecture. In that context, maintaining one small additional Kafka 
topic for micro-batch descriptors is a manageable operational cost. Kafka also 
benefits from strong community support and wide availability through managed 
service offerings.
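   To make the descriptor idea concrete, here is a minimal sketch of what such 
a message might look like. This is not an actual Pinot schema; the class and 
field names are illustrative assumptions.

```java
// Hypothetical sketch: a micro-batch "descriptor" published to a small Kafka
// topic in place of the row data itself. Field names are illustrative only.
import java.util.Objects;

public final class MicroBatchDescriptor {
    private final String fileUri;      // object-store location, e.g. s3://bucket/path/file.parquet
    private final long rowCount;       // number of rows in the referenced file
    private final long creationTimeMs; // producer-side creation time

    public MicroBatchDescriptor(String fileUri, long rowCount, long creationTimeMs) {
        this.fileUri = Objects.requireNonNull(fileUri);
        this.rowCount = rowCount;
        this.creationTimeMs = creationTimeMs;
    }

    /** Serializes the descriptor as a small JSON payload for the Kafka message value. */
    public String toJson() {
        return String.format(
            "{\"fileUri\":\"%s\",\"rowCount\":%d,\"creationTimeMs\":%d}",
            fileUri, rowCount, creationTimeMs);
    }

    public String fileUri() { return fileUri; }
    public long rowCount() { return rowCount; }
}
```

   Since the descriptor is tiny relative to the data it points at, the topic 
stays cheap to operate even at high file throughput.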
   
   3. File-centric ingestion as a first-class path
   In some pipelines, data is naturally produced as Avro or Parquet in object 
storage to serve other downstream consumers. For those cases, creating a 
Kafka-style message production model solely for Pinot ingestion may be less 
desirable. Leveraging the PinotFS abstraction allows ingestion to remain closer 
to existing file-based workflows without requiring additional frameworks (e.g., 
Spark or Minion) to transform file data into a Kafka stream.
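   As a toy illustration of staying close to the file-based workflow: the real 
abstraction is `org.apache.pinot.spi.filesystem.PinotFS`; the class below is a 
deliberately simplified in-memory stand-in, only meant to show the shape of 
opening micro-batch files directly from wherever they were produced.

```java
// Simplified stand-in for a PinotFS-style filesystem abstraction. The real
// interface (org.apache.pinot.spi.filesystem.PinotFS) targets actual object
// stores; this toy version backs URIs with an in-memory map so the sketch is
// self-contained and runnable.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public final class ToyFs {
    // Simulates an object store: URI -> object bytes.
    private final Map<URI, byte[]> store = new HashMap<>();

    void put(URI uri, String contents) {
        store.put(uri, contents.getBytes(StandardCharsets.UTF_8));
    }

    /** Mirrors the open-a-stream shape: returns a stream over the stored object. */
    InputStream open(URI uri) throws IOException {
        byte[] bytes = store.get(uri);
        if (bytes == null) {
            throw new IOException("No such object: " + uri);
        }
        return new ByteArrayInputStream(bytes);
    }
}
```

   The point is that the ingestion path reads the file where it already lives; 
no Kafka-shaped re-encoding step sits between the producer and Pinot.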
   
   Caveats and considerations
   
   Data routing generally needs to be determined before file generation so 
files can map cleanly to Kafka partitions consumed by Pinot. If that isn’t the 
case, files may need to be split or reorganized, which could increase 
object-store API costs and system complexity. To address this:
   - Referencing exact byte ranges within a file is possible, but this approach 
would diverge from widely adopted file formats and their tooling ecosystem.
   - Embedding filtering logic in the micro-batch descriptor is another option, 
though it could introduce additional design complexity.
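   A minimal sketch of the second option, assuming a single equality predicate 
embedded in the descriptor (a real design would need a richer filter 
expression; the class and field names are hypothetical):

```java
// Hypothetical sketch: a descriptor that carries a simple filter (one column
// equality predicate) so a consumer can select only its rows from a shared
// file. Rows are modeled as column->value maps purely for illustration.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public final class FilteringDescriptor {
    final String fileUri;
    final String filterColumn;  // null means "take the whole file"
    final String filterValue;

    FilteringDescriptor(String fileUri, String filterColumn, String filterValue) {
        this.fileUri = fileUri;
        this.filterColumn = filterColumn;
        this.filterValue = filterValue;
    }

    /** Applies the embedded filter to decoded rows. */
    List<Map<String, String>> select(List<Map<String, String>> rows) {
        if (filterColumn == null) {
            return rows;
        }
        return rows.stream()
            .filter(r -> filterValue.equals(r.get(filterColumn)))
            .collect(Collectors.toList());
    }
}
```

   The trade-off is visible even in this toy form: every consumer still reads 
(and decodes) the whole file, and the filter semantics become part of the 
descriptor contract that has to be versioned and maintained.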
   
   The current abstractions do support micro-batch processing, and I am 
actively working through the implementation details. While it appears 
feasible, a few areas might complicate it: keeping the consumer polling 
frequently enough that rebalances are not triggered during long micro-batch 
processing, reduced visibility into the ingestion backlog, and the additional 
compute Pinot servers would spend parsing file data. I’m happy to submit a 
work-in-progress PR to help us assess whether this is still a good idea.
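   On the poll-interval point, one mitigation would be to process a micro-batch 
in bounded chunks and yield to the consumer between chunks, so that `poll()` is 
invoked before `max.poll.interval.ms` expires. A stdlib-only sketch of that 
pattern, where the `Heartbeat` hook is a hypothetical stand-in for pausing the 
assigned partitions and issuing an empty poll:

```java
// Illustrative sketch (not Pinot code): split micro-batch work into bounded
// chunks and invoke a heartbeat between chunks so the Kafka consumer can call
// poll() within its poll interval while a large file is being processed.
import java.util.List;

public final class ChunkedBatchProcessor {
    interface Heartbeat { void beat(); }  // stand-in for consumer.pause(...) + consumer.poll(...)

    /** Processes rows in chunks of at most chunkSize, invoking the heartbeat after each chunk. */
    static int process(List<String> rows, int chunkSize, Heartbeat heartbeat) {
        int processed = 0;
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            for (int i = start; i < end; i++) {
                processed++;  // real code would index the row into the consuming segment
            }
            heartbeat.beat();  // keeps the consumer group session alive between chunks
        }
        return processed;
    }
}
```

   Chunk size would have to be tuned so that one chunk comfortably fits inside 
the configured poll interval, which is exactly the kind of detail I want to 
validate in a prototype.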


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

