udaysagar2177 commented on issue #17331: URL: https://github.com/apache/pinot/issues/17331#issuecomment-3669720038
Thanks for bringing up WarpStream. That’s a helpful comparison, and I agree there’s some conceptual overlap. I haven’t done a deep evaluation of WarpStream, but a few thoughts of my own suggest exploring this approach within the existing Pinot/Kafka abstractions. 1. Ecosystem and experimentation considerations WarpStream seems valuable for reducing Kafka-related costs, but being closed source makes prototyping and adoption a bit harder. Depending on usage and pricing, introducing WarpStream could add incremental costs. There is also an open-source alternative called AutoMQ, though adopting it without paid support would require operating an additional system. 2. Kafka as a lightweight coordination layer Kafka-based pipelines are likely to exist for other use cases in a system architecture. In that context, maintaining a small Kafka topic for micro-batch descriptors appears manageable. Kafka benefits from strong community support and availability through managed service offerings. 3. File-centric ingestion as a first-class path In some pipelines, data is naturally produced as Avro or Parquet in object storage to serve other downstream consumers. For those cases, creating a Kafka-style message production model solely for Pinot ingestion may be less desirable. Leveraging the PinotFS abstraction allows ingestion to remain closer to existing file-based workflows without requiring additional frameworks (e.g., Spark or Minion) to transform file data into a Kafka stream. Caveats and considerations Data routing generally needs to be determined before file generation so files can map cleanly to Kafka partitions consumed by Pinot. If that isn’t the case, files may need to be split or reorganized, which could increase object-store API costs and system complexity. To address this: - Referencing exact byte ranges within a file is possible, but this approach would diverge from widely adopted file formats and their tooling ecosystem. - Embedding filtering logic in the micro-batch descriptor is another option, though it could introduce additional design complexity. The current abstractions do support micro-batch processing, and I am actively working through the necessary implementation details. While it appears feasible, I see a few areas that might complicate it, such as handling consumer poll intervals to accommodate rebalances during micro-batch processing, reduced visibility into backlog, increased compute load to parse data for Pinot servers, etc. I’m happy to submit a work-in-progress PR to help us assess whether this is still a good idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
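To make the descriptor idea concrete, here is a minimal sketch of what a micro-batch descriptor and its producer could look like. Everything in it is hypothetical: the `MicroBatchDescriptor` fields, the `pinot-microbatch-descriptors` topic name, and the JSON encoding are illustrative assumptions, not an existing Pinot API. The optional byte range and filter expression correspond to the two mitigation options listed above.

```java
import java.net.URI;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MicroBatchDescriptorSketch {

  // Hypothetical descriptor documenting the payload schema; field names are
  // illustrative, not an existing Pinot API. The optional byte range and
  // filter expression cover the two mitigation options discussed above.
  record MicroBatchDescriptor(
      URI fileUri,               // e.g. s3://bucket/events/part-0001.parquet
      String format,             // AVRO or PARQUET
      long startByteOffset,      // -1 means "whole file"
      long endByteOffset,
      String filterExpression) { // optional row-level predicate, may be null
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Serialized as JSON here purely for readability; any compact encoding
    // would work since the descriptor only references data, never carries it.
    String descriptorJson = "{\"fileUri\":\"s3://bucket/events/part-0001.parquet\","
        + "\"format\":\"PARQUET\",\"startByteOffset\":-1,\"endByteOffset\":-1}";

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The record key carries the routing decision so that each descriptor
      // lands on the Kafka partition whose Pinot consumer owns that data.
      producer.send(new ProducerRecord<>("pinot-microbatch-descriptors",
          "routing-key-3", descriptorJson));
    }
  }
}
```

Because each descriptor is a few hundred bytes regardless of the file size it references, the coordination topic stays small even at high file volume, which is what makes point 2 above tractable.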

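On the consumer side, the poll-interval concern could be handled with Kafka's standard pause/resume pattern: pausing the assignment lets the loop keep calling `poll()` (staying within `max.poll.interval.ms` and remaining responsive to rebalances) while the long file read runs on another thread. This is only a sketch under the assumption that `PinotFS.open(URI)` is used for streaming reads; a real implementation would also need a `ConsumerRebalanceListener` to handle revoked partitions and a proper Avro/Parquet decoder instead of the line-based read shown here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.pinot.spi.filesystem.PinotFS;

public class MicroBatchConsumerSketch {

  // Keep the consumer group healthy while a long-running micro-batch is
  // processed: pause() makes poll() return no records but still counts as a
  // poll, so the consumer stays within max.poll.interval.ms during long reads.
  static void processWithKeepAlive(KafkaConsumer<String, String> consumer,
      PinotFS pinotFS, URI fileUri) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    consumer.pause(consumer.assignment());
    Future<?> work = executor.submit(() -> readFile(pinotFS, fileUri));
    while (!work.isDone()) {
      consumer.poll(Duration.ofMillis(500)); // keep-alive; no records while paused
    }
    work.get(); // surface any processing failure before committing
    consumer.resume(consumer.assignment());
    consumer.commitSync();
    executor.shutdown();
  }

  // Reads the referenced object through the PinotFS abstraction so the same
  // code path works for s3://, gs://, hdfs://, etc. A real implementation
  // would decode Avro/Parquet rather than reading text lines.
  static void readFile(PinotFS pinotFS, URI fileUri) {
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(pinotFS.open(fileUri)))) {
      reader.lines().forEach(line -> { /* decode and index rows here */ });
    } catch (Exception e) {
      throw new RuntimeException("Failed to read " + fileUri, e);
    }
  }
}
```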