mangrrua opened a new pull request #5787:
URL: https://github.com/apache/incubator-pinot/pull/5787


   ## Description
   Pinot is a great tool for real-time OLAP queries. In many cases, users 
want to see aggregated results in real time. In real scenarios, however, not 
all data arrives in real time, and some data or reports must be calculated in 
batch (every hour, day, etc.). 
   [Apache Spark](https://spark.apache.org/) is a great tool for batch 
computing and data preparation in ETL processes, and many companies already 
use it. 
   
   Apache Spark and Apache Pinot work well together for data preparation, 
aggregation, and querying in many cases. But integration between the tools 
(e.g. spark-pinot) is important. Pinot has a `spark-batch-ingestion` module, 
but it requires some effort, and I know many developers like me would rather 
avoid that effort. The steps are:
   
   - The user analyzes data with Spark (for example), then saves the output to 
HDFS in Parquet, ORC, or a similar format. 
   - The user triggers a `spark-batch-ingestion` job to convert the analyzed 
results to offline segments, then waits for it to finish (which also means 
handling trigger mechanisms and failure scenarios). 
   - `spark-batch-ingestion` reads the input files produced above (each of 
these files represents one segment, so if the user wants to partition data in 
Pinot, the input files must be partitioned), creates segments, and writes them 
to the deep storage. 
   
   These extra steps are painful, and that is not all. 
   
   What happens when we want to re-index data, or apply more aggregation in 
Pinot? For example, we may want to re-index orders data by another dimension. 
Or maybe another department in your company wants to access your Pinot data? 
Yes, all of the steps must be repeated from scratch. 
   
   What is the suggested solution? A connector that reads from and writes to 
Pinot directly. 
   
   If we can read and write Pinot data from Spark directly, we can build 
pipelines such as:
   
   - Pinot(source) -> Spark(analyze) -> Pinot(sink)
   - Pinot(source) -> Spark(analyze) -> Somewhere(hdfs, cassandra, postgres etc)
   - Somewhere(hdfs, cassandra, postgres etc) -> Spark(analyze) -> Pinot(sink)
   
   A Spark connector can solve several of the problems described above. 
   
   The connector supports only reads for now. We are waiting for the new 
segment write API before adding write support, to avoid duplicate effort. This 
is just the initial version. In the future, `streaming endpoints` will be used 
for reads, and a write API will be added.
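
As a rough illustration of the read path, here is a minimal PySpark-style sketch. It assumes the connector registers a `pinot` data source and accepts `table`, `tableType`, and `controller` options; these names should be checked against the README linked below. The helper names, the default controller address, and the `airlineStats` table are hypothetical, used only for illustration.

```python
def pinot_read_options(table, table_type="offline", controller="localhost:9000"):
    """Collect reader options. The option names are assumptions to verify
    against the connector README; the defaults are illustrative only."""
    return {"table": table, "tableType": table_type, "controller": controller}


def read_pinot_table(spark, table, **kwargs):
    """Load a Pinot table as a DataFrame through the (assumed) "pinot" source.

    `spark` is an existing SparkSession with the connector jar on its classpath.
    """
    reader = spark.read.format("pinot")
    for key, value in pinot_read_options(table, **kwargs).items():
        reader = reader.option(key, value)
    return reader.load()
```

With an active session, something like `read_pinot_table(spark, "airlineStats").groupBy("Carrier").count().write.parquet(path)` would cover the Pinot(source) -> Spark(analyze) -> Somewhere flow. The `Carrier` column and `path` are placeholders.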
   
   For this version, see the [pinot-spark-connector 
README](https://github.com/mangrrua/incubator-pinot/blob/spark-pinot-connector/pinot-connectors/pinot-spark-connector/README.md)
 and the [read-model 
documentation](https://github.com/mangrrua/incubator-pinot/blob/spark-pinot-connector/pinot-connectors/pinot-spark-connector/documentation/read_model.md)
 for details (I will move the documentation to 
https://docs.pinot.apache.org/developers/developers-and-contributors/update-document).
   
   **Note:** Presto is a powerful engine for joins and other operations, but 
it is not the same as Spark. The use cases are different. This PR focuses on 
the Spark ecosystem and ETL processes. 
   
   Please share your comments and suggested improvements. 
   
   ## Upgrade Notes
   Does this PR prevent a zero down-time upgrade? (Assume upgrade order: 
Controller, Broker, Server, Minion)
   * [ ] Yes (Please label as **<code>backward-incompat</code>**, and complete 
the section below on Release Notes)
   
   Does this PR fix a zero-downtime upgrade introduced earlier?
   * [ ] Yes (Please label this as **<code>backward-incompat</code>**, and 
complete the section below on Release Notes)
   
   Does this PR otherwise need attention when creating release notes? Things to 
consider:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   * [ ] Yes (Please label this PR as **<code>release-notes</code>** and 
complete the section on Release Notes)
   ## Release Notes
   If you have tagged this as either backward-incompat or release-notes,
   you MUST add text here that you would like to see appear in release notes of 
the next release.
   
   If you have a series of commits adding or enabling a feature, then
   add this section only in final commit that marks the feature completed.
   Refer to earlier release notes to see examples of text
   
   ## Documentation
   If you have introduced a new feature or configuration, please add it to the 
documentation as well.
   See 
https://docs.pinot.apache.org/developers/developers-and-contributors/update-document
   




