jackjlli opened a new issue #8306:
URL: https://github.com/apache/pinot/issues/8306


   We’re planning to upgrade the Apache Helix version from 0.9.8 to 1.0+ in the Pinot repo. This will not only address the issues we’ve seen in the current 0.9.8 release (which have been fixed in Helix 1.0+ and which the Helix project does not plan to back-port to 0.9.x), but also provide opportunities to build more Pinot features on top of the new features in Helix 1.0+. An [initial attempt](https://github.com/apache/pinot/pull/7500) at moving to 1.0 surfaced some test failures in Pinot, so this time we’ve sought help from the Helix team.
   
   ### What we get from Helix 1.0+
   Here are a few items that can be addressed after bumping up the Helix 
version to 1.0+.
   **ZNRecord serialization issue**
   In the `serialize()` method of [ZNRecordSerializer](https://github.com/apache/pinot/pull/7500), an expensive Jackson `ObjectMapper` is constructed on every call; this is already fixed in Helix 1.0+.
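   The fix boils down to constructing the mapper once and reusing it, since a configured Jackson `ObjectMapper` is thread-safe. A minimal sketch of the two patterns, using a stand-in for the expensive mapper (this is illustrative only, not the actual Helix code):

   ```java
   // Illustrative sketch: ExpensiveMapper stands in for Jackson's ObjectMapper,
   // whose construction is costly relative to a single serialize() call.
   class SerializerSketch {
       static int constructions = 0;

       static class ExpensiveMapper {
           ExpensiveMapper() { constructions++; }  // costly setup happens here
           byte[] write(String record) { return record.getBytes(); }
       }

       // Pre-1.0 pattern: a new mapper is built on every serialize() call.
       static byte[] serializeNaive(String record) {
           return new ExpensiveMapper().write(record);
       }

       // Helix 1.0+ pattern: one shared mapper, built once and reused.
       private static final ExpensiveMapper SHARED = new ExpensiveMapper();

       static byte[] serializeShared(String record) {
           return SHARED.write(record);
       }
   }
   ```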
   **Burst ZK write issue during broker startup in large clusters**
   In a large Pinot cluster with thousands of tables, a broker restart generates a burst of ZK access to the current states, and the Helix controller can take a long time (around 20 minutes) to compute the ideal state. The ideal-state computation algorithm is improved in later Helix 1.0+ releases.
   
   **Lack of pagination support**
   Because of the lack of pagination support in Helix 0.9.8, a huge number of ZNodes must be read from ZK into Pinot during startup, causing a large burst of ZK read and write access, especially in big clusters that maintain thousands of Pinot tables. This pain point can be addressed by the [Zk Client API pagination](https://github.com/apache/helix/wiki/Helix-ZkClient-API-to-Support-Getting-a-Large-Number-of-Children-Using-Pagination) feature in Helix 1.0+, which is needed to support Pinot clusters with a large number of tables and segments.
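   The general shape of paginated child reads is sketched below; the method and constant names are illustrative, not the actual Helix ZkClient API described on the wiki page above.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Illustrative sketch: read a huge child list in bounded pages instead of
   // one giant ZK call. The real Helix API differs; this only shows the pattern.
   class PaginationSketch {
       static final int PAGE_SIZE = 1000;

       static List<String> readAllPaged(List<String> zkChildren) {
           List<String> result = new ArrayList<>();
           for (int offset = 0; offset < zkChildren.size(); offset += PAGE_SIZE) {
               int end = Math.min(offset + PAGE_SIZE, zkChildren.size());
               // In real code, each iteration would be one bounded ZK request.
               result.addAll(zkChildren.subList(offset, end));
           }
           return result;
       }
   }
   ```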
   
   **State transition task prioritization**
   Currently, Helix tasks are picked up by participants based on their in-queue time, but in some scenarios tasks that were queued later need to be picked up first (due to constraints such as disk usage). In Pinot, the custom Helix state model "SegmentOnlineOfflineStateModel" is used for segment-level state transitions: the "OFFLINE->ONLINE" transition downloads a new segment to local disk, while "OFFLINE->DROPPED" deletes the segment from local disk. We have noticed that "OFFLINE->ONLINE" transitions are always processed before "OFFLINE->DROPPED" ones, which keeps the Pinot server busy downloading new segments and can fill up the disk. This [issue](https://github.com/apache/helix/issues/1889) will be addressed only in Helix 1.0+.
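   Conceptually, the prioritization Pinot needs looks like this: transitions that free disk ("OFFLINE->DROPPED") should be drained before transitions that consume disk ("OFFLINE->ONLINE"), rather than in strict in-queue order. The sketch below only illustrates that ordering; it is not the Helix messaging API:

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;
   import java.util.PriorityQueue;

   // Illustrative sketch: drain disk-freeing transitions before disk-consuming
   // ones, falling back to FIFO (in-queue time) within the same priority class.
   class TransitionOrderSketch {
       static class Transition {
           final String from, to;
           final long inQueueTimeMs;
           Transition(String from, String to, long inQueueTimeMs) {
               this.from = from; this.to = to; this.inQueueTimeMs = inQueueTimeMs;
           }
       }

       static List<Transition> drain(List<Transition> pending) {
           PriorityQueue<Transition> queue = new PriorityQueue<>(
               Comparator.<Transition>comparingInt(t -> "DROPPED".equals(t.to) ? 0 : 1)
                         .thenComparingLong(t -> t.inQueueTimeMs));
           queue.addAll(pending);
           List<Transition> order = new ArrayList<>();
           while (!queue.isEmpty()) order.add(queue.poll());
           return order;
       }
   }
   ```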
   
   ### What opportunities Helix 1.0+ provides
   Beyond the fixes above, several features in Helix 1.0+ can serve as building blocks for future Pinot features.
   
   **Leverage weight based rebalancer**
   The new [weight based 
rebalancing](https://github.com/apache/helix/wiki/Weight-aware-Globally-even-distribute-Rebalancer)
 algorithms can be added to Pinot to support features like weight-based segment 
assignment and weight-based routing assignment. 
   _Weight-based segment assignment_
   
   Right now, Pinot treats all segments as having the same weight; that assumption no longer needs to hold once we land on Helix 1.0+. With the new algorithm, newer Pinot segments, which tend to be queried more frequently than older ones, could be assigned to hardware with more resources (more RAM, larger SSDs, etc.), while older segments can be moved to cheaper hardware to reduce cost.
   _Weight-based broker routing assignment_
   
   Currently, all brokers with the same tag are treated identically, so queries with totally different query patterns may be routed to the same host. With the new algorithm, brokers with different resources can handle different kinds of query patterns.
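   The core idea behind both use cases can be sketched as greedy weight-aware placement: put each weighted item (segment or query workload) on the instance with the most remaining capacity. This is only an illustration of the principle, not Helix's actual rebalancer algorithm:

   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Illustrative sketch of weight-aware placement: greedily assign each
   // weighted item to the instance with the most headroom left.
   class WeightAwareSketch {
       static Map<String, List<String>> assign(Map<String, Integer> capacity,
                                               Map<String, Integer> itemWeight) {
           Map<String, Integer> remaining = new HashMap<>(capacity);
           Map<String, List<String>> assignment = new HashMap<>();
           for (String instance : capacity.keySet()) {
               assignment.put(instance, new ArrayList<>());
           }
           for (Map.Entry<String, Integer> item : itemWeight.entrySet()) {
               // Pick the instance with the largest remaining capacity.
               String best = Collections.max(remaining.entrySet(),
                                             Map.Entry.comparingByValue()).getKey();
               assignment.get(best).add(item.getKey());
               remaining.merge(best, -item.getValue(), Integer::sum);
           }
           return assignment;
       }
   }
   ```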
   
   **Leverage 
[FederatedZkClient](https://github.com/apache/helix/wiki/FederatedZkClient)**
   FederatedZkClient can maintain multiple ZK connections to different ZK realms, which gives Pinot the option of splitting a large Pinot cluster into multiple ones.
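   The concept can be illustrated as routing each ZK path to the connection for its realm; the class and method names below are hypothetical and do not reflect FederatedZkClient's actual API:

   ```java
   import java.util.Map;
   import java.util.NoSuchElementException;

   // Illustrative sketch only: one connection per ZK realm, with each request
   // routed by path prefix. FederatedZkClient's real API differs.
   class FederatedRoutingSketch {
       private final Map<String, String> prefixToRealm; // e.g. "/pinot-a" -> "zk-realm-1"

       FederatedRoutingSketch(Map<String, String> prefixToRealm) {
           this.prefixToRealm = prefixToRealm;
       }

       // Resolve which ZK realm (and hence which connection) serves this path.
       String realmFor(String zkPath) {
           for (Map.Entry<String, String> entry : prefixToRealm.entrySet()) {
               if (zkPath.startsWith(entry.getKey())) {
                   return entry.getValue();
               }
           }
           throw new NoSuchElementException("No ZK realm configured for " + zkPath);
       }
   }
   ```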
   
   Helix 1.0+ has been in use in production at LinkedIn for years and is considered stable by the Helix team. At LinkedIn, a wide variety of Pinot use cases cover all the scenarios that interact with Helix.
   
   ### Approach
   We’re going to follow the steps below to make the upgrade smooth. A step cannot proceed while any earlier step is blocked.
   
   - Step 1: Create a branch and change the Pinot source code in that branch to use Helix 1.x
   
   - Step 2: Deploy Pinot with Helix 1.x (from the branch) on LinkedIn staging 
and production environments and validate (this step may take a few weeks 
depending on the problems encountered)
   
   - Step 3: Merge the open source branch back to the master branch
   
   We’ll also update this issue with the status as each step completes.

