lhotari commented on PR #668:
URL: 
https://github.com/apache/pulsar-helm-chart/pull/668#issuecomment-4092987511

   > @lhotari you're very active in the pulsar repos. I was wondering what you 
thought of this, and if you had any thoughts about how to make upgrades more 
controlled in the future (when using helm)?
   
   In the issue, you mentioned "Not having control restarts all components at 
once which renders a fully-operational cluster in a bad error state." I think 
that isn't expected when rolling restarts are performed; if it happens, it's a 
bug. Please share more details about the bad error state you end up in. I know 
of some bugs that could cause this. Sharing the Pulsar version would help to 
see whether newer versions contain a fix. The client version can also matter in 
some cases, so please share that too.
   
   In general, it would be useful to perform upgrades "slowly" so that each set 
of components is handled separately and upgraded before moving on to the next 
ones. For example, upgrading ZooKeeper, then BookKeeper, and finally Brokers & 
Proxies. The order doesn't matter that much since newer versions should always 
be able to talk to older component versions.
   
   Even when restarts aren't handled separately, the cluster shouldn't get 
into a bad error state unless the load from many simultaneous component 
restarts causes the system to collapse.
   
   When the Pulsar version is upgraded, one possible solution is to manage the 
images for the different components separately in values.yaml and not rely on 
the default that changes the image for all components at once. In that case, 
one would perform multiple Helm deployments while upgrading. This would work 
for cases where only the Pulsar image is upgraded. However, if the chart is 
upgraded, there could be many changes that impact multiple different components 
and cause them to restart.
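   
   As a rough illustration of the multi-deployment idea above, here is a 
minimal Python sketch that only builds the `helm upgrade` command lines, one 
per component group. The `images.<component>.tag` keys, the component names, 
the release and chart names, and the `values.yaml` path are assumptions for 
illustration; check them against the values.yaml of the chart version you 
actually deploy. The chart version is pinned with `--version` so that only the 
image tags change between steps.

```python
from typing import List, Sequence

# Assumed values.yaml keys of the form images.<component>.tag.
# Verify these names against the chart version you deploy.
DEFAULT_ORDER = ("zookeeper", "bookie", "broker", "proxy")


def staged_upgrade_commands(
    release: str,
    chart: str,
    chart_version: str,
    new_tag: str,
    values_file: str = "values.yaml",
    order: Sequence[str] = DEFAULT_ORDER,
) -> List[List[str]]:
    """Return one `helm upgrade` argv per component group.

    Step N sets the new tag for the first N components in `order`, so
    components upgraded in earlier steps keep the new image.
    """
    commands = []
    for step in range(1, len(order) + 1):
        argv = [
            "helm", "upgrade", release, chart,
            "--version", chart_version,  # pin the chart; only image tags change
            "-f", values_file,
            "--wait",  # ask Helm to wait until resources report ready
        ]
        for component in order[:step]:
            argv += ["--set", f"images.{component}.tag={new_tag}"]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Placeholder arguments; substitute real values before running anything.
    for argv in staged_upgrade_commands(
        "pulsar", "apache/pulsar", "<chart-version>", "<new-tag>"
    ):
        print(" ".join(argv))
```

   The sketch assumes the values file keeps the old tag as the default for 
components that have not been reached yet; otherwise the untouched components 
would restart as well. Printing the commands instead of running them leaves 
room to verify cluster health between steps.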
   
   This is just one possible approach. It would be great if you could 
contribute a section to the README.md file about handling upgrades in a 
controlled way and what problems it resolves.
   
   One known issue with brokers in a full rolling restart is that there's also 
a lot of shuffling due to load balancing. Bundles get moved across brokers 
resulting in disruptions in traffic for producers and consumers until the 
cluster stabilizes itself. This mainly matters at very high throughput / 
workload where resources aren't heavily over-provisioned.
   There has been a plan to address this problem with 
https://github.com/apache/pulsar/blob/master/pip/pip-192.md and 
https://github.com/apache/pulsar/blob/master/pip/pip-307.md. The implementation 
exists, but users have reported that it isn't very stable yet, and it would 
need more contributions to harden it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
