lhotari commented on PR #668: URL: https://github.com/apache/pulsar-helm-chart/pull/668#issuecomment-4092987511
> @lhotari you're very active in the pulsar repos. I was wondering what you thought of this, and if you had any thoughts about how to make upgrades more controlled in the future (when using helm)?

In the issue, you mentioned "Not having control restarts all components at once which renders a fully-operational cluster in a bad error state." I think that wouldn't be expected when rolling restarts are performed, and it's a bug. Please share more details of what type of bad error state you end up in. I know that there are some bugs that could cause this. Sharing the Pulsar version would help to see if there's a fix in newer versions. The client version could also matter in some cases. Sharing that would be helpful too.

In general, it would be useful to perform upgrades "slowly" so that each set of components is handled separately and upgraded before moving on to the next ones. For example, upgrading ZooKeeper, then BookKeeper, and finally Brokers & Proxies. The order doesn't matter that much since newer versions should always be able to talk to older component versions. Even without handling restarts separately, it shouldn't result in the cluster getting into a bad error state unless the high load causes the system to collapse when there are a lot of component restarts at once.

When the Pulsar version is upgraded, one possible solution is to manage the images for the different components separately in values.yaml and not rely on the default that changes the image for all components at once. In that case, one would perform multiple Helm deployments while upgrading. This would work for cases where only the Pulsar image is upgraded. However, if the chart is upgraded, there could be many changes that impact multiple different components and cause them to restart.

This is just one thought on some solutions. It would be great if you could contribute a section to the README.md file about handling upgrades in a controlled way and what problems it resolves.
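To make the staged approach above more concrete, here is a rough sketch in Python that prints one `helm upgrade` command per stage. It is only an illustration under several assumptions: the per-component keys are assumed to be `images.<component>.tag` (please verify against the chart's values.yaml), and the release name, chart reference, and version tags are hypothetical placeholders. It only covers the case where the Pulsar image changes, not a chart upgrade.

```python
from typing import Dict, List

# Order suggested in the comment: ZooKeeper, then BookKeeper, then brokers and proxies.
# Component names are assumed to match the keys under `images:` in values.yaml.
STAGES: List[List[str]] = [["zookeeper"], ["bookie"], ["broker", "proxy"]]


def staged_image_tags(old_tag: str, new_tag: str) -> List[Dict[str, str]]:
    """Return, for each stage, the image tag every component should run after that stage."""
    tags = {component: old_tag for stage in STAGES for component in stage}
    plan = []
    for stage in STAGES:
        for component in stage:
            tags[component] = new_tag
        plan.append(dict(tags))  # snapshot so later stages do not mutate earlier ones
    return plan


def helm_commands(release: str, chart: str, old_tag: str, new_tag: str) -> List[str]:
    """Build one `helm upgrade` command line per stage (strings only, nothing is executed)."""
    commands = []
    for tags in staged_image_tags(old_tag, new_tag):
        # Assumption: per-component tags are set via images.<component>.tag.
        sets = " ".join(f"--set images.{c}.tag={t}" for c, t in tags.items())
        commands.append(f"helm upgrade {release} {chart} --reuse-values {sets} --wait")
    return commands


if __name__ == "__main__":
    # Hypothetical release name, chart reference, and versions.
    for command in helm_commands("pulsar", "apache/pulsar", "4.0.0", "4.0.1"):
        print(command)
```

Every stage pins all components explicitly, so components that haven't been reached yet stay on the old tag instead of following the default image. One would still check that each set of components is healthy before running the next command.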
One known issue with brokers in a full rolling restart is that there's also a lot of shuffling due to load balancing. Bundles get moved across brokers, resulting in disruptions in traffic for producers and consumers until the cluster stabilizes itself. This mainly matters at very high throughput / workload where resources aren't heavily over-provisioned. There has been a plan to address this problem with https://github.com/apache/pulsar/blob/master/pip/pip-192.md and https://github.com/apache/pulsar/blob/master/pip/pip-307.md. The implementation exists, but experience so far suggests that it's not that stable yet and would require more contributions to harden it.
