Hi Bruce,

Thanks a lot for your answer. We had not thought about changes to distributed algorithms when analyzing rolling downgrades.
Rolling downgrade is an important requirement for our customers, so we would not like to close the discussion here. Instead, we would like to explore whether it is still reasonable to propose it for Geode, perhaps relaxing the expectations a bit and clarifying some things.

First, I think supporting rolling downgrade does not mean making it impossible to upgrade distributed algorithms. It means that the upgraded version must support both the new and the old algorithm (just as is done today with rolling upgrades), and also allow switching back to the old algorithm in a fully upgraded system (see the sketch after these points).

Second, upgrading a distributed algorithm is not very common; at least, it does not seem to have happened often in Geode so far. The burden of adding logic to support rolling downgrade would therefore not be carried in every release. In my opinion, it would be some extra percentage of work on top of supporting the rolling upgrade of the algorithm, since the rolling downgrade would most likely reuse the mechanisms implemented for the rolling upgrade.

Third, we do not need to support rolling downgrade from any release to any other older release. We could support it (at least when distributed algorithms have changed) only between consecutive versions. These could be treated as special cases, like those that require providing a tool to convert files in order to ensure compatibility.
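To illustrate the first point, here is a rough sketch (illustrative names and version threshold, not actual Geode code) of how the choice between the old and the new "elder" selection could be gated on the lowest version present in the membership view:

    // Rough sketch, not actual Geode code: keep both elder-selection rules
    // and pick one based on the minimum version ordinal in the current view.
    // Assumptions: views are ordered oldest member first, and the threshold
    // ordinal below is invented for illustration.
    import java.util.List;

    class ElderSelector {
      record Member(String id, int versionOrdinal, boolean isLocator) {}

      static final int LOCATORS_CAN_BE_ELDER = 75; // invented threshold

      static Member selectElder(List<Member> view) {
        int minOrdinal = view.stream()
            .mapToInt(Member::versionOrdinal)
            .min()
            .orElseThrow();
        // While any member runs a pre-change version, every member applies
        // the old rule, so old and new members agree on who the elder is.
        if (minOrdinal < LOCATORS_CAN_BE_ELDER) {
          return oldElder(view);
        }
        return newElder(view);
      }

      // Old rule: the oldest member that is not a locator.
      static Member oldElder(List<Member> view) {
        return view.stream().filter(m -> !m.isLocator()).findFirst().orElseThrow();
      }

      // New rule: simply the oldest member, locators included.
      static Member newElder(List<Member> view) {
        return view.get(0);
      }
    }

With something like this in place, reintroducing an older member during a downgrade would simply make the cluster agree on the old rule again, which is exactly the behavior a rolling downgrade needs.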
-Alberto

________________________________
From: Bruce Schuchardt <bschucha...@pivotal.io>
Sent: Thursday, April 16, 2020 5:04 PM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: About Geode rolling downgrade

-1

Another reason that we should not support rolling downgrade is that it makes it impossible to upgrade distributed algorithms.

When we added rolling upgrade support we pretty much immediately ran into a distributed hang when a test started a Locator using an older version. In that release we also introduced the cluster configuration service, and along with that we needed to upgrade the distributed lock service's notion of the "elder" member of the cluster. Prior to that change a Locator could not fill this role, but the CCS needed to be able to use locking and needed a Locator to be able to fill this role. During the upgrade we used the old "elder" algorithm, but once the upgrade was finished we switched to the new algorithm. If you introduced an older Locator into this upgraded cluster, it wouldn't think that it should be the "elder", but the rest of the cluster would expect it to be the elder.

You could support rolling downgrade in this scenario with extra logic and extra testing, but I don't think that will always be the case. Rolling downgrade support would place an immense burden on developers in extra development and testing in order to ensure that older algorithms could always be brought back on-line.

On 4/16/20, 4:24 AM, "Alberto Gomez" <alberto.go...@est.tech> wrote:

Hi,

Some months ago I posted a question on this list (see [1]) about the possibility of supporting "rolling downgrade" in Geode, in order to downgrade a Geode system to an older version, similar to the "rolling upgrade" currently supported.

With your answers and my investigations, my conclusion was that the main stumbling block to supporting "rolling downgrades" was the compatibility of persistent files, which is very hard to achieve because old members would have to be able to read newer versions of the persistent files.

We have come up with a new approach to support rolling downgrades in Geode, which consists of the following procedure:

- For each locator:
  - Stop the locator
  - Remove the locator's files
  - Start the locator in the older version
- For each server:
  - Stop the server
  - Remove the server's files
  - Revoke the missing disk stores for the server
  - Start the server in the older version

Some extra details about this procedure:

- The starting and stopping of processes may not be possible with gfsh, as gfsh cannot manage members running a version different from its own.
- Redundancy is required on servers.
- More than one locator is required.
- The allow_old_members_to_join_for_testing flag needs to be passed to the members (see the sketch at the end of this message).

I would like to ask two questions regarding this procedure:

- Do you see any issue not considered by this procedure, or any alternative to it?
- Would it be reasonable to make the "allow_old_members_to_join_for_testing" parameter public (with a new name) so that it could be a valid option for production systems to support, for example, the procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.

[1] http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%3cb080e98c-5df4-e494-dcbd-383f6d979...@est.tech%3E
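To make the last bullet of the extra details concrete, here is a minimal sketch of starting a downgraded member with the flag enabled. It assumes the flag is consumed as a JVM system property with the usual "gemfire." prefix; the exact spelling, and the member and locator settings shown, are illustrative assumptions.

    // Minimal sketch: start a server under the OLDER Geode version with the
    // discussed flag set, so it can join a cluster that still contains newer
    // members during the rolling downgrade window.
    // Assumption: the flag is read as a JVM system property with the usual
    // "gemfire." prefix; the exact spelling may differ, and the name would
    // presumably change if the flag were made public.
    import org.apache.geode.distributed.ServerLauncher;

    public class StartDowngradedServer {
      public static void main(String[] args) {
        System.setProperty("gemfire.allow_old_members_to_join_for_testing", "true");

        ServerLauncher launcher = new ServerLauncher.Builder()
            .setMemberName("server1")                 // illustrative member name
            .set("locators", "locator-host[10334]")   // point at the running locators
            .build();

        // Must be run with the older-version Geode jars on the classpath.
        launcher.start();
      }
    }

If gfsh of the matching version is used instead, the same property could presumably be passed with --J=-Dgemfire.allow_old_members_to_join_for_testing=true on the start server and start locator commands.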