Re: About Geode rolling downgrade

alberto.gomez Mon, 20 Apr 2020 09:14:35 -0700

Hi,

I agree that if we want to support limited rolling downgrade, some other 
version interchange would be needed and extra tests would be required.


Nevertheless, this could be done using gfsh or with a startup parameter. For 
example, in the UDP messaging case you mentioned, a command such as 
"enable UDP messaging" could put the system back into a state equivalent to 
"upgrade in progress but not yet completed", which would allow old members to 
join again.
I guess each case would have its particularities, but they should not 
involve a lot of effort because most of the mechanisms needed (the ones that 
allow old and new members to coexist) will already have been developed for the 
rolling upgrade.

Anyhow, what is, as of today, the recommended or official way to 
downgrade a Geode system without downtime or data loss?


________________________________
From: Bruce Schuchardt <[email protected]>
Sent: Friday, April 17, 2020 11:36 PM
To: [email protected] <[email protected]>
Subject: Re: About Geode rolling downgrade

Hi Alberto,

I think that if we want to support limited rolling downgrade some other version 
interchange needs to be done and there need to be tests that prove that the 
downgrade works.  That would let us document which versions are compatible for 
a downgrade and enforce that no-one attempts it between incompatible versions.

For instance, there is work going on right now that introduces communications 
changes to remove UDP messaging.  Once rolling upgrade completes it will shut 
down unsecure UDP communications.  At that point there is no way to go back.  
If you tried it the old servers would try to communicate with UDP but the new 
servers would not have UDP sockets open for security reasons.

As a side note, clients would all have to be rolled back before starting in on 
the servers.  Clients aren't equipped to talk to an older version server, and 
servers will reject the client's attempts to create connections.

On 4/17/20, 10:14 AM, "Alberto Gomez" <[email protected]> wrote:

    Hi Bruce,

    Thanks a lot for your answer. We had not thought about the changes in 
distributed algorithms when analyzing rolling downgrades.

    Rolling downgrade is a pretty important requirement for our customers, so we 
would rather not close the discussion here. Instead, we would like to see 
whether it is still reasonable to propose it for Geode, perhaps relaxing the 
expectations a bit and clarifying some things.

    First, I think supporting rolling downgrade does not mean making it 
impossible to upgrade distributed algorithms. It means that you need to support 
the new and the old algorithms (just as it is done today with rolling upgrades) 
in the upgraded version and also support the possibility of switching to the 
old algorithm in a fully upgraded system.

    Second, I would say it is not very common to upgrade distributed 
algorithms, or at least it does not seem to have been the case so far in 
Geode. Therefore, the burden of adding the logic to support rolling 
downgrade would not have to be carried in every release. In my opinion, 
it would add some extra percentage of work on top of the work needed to support 
the rolling upgrade of the algorithm, since the rolling downgrade would 
probably reuse the mechanisms implemented for the rolling upgrade.

    Third, we do not need to support rolling downgrade from any release to 
any older release. We could support rolling downgrade (at least 
when distributed algorithms are changed) only between consecutive versions. 
These could be treated as special cases, like those in which a tool must be 
provided to convert files in order to ensure compatibility.
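
To make the "consecutive versions only" rule concrete, here is a minimal 
sketch in Python. The function name downgrade_allowed and the release list 
are illustrative assumptions; nothing like this exists in Geode today.

```python
from typing import Sequence


def downgrade_allowed(current: str, target: str, releases: Sequence[str]) -> bool:
    """Return True only when `target` is the release immediately before `current`.

    `releases` is an assumed, oldest-first list of released versions.
    """
    if current not in releases or target not in releases:
        return False
    return releases.index(current) - releases.index(target) == 1


if __name__ == "__main__":
    # Illustrative release list, oldest first.
    releases = ["1.10.0", "1.11.0", "1.12.0"]
    print(downgrade_allowed("1.12.0", "1.11.0", releases))  # consecutive versions
    print(downgrade_allowed("1.12.0", "1.10.0", releases))  # skips a release
```

A check of this kind would be one way to enforce that nobody attempts a 
downgrade between versions that have not been tested together.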

    -Alberto


    ________________________________
    From: Bruce Schuchardt <[email protected]>
    Sent: Thursday, April 16, 2020 5:04 PM
    To: [email protected] <[email protected]>
    Subject: Re: About Geode rolling downgrade

    -1

    Another reason that we should not support rolling downgrade is that it 
makes it impossible to upgrade distributed algorithms.

    When we added rolling upgrade support we pretty much immediately ran into a 
distributed hang when a test started a Locator using an older version.  In that 
release we also introduced the cluster configuration service and along with 
that we needed to upgrade the distributed lock service's notion of the "elder" 
member of the cluster.  Prior to that change a Locator could not fill this 
role, but the CCS needed to be able to use locking and needed a Locator to be 
able to fill this role.  During upgrade we used the old "elder" algorithm but 
once the upgrade was finished we switched to the new algorithm.  If you 
introduced an older Locator into this upgraded cluster it wouldn't think that 
it should be the "elder" but the rest of the cluster would expect it to be the 
elder.
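
The disagreement described above can be illustrated with a toy model in 
Python. This is not Geode's lock service code: the Member type, the 
oldest-first membership view, and both functions are assumptions made only to 
show how two versions of the rule can pick different elders from the same view.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Member:
    name: str
    is_locator: bool


def old_elder(view: List[Member]) -> Optional[Member]:
    """Assumed old rule: the oldest member that is not a Locator."""
    return next((m for m in view if not m.is_locator), None)


def new_elder(view: List[Member]) -> Optional[Member]:
    """Assumed new rule: the oldest member, Locators included."""
    return view[0] if view else None


if __name__ == "__main__":
    # Assumed view, oldest member first; the Locator runs the older version.
    view = [Member("locator1", True), Member("server1", False)]
    print("old Locator's choice:", old_elder(view).name)
    print("upgraded members' choice:", new_elder(view).name)
```

In this model the two rules disagree whenever a Locator is at the head of the 
view, so members running different versions wait on different elders, which is 
the kind of mismatch that can produce a distributed hang.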

    You could support rolling downgrade in this scenario with extra logic and 
extra testing, but I don't think that will always be the case.  Rolling 
downgrade support would place an immense burden on developers in extra 
development and testing in order to ensure that older algorithms could always 
be brought back on-line.

    On 4/16/20, 4:24 AM, "Alberto Gomez" <[email protected]> wrote:

        Hi,

        Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
        With your answers and my investigations, my conclusion was that the main 
stumbling block to supporting "rolling downgrades" was the compatibility of 
persistent files, which is very hard to achieve because old members would 
need to be prepared to support newer versions of the persistent files.

        We have come up with a new approach to support rolling downgrades in 
Geode which consists of the following procedure:

        - For each locator:
          - Stop locator
          - Remove locator files
          - Start locator in older version

        - For each server:
          - Stop server
          - Remove server files
          - Revoke missing-disk-stores for server
          - Start server in older version

        Some extra details about this procedure:
        - Starting and stopping the processes may not be possible with gfsh, 
since gfsh does not allow managing members that run a version different from 
its own.
        - Redundancy in servers is required
        - More than one locator is required
        - The allow_old_members_to_join_for_testing parameter needs to be 
passed to the members.
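
The procedure and its preconditions can be sketched as a small Python planner 
that only produces the ordered list of steps. The Cluster type, its fields, 
and the step wording are hypothetical; the sketch does not call gfsh or any 
Geode API, and how each step is actually executed is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    locators: List[str]
    servers: List[str]
    redundant_copies: int  # hypothetical field: redundancy configured for the data


def downgrade_plan(cluster: Cluster) -> List[str]:
    """Return the ordered steps of the proposed rolling downgrade."""
    # Preconditions taken from the extra details listed above.
    if len(cluster.locators) < 2:
        raise ValueError("more than one locator is required")
    if cluster.redundant_copies < 1:
        raise ValueError("redundancy in servers is required")

    steps: List[str] = []
    # Locators are downgraded first, one at a time.
    for name in cluster.locators:
        steps += [
            f"stop locator {name}",
            f"remove files of locator {name}",
            f"start locator {name} in older version",
        ]
    # Then servers, one at a time.
    for name in cluster.servers:
        steps += [
            f"stop server {name}",
            f"remove files of server {name}",
            f"revoke missing disk stores of server {name}",
            f"start server {name} in older version",
        ]
    return steps


if __name__ == "__main__":
    for step in downgrade_plan(Cluster(["loc1", "loc2"], ["srv1", "srv2"], 1)):
        print(step)
```

Handling one member at a time is what keeps the remaining locators and the 
redundant copies available while each process is down.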

        I would like to ask two questions regarding this procedure:
        - Do you see any issue not considered by this procedure or any 
alternative to it?
        - Would it be reasonable to make the 
"allow_old_members_to_join_for_testing" parameter public (with a new name) so 
that it becomes a valid option for production systems to support, for example, 
the proposed procedure?

        Thanks in advance for your answers.

        Best regards,

        -Alberto G.


        [1]
         
http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%[email protected]%3E
