Re: About Geode rolling downgrade

Bruce Schuchardt Thu, 16 Apr 2020 08:05:04 -0700

-1

Another reason that we should not support rolling downgrade is that it makes it 
impossible to upgrade distributed algorithms.


When we added rolling upgrade support we pretty much immediately ran into a
distributed hang when a test started a Locator using an older version.  In that
release we also introduced the cluster configuration service (CCS), and along
with that we needed to upgrade the distributed lock service's notion of the
"elder" member of the cluster.  Prior to that change a Locator could not fill
this role, but the CCS needed to be able to use locking and needed a Locator to
be able to fill this role.  During upgrade we used the old "elder" algorithm,
but once the upgrade was finished we switched to the new algorithm.  If you
introduced an older Locator into this upgraded cluster, it wouldn't think that
it should be the "elder", but the rest of the cluster would expect it to be the
elder.
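The disagreement described above can be illustrated with a deliberately simplified model. This is not Geode's distributed lock service code. The `Member` record, the two rules, and the assumption that the elder is simply the first eligible member of the membership view (with the older Locator at the head of that view) are all hypothetical. They are chosen only to show how members that apply different elder rules to the same view reach different answers. Records require Java 16 or later.

```java
import java.util.List;
import java.util.Optional;

/**
 * Hypothetical, simplified model of "elder" selection (not Geode's actual code).
 * Assumption: the elder is the first eligible member of the membership view.
 */
public class ElderDisagreement {

    /** Illustrative member: whether it is a Locator and which elder rule its version uses. */
    record Member(String name, boolean isLocator, boolean usesNewElderRule) {}

    /** Old rule: a Locator can never be the elder. */
    static Optional<Member> oldRuleElder(List<Member> view) {
        return view.stream().filter(m -> !m.isLocator()).findFirst();
    }

    /** New rule: any member, including a Locator, can be the elder. */
    static Optional<Member> newRuleElder(List<Member> view) {
        return view.stream().findFirst();
    }

    /** Each member evaluates the rule of the version it is running. */
    static Optional<Member> elderAsSeenBy(Member observer, List<Member> view) {
        return observer.usesNewElderRule() ? newRuleElder(view) : oldRuleElder(view);
    }

    public static void main(String[] args) {
        Member oldLocator = new Member("locator-old", true, false);
        Member server1 = new Member("server1", false, true);
        Member server2 = new Member("server2", false, true);
        List<Member> view = List.of(oldLocator, server1, server2);

        for (Member observer : view) {
            String elder = elderAsSeenBy(observer, view).map(Member::name).orElse("none");
            System.out.println(observer.name() + " thinks the elder is " + elder);
        }
        // The members disagree: the servers direct elder requests to locator-old,
        // which does not consider itself the elder.
    }
}
```

Running `main` prints that locator-old thinks the elder is server1, while server1 and server2 both think the elder is locator-old. In this toy model, that mismatch is the kind of split that can leave elder-dependent operations such as lock requests waiting indefinitely.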

You could support rolling downgrade in this scenario with extra logic and extra
testing, but I don't think that will always be possible.  Rolling downgrade
support would place an immense burden on developers, in extra development and
testing, to ensure that older algorithms could always be brought back online.

On 4/16/20, 4:24 AM, "Alberto Gomez" <[email protected]> wrote:

    Hi,
    
    Some months ago I posted a question on this list (see [1]) about the
    possibility of supporting "rolling downgrade" in Geode, in order to downgrade
    a Geode system to an older version, similar to the "rolling upgrade" that is
    currently supported.
    Based on your answers and my own investigation, I concluded that the main
    stumbling block to supporting "rolling downgrades" was the compatibility of
    persistent files, which was very hard to achieve because old members would
    need to be prepared to support newer versions of persistent files.
    
    We have come up with a new approach to support rolling downgrades in Geode,
    which consists of the following procedure:
    
    - For each locator:
      - Stop locator
      - Remove locator files
      - Start locator in older version
    
    - For each server:
      - Stop server
      - Remove server files
      - Revoke missing-disk-stores for server
      - Start server in older version
    
    Some extra details about this procedure:
    - It may not be possible to start and stop the processes using gfsh, as
      gfsh does not allow managing members running a different version than
      its own.
    - Redundancy in servers is required.
    - More than one locator is required.
    - The allow_old_members_to_join_for_testing parameter needs to be passed to
      the members.
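The ordering and preconditions in the two lists above can be summarized as a sketch. Everything in it is hypothetical. `ClusterControl` and its methods stand in for whatever tooling actually stops and starts members (per the note above, gfsh may not work across versions). `startWithVersion` is assumed to also pass the allow_old_members_to_join_for_testing parameter. `redundantCopies` is assumed to be the servers' configured redundancy. The sketch encodes only the order of steps and the two preconditions from the proposal; it does not talk to a real cluster.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the proposed rolling-downgrade order.
 * ClusterControl is a placeholder, not a Geode API.
 */
public class RollingDowngradeSketch {

    /** Hypothetical control surface for stopping, cleaning and starting members. */
    interface ClusterControl {
        void stop(String member);
        void removeFiles(String member);
        void revokeMissingDiskStores(String server);
        // Assumed to pass allow_old_members_to_join_for_testing to the member.
        void startWithVersion(String member, String version);
    }

    static void downgrade(ClusterControl ctl, List<String> locators, List<String> servers,
                          int redundantCopies, String olderVersion) {
        // Preconditions taken from the proposal.
        if (locators.size() < 2) {
            throw new IllegalStateException("more than one locator is required");
        }
        if (redundantCopies < 1) {
            throw new IllegalStateException("redundancy in servers is required");
        }
        // Locators one at a time, so at least one locator keeps running.
        for (String locator : locators) {
            ctl.stop(locator);
            ctl.removeFiles(locator);
            ctl.startWithVersion(locator, olderVersion);
        }
        // Servers one at a time. After a server's files are removed, its disk
        // stores are revoked (assumption: so the remaining members stop expecting
        // that persisted data back); the restarted server relies on redundant copies.
        for (String server : servers) {
            ctl.stop(server);
            ctl.removeFiles(server);
            ctl.revokeMissingDiskStores(server);
            ctl.startWithVersion(server, olderVersion);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Recording fake: logs each call instead of touching a real cluster.
        ClusterControl recorder = new ClusterControl() {
            public void stop(String m) { log.add("stop " + m); }
            public void removeFiles(String m) { log.add("remove files " + m); }
            public void revokeMissingDiskStores(String s) { log.add("revoke missing disk stores " + s); }
            public void startWithVersion(String m, String v) { log.add("start " + m + " (" + v + ")"); }
        };
        downgrade(recorder, List.of("locator1", "locator2"), List.of("server1", "server2"),
                1, "older-version");
        log.forEach(System.out::println);
    }
}
```

Running `main` with the recording fake prints the sequence of operations, which makes the ordering easy to review. All locators are cycled before any server, and each server's missing disk stores are revoked before it is restarted in the older version.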
    
    I would like to ask two questions regarding this procedure:
    - Do you see any issue that this procedure does not consider, or any
      alternative to it?
    - Would it be reasonable to make the "allow_old_members_to_join_for_testing"
      parameter public (under a new name) so that it could be a valid option for
      production systems, for example to support the proposed procedure?
    
    Thanks in advance for your answers.
    
    Best regards,
    
    -Alberto G.
    
    
    [1] http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%[email protected]%3E
