Re: Requests taking too long if one member of the cluster fails

Mario Salazar de Torres Sat, 21 Nov 2020 13:40:47 -0800

Thanks @John Blum for your detailed explanation! It helped me to better
understand how redundancy works.


The thing is that all our use cases require a really low response time when
performing operations.
Under normal conditions a "put" takes a few milliseconds, but when a cluster
member goes down in the described scenario, it might take up to 30 seconds,
sometimes even more.
One thing we've considered is setting a timeout on the client side, but the
retries would still face the same issue.
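
A minimal sketch of the client-side timeout idea, in plain standard C++ with no
geode-native calls. `put_with_deadline` and the `try_put` callable are
hypothetical names standing in for the real operation. It only illustrates that
a deadline bounds how long a thread waits; it does not make the retries succeed
any sooner.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not geode-native API): retries `try_put` until it
// succeeds or `budget` has elapsed. Returns true on success.
bool put_with_deadline(const std::function<bool()>& try_put,
                       std::chrono::milliseconds budget,
                       std::chrono::milliseconds backoff) {
  const auto deadline = std::chrono::steady_clock::now() + budget;
  do {
    if (try_put()) return true;            // the operation went through
    std::this_thread::sleep_for(backoff);  // wait before the next attempt
  } while (std::chrono::steady_clock::now() < deadline);
  return false;  // budget exhausted; the caller decides what to do next
}
```

While no new primary exists, every attempt fails in the same way, so the helper
returns false once the budget is spent.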

What I've noticed is that proxied requests won't stop being sent to the
failed server until:

  1.  One locator (I'd say the coordinator, please correct me if I am wrong
here) does a final health check towards the member and member-timeout elapses
without a successful ping.
  2.  One of the members holding a secondary copy of the buckets volunteers
to become primary owner.
  3.  That server becomes the primary owner.
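
The sequence above can be pictured with a toy model. This is only a sketch of
my understanding, not how Geode implements it: `Bucket` and
`on_member_removed` are hypothetical names, and the model ignores everything
except who owns the primary copy.

```cpp
#include <string>

// Toy model only: the names below are illustrative, not Geode internals.
struct Bucket {
  std::string primary;    // member that currently accepts puts
  std::string secondary;  // member holding the redundant copy
};

// Models steps 2 and 3. It is called only after step 1, once the member has
// been declared dead. Until then `primary` still names the failed server,
// so proxied puts keep being sent to it.
void on_member_removed(Bucket& bucket, const std::string& member) {
  if (bucket.primary == member) {
    bucket.primary = bucket.secondary;  // secondary becomes primary owner
    bucket.secondary.clear();           // no redundant copy is left for now
  } else if (bucket.secondary == member) {
    bucket.secondary.clear();
  }
}
```

In this picture, the time a put stays stuck is the time between the kill and
the call to `on_member_removed`, which is why lowering member-timeout helps.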

So, what I've tried here is to set a really low member-timeout, which results
in the server holding the secondary copy becoming the primary owner in under
600 ms. That's quite a huge improvement,
but I wanted to ask you whether setting member-timeout too low might carry
unforeseen consequences.

As for long-term solutions to remove or significantly reduce the impact
of a server failure, I've been thinking of the following:

  1.  At the last ApacheCon someone asked why a "put" is only done on servers
holding a primary copy, and why not remove this constraint. My question, out
of ignorance: is this such a crazy idea?
I've seen that Redis has something called partial re-sync to solve the
split-brain scenarios that would be caused by writing to the secondary while
the primary is down.
  2.  Another alternative I've been thinking of: upon a connection failure
(either a read ack IO exception or connection refused), there could be an
option that, if enabled, makes the server owning a secondary copy volunteer
to become primary owner straight away.

NOTE. I am more familiar with the native client, so please feel free to
correct me if I got something wrong, or if any of what I've written is a "bunch
of gibberish" that makes no sense at all 🙂
BTW. I am not sure that the test diagram was attached to the mail, so I've also 
uploaded it here: https://i.ibb.co/G7n6T0M/Geode-Server-Kill.jpg

Thanks again.
BR,
Mario
________________________________
From: John Blum <[email protected]>
Sent: Saturday, November 21, 2020 9:41 PM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Requests taking too long if one member of the cluster fails

DISCLAIMER: I am not knowledgeable about the Native Client (implementation) nor 
am I commenting specifically on the perf you are seeing, which can have many 
factors. However, in general...

Given you are performing "put" operations on a PR, then for consistency
reasons, Geode is always going to "write" to the primary, on whichever member
in the cluster hosts the primary for that particular PR (bucket).  So, if the
member containing the primary for the PR goes down, then I would expect it to
take more time than a normal "put" when no member goes down. Essentially, the
cluster is going to shuffle things around and possibly rebalance the cluster in
order to restore redundancy. When rebalancing, having collocated Regions could
even further impact timing.

When performing a "put" operation, having redundancy is not going to sustain
or improve performance, if that was what you were expecting. In fact, it could 
even potentially negatively impact performance when a node goes down depending 
on the number of nodes and redundancy level.

Finally, if you were testing "gets" vs "puts", then I'd expect very little if 
any noticeable impact on performance, since you are using redundant copies, 
which should fail over in the case of a node failure.

Refer to the following sections in the User Guide for specifics:

1) Rebalancing PR Data: 
https://geode.apache.org/docs/guide/113/developing/partitioned_regions/rebalancing_pr_data.html
 (specifically, look at the section on 'How PR Rebalancing Works', which also 
talks about collocation).

2) Restoring Redundancy in PRs: 
https://geode.apache.org/docs/guide/113/developing/partitioned_regions/restoring_region_redundancy.html

3) Review your settings for 'member-timeout'. Search for this Geode property
here:
https://geode.apache.org/docs/guide/113/reference/topics/gemfire_properties.html


4) Also, be mindful of the PR's 'recovery delay':
https://geode.apache.org/releases/latest/javadoc/org/apache/geode/cache/PartitionAttributes.html#getRecoveryDelay--


There may be other server-side (cluster-wide) settings you can configure for 
node failures as well that I am not recalling off the top of my head.

Hope this helps,

-j

________________________________
From: Mario Salazar de Torres <[email protected]>
Sent: Saturday, November 21, 2020 2:16 AM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Requests taking too long if one member of the cluster fails

Hi,

I've been looking into the following issue:
"Whenever performing a stress test on a Geode cluster and forcefully killing 
one of the members, all the threads in the application get stuck".

To give more context, these are the conditions under which the test is performed:

  *   A cluster is deployed with:
     *   2 locators.
     *   3 servers.
  *   2 partitioned regions are created and collocated with a third one (from 
now on called the "anchor").
     *   Also, regions have a single redundant copy configured.
     *   Whether or not persistence is enabled on these regions does not
affect the test outcome.
     *   Note that we've configured a PartitionResolver for both of these
regions.
  *   A geode-native test application is spun up with 20 threads, each sending
a batch of one put request to each of the partitioned regions (except for the
"anchor"), all of that within a transaction.
See the example below, which illustrates the kind of traffic sent:
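#include <cstdlib>
#include <ctime>
#include <string>

// Body of each application thread; `cache` and `tx_manager` are the
// geode-native cache and its transaction manager, created elsewhere.
void thread() {
  while (true) {
    // Keys written in the same transaction share a time-based prefix.
    const std::string common_prefix = std::to_string(std::time(nullptr));
    tx_manager->begin();
    for (const char* region_name : {"region_a", "region_b"}) {
      const std::string key =
          "key-" + common_prefix + "|" + std::to_string(std::rand());
      const std::string value = std::to_string(std::rand());
      cache->getRegion(region_name)->put(key, value);
    }
    tx_manager->commit();
  }
}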

The test consists of:

  *   Spinning up the cluster.
  *   Running the application.
  *   One of the servers (from now on called "server-0") is forcefully
killed with kill -KILL <PID> and then started up again with gfsh.

The expectation of this test is that given that data has a redundant copy, and 
we have 2 servers up and running all the time, then writing data should be 
handled smoothly.
However, what actually happens is that all application threads end up being 
stuck.

So, in the process of troubleshooting, we noticed that there were several
deadlocks in the geode-native client, which resulted in the following PRs:

  *   https://github.com/apache/geode-native/pull/660
  *   https://github.com/apache/geode-native/pull/676
  *   https://github.com/apache/geode-native/pull/699

After solving all the deadlocks on the client side, we were still seeing the
same outcome in the test.
So, after more digging, here is what we noticed:

  *   Once the server is killed, geode-native removes the server endpoint from 
the ClientMetadataService.
  *   But given that put requests can only be executed on the server holding
the primary copy, these requests ended up being proxied towards the server that
was just killed.
  *   As it takes some time for the cluster members to notice that other
members are down, requests proxied through "healthy" servers take longer than
expected: somewhere between 5 and 30 seconds.
  *   So, in the end, all the threads are stuck for this interval of time
because the servers they are contacting are, in turn, contacting "server-0".

For the sake of clarity I've attached a diagram demonstrating the test
scenario. Let me know if you need any additional clarification to
understand the test itself.

And now, my questions here are:

  *   Have you encountered this behavior before? And if so, how did you solve
it?
  *   Is this expected behavior? And if so, what's the point of having a 
cluster of several members with partitioned redundant data?

Sorry for the long read and thanks for any help you can throw in.

BR,
Mario.
