DISCLAIMER: I am not knowledgeable about the Native Client implementation, nor am I commenting specifically on the performance you are seeing, which can have many factors. However, in general...
Given you are performing "put" operations on a PR, then for consistency reasons, Geode is always going to "write" to the primary, on which ever member in the cluster hosts the primary for that particular PR (bucket). So, if the member containing the primary for the PR goes down, then I would expect it to take more time than a normal "put" when no member goes down. Essentially, the cluster is going to shuffle things around and possible rebalance the cluster in order to restore redundancy. When rebalancing, having collocated Regions could even further impact timing. When performing a "put" operation , having redundancy is not going to sustain or improve performance, if that was what you were expecting. In fact, it could even potentially negatively impact performance when a node goes down depending on the number of nodes and redundancy level. Finally, if you were testing "gets" vs "puts", then I'd expect very little if any noticeable impact on performance, since you are using redundant copies, which should fail over in the case of a node failure. Refer to the following sections in the User Guide for specfics: 1) Rebalancing PR Data: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/rebalancing_pr_data.html (specifically, look at the section on 'How PR Rebalancing Works', which also talks about collocation). 2) Restoring Redundancy in PRs: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/restoring_region_redundancy.html 3) Review your settings for 'member-timeout'. Search for this Geode property here: https://geode.apache.org/docs/guide/113/reference/topics/gemfire_properties.html). 4) Also, be mindful of the PR's 'recovery delay': https://geode.apache.org/releases/latest/javadoc/org/apache/geode/cache/PartitionAttributes.html#getRecoveryDelay-- There may be other server-side (cluster-wide) settings you can configure for node failures as well that I am not recalling off the top of my head. Hope this helps, -j ________________________________ From: Mario Salazar de Torres <mario.salazar.de.tor...@est.tech> Sent: Saturday, November 21, 2020 2:16 AM To: dev@geode.apache.org <dev@geode.apache.org> Cc: miguel.g.gar...@ericsson.com <miguel.g.gar...@ericsson.com> Subject: Requests taking too long if one member of the cluster fails Hi, I've been looking into the following issue: "Whenever performing a stress test on a Geode cluster and forcefully killing one of the members, all the threads in the application get stuck". To give more context these are the conditions under the test is performed: * A cluster is deployed with: * 2 locators. * 3 servers. * 2 partitioned regions are created and collocated with a third one (from now on called the "anchor"). * Also, regions have a single redundant copy configured. * Whether or not to enable persistence on these regions do not affect to the test outcome. * Note that we've configured a PartitionResolver for both of these regions. * A geode-native test application is spin up with 20 threads sending a pack of 1 put request to each of the partitioned regions regions (except for the "anchor"), all of that within a transaction. 
See the example below to illustrate the kind of traffic sent (where tx_manager is the cache's CacheTransactionManager):

void thread() {
  while (true) {
    auto common_prefix = std::to_string(std::time(nullptr));
    tx_manager->begin();
    for (const auto& region_name : {"region_a", "region_b"}) {
      auto key = "key-" + common_prefix + "|" + std::to_string(std::rand());
      auto value = std::to_string(std::rand());
      cache->getRegion(region_name)->put(key, value);
    }
    tx_manager->commit();
  }
}

The test consists of:

* Spinning up the cluster.
* Running the application.
* Forcefully restarting one of the servers (from now on called "server-0") using kill -KILL <PID>, and afterwards starting it up again with gfsh.

The expectation for this test is that, given the data has a redundant copy and 2 servers are up and running at all times, writes should be handled smoothly. However, what actually happens is that all the application threads end up being stuck.

So, in the process of troubleshooting, we noticed there were several deadlocks in the geode-native client, which resulted in the following PRs:

* https://github.com/apache/geode-native/pull/660
* https://github.com/apache/geode-native/pull/676
* https://github.com/apache/geode-native/pull/699

After solving all the deadlocks on the client side, we were still seeing the same outcome in the test. So, after more digging, this is what we noticed:

* Once the server is killed, geode-native removes the server endpoint from the ClientMetadataService.
* But given that put requests can only be executed on the server holding the primary copy, these requests end up being proxied towards the server that was just killed.
* As it takes some time for the cluster members to notice that another member is down, requests proxied through the "healthy" servers take longer than expected, somewhere between 5 and 30 seconds (see the sketch below).
* So, in the end, all the threads are stuck for this interval of time, because the servers they are contacting are in turn contacting "server-0".

For the sake of clarity I've attached a diagram illustrating the test scenario. Let me know any additional clarifications you might need to understand the test itself.
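As a reference for the discussion, here is a minimal sketch of how the put loop could be bounded on the client side, so that a thread gives up after roughly the pool's read timeout and retry attempts instead of hanging until the cluster detects the dead member. This is illustrative only, not what the test application currently does; the tryPutOnce name is made up, and the exception types and the exists() guard assume the current geode-native API:

#include <iostream>
#include <string>

#include <geode/Cache.hpp>
#include <geode/CacheTransactionManager.hpp>
#include <geode/ExceptionTypes.hpp>
#include <geode/Region.hpp>

using namespace apache::geode::client;

// Performs one transactional put per region and rolls back on failure or
// timeout, so the calling thread is not left stuck indefinitely.
bool tryPutOnce(Cache& cache, const std::string& key, const std::string& value) {
  auto tx_manager = cache.getCacheTransactionManager();
  try {
    tx_manager->begin();
    for (const auto& region_name : {"region_a", "region_b"}) {
      cache.getRegion(region_name)->put(key, value);
    }
    tx_manager->commit();
    return true;
  } catch (const TimeoutException& ex) {
    // The target server did not answer within the pool's read timeout.
    if (tx_manager->exists()) tx_manager->rollback();
    std::cerr << "put timed out: " << ex.getMessage() << std::endl;
  } catch (const Exception& ex) {
    // Any other client-side failure; roll back if the transaction is still open.
    if (tx_manager->exists()) tx_manager->rollback();
    std::cerr << "transaction failed: " << ex.getMessage() << std::endl;
  }
  return false;
}

Of course, this would only bound how long each thread blocks; it would not explain why the "healthy" servers keep proxying requests to "server-0" in the first place, which is the behavior we are trying to understand.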
And now, my questions are:

* Have you encountered this behavior before? And if so, how did you solve it?
* Is this the expected behavior? And if so, what's the point of having a cluster of several members with partitioned, redundant data?

Sorry for the long read, and thanks for any help you can throw in.

BR,
Mario.