Re: Colocated regions missing some buckets after restart
Hi Anil,

From the server logs we see that some threads are stuck, and on server2 we continuously get the following message (the bucket is missing on server2 for the DfSessions region):

[warn 2020/09/15 14:25:39.852 CEST tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []

And on the other server, server1:

[warn 2020/09/15 14:25:40.852 CEST tid=0xdf] 15 seconds have elapsed while waiting for replies: :41003]> on 192.168.0.145(server1:28031):41002 whose current membership list is: [[192.168.0.145(locator1:27244:locator):41000, 192.168.0.145(locator2:27343:locator):41001, 192.168.0.145(server1:28031):41002, 192.168.0.145(server2:28054):41003]]
[warn 2020/09/15 14:27:20.200 CEST tid=0x11] Thread 223 (0xdf) is stuck
[warn 2020/09/15 14:27:20.202 CEST tid=0x11] Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for <115.361 seconds> and number of thread monitor iteration <1> Thread Name state ...

It seems that this is not a problem with the stats. We suspect that the problem is with some lock, but we need to investigate it a bit more.

BR,
Mario

From: Anilkumar Gingade
Sent: 15 September 2020 16:36
To: dev@geode.apache.org
Subject: Re: Colocated regions missing some buckets after restart

Mario,

I doubt this has anything to do with the client connections. If it did, it would be a server/member to server/member connection problem; in that case the unresponsive member is kicked out from the cluster.

The recommended configuration is to have persistent regions for both parent and co-located regions (and replicated regions)...

There could be issues in the stats too... Can you try executing test/validation code on the server side to dump/list the primary and secondary buckets? You can do that using helper methods such as (a sketch follows below this message):

pr.getDataStore().getAllLocalPrimaryBucketIds();

-Anil
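A minimal sketch of such a server-side dump, written as a Geode function that could be deployed to each server. The cast to the internal PartitionedRegion class is what makes getDataStore() reachable; getAllLocalBucketIds(), the class name, and the region name are assumptions added here for illustration, not something quoted in the thread:

import java.util.Set;

import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.internal.cache.PartitionedRegion;

// Hypothetical diagnostic function: logs which bucket ids this member hosts
// for a partitioned region, and which of them it currently considers primary.
public class DumpBucketsFunction implements Function<String> {

  @Override
  public void execute(FunctionContext<String> context) {
    String regionName = context.getArguments(); // e.g. "DfSessions"

    // Internal (non-public) API, as in the helper method quoted above.
    PartitionedRegion pr =
        (PartitionedRegion) context.getCache().getRegion(regionName);

    Set<Integer> hosted = pr.getDataStore().getAllLocalBucketIds();
    Set<Integer> primaries = pr.getDataStore().getAllLocalPrimaryBucketIds();

    context.getCache().getLogger().info(
        "Region " + regionName
            + ": hosted buckets=" + hosted.size() + " " + hosted
            + ", primary buckets=" + primaries.size() + " " + primaries);

    context.getResultSender().lastResult(hosted.size() + "/" + primaries.size());
  }

  @Override
  public String getId() {
    return "DumpBucketsFunction";
  }
}

Executing it on all members (for example via FunctionService.onMembers()) and comparing the output against the bucketCount / primaryBucketCount statistics shown further down would show whether the stats or the bucket hosting itself is wrong.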
On 9/14/20, 12:25 AM, "Mario Kevo" wrote:

Hi,

This problem is usually seen on only 1 server. The other servers' metrics and bucket counts look fine.

Another symptom of this issue is that the max-connections limit is reached on the problematic server if we have a client that tries to reconnect after the server restart. Clients simply get no response from the server, so they try to close the connection, but the connection close is not acknowledged by the server. On the server side we see that the connections are in CLOSE-WAIT state with packets in the socket receive queue. It's as if the servers just stopped processing packets on the sockets while waiting for a member with the primary bucket.

So in short, each new client connection is "unresponsive". The client tries to close it and open a new one, but the socket doesn't get closed on the server side and the connection is left "hanging" on the server. Clients will keep trying this until max-connections is reached on the servers. This is why we would be unable to add any data to the regions. But IMHO it's really not dependent on adding data, since this issue happens occasionally (1 out of ~4 restarts) and only on one server.

The initial problem was observed with a persistent region A (with 1 key-value pair inserted) and a non-persistent region B collocated with region A. We did some tests with both regions being persistent. We haven't observed the same issue yet (although we did only a few restarts), but we observed something that also looks quite worrying. Both servers start up without reporting issues in the logs. But, looking at the server metrics, one server has wrong information about "bucketCount" and is missing primary buckets. E.g.:

First server:

Partition | putLocalRate        | 0.0
          | putRemoteRate       | 0.0
          | putRemoteLatency    | 0
          | putRemoteAvgLatency | 0
          | bucketCount         | 113
          | primaryBucketCount  | 57

Second server:

Partition | putLocalRate        | 0.0
          | putRemoteRate       | 0.0
          | putRemoteLatency    | 0
          | putRemoteAvgLatency | 0
          | bucketCount         | 111
          | primaryBucketCount  | 55

So we are missing a primary bucket without being aware of the issue.

BR,
Mario

From: Anilkumar Gingade
Sent: 11 September 2020 20:34
To: dev@geode.apache.org
Subject: Re: Colocated regions missing some buckets after restart

Are you seeing no-buckets for persistent regions or non-persistent ones? The buckets are created dynamically, when data is added to the corresponding buckets... When a server is restarted, in the case of in-memory regions, as the data is not there, the bucket region may not have been created (my suspicion). Can you try adding data and see if the co-located bucket region gets created?
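To illustrate the two suggestions above, persistence on both the parent and the co-located region, and adding data so that the co-located bucket regions get created, here is a minimal sketch using the server-side Java API. The region names, key/value types, and the choice of the API over cache.xml are assumptions for illustration:

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.PartitionAttributesFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class ColocatedRegionsExample {

  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Parent partitioned region, persistent (as recommended above).
    Region<String, String> regionA = cache
        .<String, String>createRegionFactory(RegionShortcut.PARTITION_PERSISTENT)
        .create("regionA");

    // Co-located region: also persistent, and colocated with the parent so
    // that matching buckets are hosted on the same members.
    PartitionAttributesFactory<String, String> paf = new PartitionAttributesFactory<>();
    paf.setColocatedWith("regionA");

    Region<String, String> regionB = cache
        .<String, String>createRegionFactory(RegionShortcut.PARTITION_PERSISTENT)
        .setPartitionAttributes(paf.create())
        .create("regionB");

    // Buckets (and their co-located counterparts) are created on demand when
    // data is added, which is what the suggestion above is probing.
    regionA.put("key-1", "value-1");
    regionB.put("key-1", "value-1");
  }
}

With both regions persistent, the bucket layout is recovered from disk on restart instead of depending on new puts to recreate in-memory bucket regions, which is the point of the recommended configuration above.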
Re: Colocated regions missing some buckets after restart
Mario,

Take a thread dump a couple of times, at an interval of a minute... See if you can find threads stuck in region creation... This will show if there is any lock contention.

-Anil
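For taking the dumps, besides running jstack against the server processes, here is a minimal in-process sketch using only the JDK's ThreadMXBean (not a Geode API; the class name and interval are illustrative) that prints all thread stacks and reports any JVM-detected deadlocks:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

  /** Prints a full thread dump and any JVM-detected deadlocks to stdout. */
  public static void dump() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    // Stack traces plus lock/monitor information for every live thread.
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      System.out.print(info); // ThreadInfo.toString() includes state and (truncated) stack
    }

    // Threads deadlocked on monitors or java.util.concurrent locks, if any.
    long[] deadlocked = threads.findDeadlockedThreads();
    if (deadlocked != null) {
      for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
        System.out.println("DEADLOCKED: " + info);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Take a couple of dumps a minute apart, as suggested above.
    for (int i = 0; i < 2; i++) {
      dump();
      Thread.sleep(60_000);
    }
  }
}

Comparing consecutive dumps should show whether the stuck threads reported in the logs are waiting on the same lock in the region/bucket creation path each time.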