[ 
https://issues.apache.org/jira/browse/GEODE-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518097#comment-17518097
 ] 

Jakov Varenina edited comment on GEODE-10215 at 4/6/22 12:50 PM:
-----------------------------------------------------------------

Hi [~agingade],

I have tried to troubleshoot this issue, but unfortunately, I haven't found the 
root cause yet. I could use some help on this, and if you are interested in 
helping, you can find the test case that reproduces the issue in the linked PR.

I have found the following so far:

When you configure a parallel gateway-sender on a partitioned region, the 
parallel gateway-sender queue is created as a shadow region colocated with the 
data partitioned region. Because of that, the set of primary bucket IDs of the 
data region is the same as the set of primary bucket IDs of the queue region, 
and the same holds for the secondary buckets. When you alter the region to 
remove the gateway-sender, you only remove the connection between the region 
and the gateway-sender; the parallel gateway-sender queue region still remains 
on the servers. Recreating the region with the same gateway-sender results in 
the new data buckets being distributed differently across the servers, which 
also affects the queue region buckets, since the two are tightly coupled. After 
this, the bucket advisor for the queue region no longer works correctly, 
causing some events not to be dispatched to the remote site.

Let's dig deeper into the logs:

Distribution of primary buckets the first time the region is created:
{code:java}
[vm6] Jale buckets: [1, 7, 11, 15, 17, 20, 27, 29, 34, 38, 42, 46, 48, 53, 59, 
60, 67, 70, 72, 76, 80, 85, 91, 92, 99, 103, 104, 108]
[vm7] Jale buckets: [3, 5, 8, 12, 19, 21, 24, 30, 32, 39, 40, 47, 49, 54, 56, 
61, 66, 71, 75, 79, 81, 84, 90, 94, 98, 101, 107, 111]
[vm10] Jale buckets:[0, 4, 9, 13, 16, 22, 26, 28, 35, 36, 43, 44, 51, 55, 57, 
63, 65, 68, 73, 78, 82, 86, 89, 93, 96, 102, 105, 109, 112]
[vm11] Jale buckets:[2, 6, 10, 14, 18, 23, 25, 31, 33, 37, 41, 45, 50, 52, 58, 
62, 64, 69, 74, 77, 83, 87, 88, 95, 97, 100, 106, 110]{code}
Distribution of primary buckets after the region is recreated:
{code:java}
[vm6] Jale buckets: [0, 4, 8, 14, 17, 20, 25, 27, 34, 36, 40, 46, 51, 54, 56, 
63, 66, 71, 72, 77, 82, 85, 91, 95, 97, 103, 107, 108]
[vm7] Jale buckets: [2, 5, 7, 11, 15, 21, 26, 28, 31, 38, 42, 44, 47, 50, 52, 
57, 60, 67, 70, 74, 79, 83, 87, 90, 93, 98, 101, 105, 110]
[vm10] Jale buckets: [3, 10, 13, 16, 22, 23, 30, 32, 37, 39, 43, 48, 55, 58, 
62, 64, 68, 73, 78, 81, 84, 88, 92, 99, 100, 106, 109, 112]
[vm11] Jale buckets: [1, 6, 9, 12, 18, 19, 24, 29, 33, 35, 41, 45, 49, 53, 59, 
61, 65, 69, 75, 76, 80, 86, 89, 94, 96, 102, 104, 111]{code}
The logs above show that the buckets are distributed differently after the 
region is re-created.
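To make the shift concrete, here is a small self-contained sketch (plain Java, 
with the primary-bucket distributions copied from the logs above) that computes 
which buckets changed their primary member between the two runs. Note in 
particular bucket 61, which moves from vm7 to vm11:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch: primary-bucket distributions copied from the
// logs above, turned into (bucketId -> owning VM) maps, one per run.
public class BucketDiff {

    static Map<Integer, String> owners(Map<String, int[]> byVm) {
        Map<Integer, String> result = new HashMap<>();
        byVm.forEach((vm, buckets) -> {
            for (int b : buckets) {
                result.put(b, vm);
            }
        });
        return result;
    }

    public static final Map<Integer, String> RUN1 = owners(Map.of(
        "vm6", new int[]{1, 7, 11, 15, 17, 20, 27, 29, 34, 38, 42, 46, 48, 53, 59,
            60, 67, 70, 72, 76, 80, 85, 91, 92, 99, 103, 104, 108},
        "vm7", new int[]{3, 5, 8, 12, 19, 21, 24, 30, 32, 39, 40, 47, 49, 54, 56,
            61, 66, 71, 75, 79, 81, 84, 90, 94, 98, 101, 107, 111},
        "vm10", new int[]{0, 4, 9, 13, 16, 22, 26, 28, 35, 36, 43, 44, 51, 55, 57,
            63, 65, 68, 73, 78, 82, 86, 89, 93, 96, 102, 105, 109, 112},
        "vm11", new int[]{2, 6, 10, 14, 18, 23, 25, 31, 33, 37, 41, 45, 50, 52, 58,
            62, 64, 69, 74, 77, 83, 87, 88, 95, 97, 100, 106, 110}));

    public static final Map<Integer, String> RUN2 = owners(Map.of(
        "vm6", new int[]{0, 4, 8, 14, 17, 20, 25, 27, 34, 36, 40, 46, 51, 54, 56,
            63, 66, 71, 72, 77, 82, 85, 91, 95, 97, 103, 107, 108},
        "vm7", new int[]{2, 5, 7, 11, 15, 21, 26, 28, 31, 38, 42, 44, 47, 50, 52,
            57, 60, 67, 70, 74, 79, 83, 87, 90, 93, 98, 101, 105, 110},
        "vm10", new int[]{3, 10, 13, 16, 22, 23, 30, 32, 37, 39, 43, 48, 55, 58,
            62, 64, 68, 73, 78, 81, 84, 88, 92, 99, 100, 106, 109, 112},
        "vm11", new int[]{1, 6, 9, 12, 18, 19, 24, 29, 33, 35, 41, 45, 49, 53, 59,
            61, 65, 69, 75, 76, 80, 86, 89, 94, 96, 102, 104, 111}));

    public static String primaryOf(Map<Integer, String> run, int bucketId) {
        return run.get(bucketId);
    }

    public static long movedCount() {
        return RUN1.keySet().stream()
            .filter(b -> !RUN1.get(b).equals(RUN2.get(b)))
            .count();
    }

    public static void main(String[] args) {
        System.out.println("buckets with a new primary: " + movedCount());
        System.out.println("bucket 61: " + primaryOf(RUN1, 61)
            + " -> " + primaryOf(RUN2, 61)); // prints: bucket 61: vm7 -> vm11
    }
}
{code}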

Events (keys) that did not replicate to remote site:
{code:java}
[vm9] Jale result: [500, 501, 502, 503, 504, 506, 508, 513, 514, 515, 517, 518, 
527, 528, 529, 539, 541, 544, 545, 547, 548, 549, 551, 556, 557, 558, 559, 563, 
565, 566, 567, 568, 569, 572, 573, 574, 575, 576, 577, 579, 580, 588, 589, 594, 
595, 597, 600, 602, 603, 605, 609, 613, 614, 615, 616, 617, 619, 621, 626, 627, 
628, 630, 631, 640, 641, 642, 652, 654, 657, 658, 660, 661, 662, 664, 669, 670, 
671, 672, 676, 678, 679, 680, 681, 682, 685, 686, 687, 688, 689, 690, 692, 693, 
701, 702, 707, 708, 710, 713, 715, 716, 718, 722, 726, 727, 728, 729, 730, 732, 
734, 739, 740, 741, 743, 744, 753, 754, 755, 765, 767, 770, 771, 773, 774, 775, 
777, 782, 783, 784, 785, 789, 791, 792, 793, 794, 795, 798, 799, 800, 801, 802, 
803, 805, 806, 814, 815, 820, 821, 823, 826, 828, 829, 831, 835, 839, 840, 841, 
842, 843, 845, 847, 852, 853, 854, 856, 857, 866, 867, 868, 878, 880, 883, 884, 
886, 887, 888, 890, 895, 896, 897, 898, 902, 904, 905, 906, 907, 908, 911, 912, 
913, 914, 915, 916, 918, 919, 927, 928, 933, 934, 936, 939, 941, 942, 944, 948, 
952, 953, 954, 955, 956, 958, 960, 965, 966, 967, 969, 970, 979, 980, 981, 991, 
993, 996, 997, 999, 1000, 1001, 1003, 1008, 1009, 1010, 1011, 1015, 1017, 1018, 
1019, 1020, 1021, 1024, 1025, 1026, 1027, 1028, 1029, 1031, 1032, 1040, 1041, 
1046, 1047, 1049, 1052, 1054, 1055, 1057, 1061, 1065, 1066, 1067, 1068, 1069, 
1071, 1073, 1078, 1079, 1080, 1082, 1083, 1092, 1093, 1094, 1104, 1106, 1109, 
1110, 1112, 1113, 1114, 1116, 1121, 1122, 1123, 1124, 1128, 1130, 1131, 1132, 
1133, 1134, 1137, 1138, 1139, 1140, 1141, 1142, 1144, 1145, 1153, 1154, 1159, 
1160, 1162, 1165, 1167, 1168, 1170, 1174, 1178, 1179, 1180, 1181, 1182, 1184, 
1186, 1191, 1192, 1193, 1195, 1196, 1205, 1206, 1207, 1211, 1217, 1219, 1222, 
1223, 1225, 1226, 1227, 1229, 1231, 1234, 1235, 1236, 1237, 1241, 1243, 1244, 
1245, 1246, 1247, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258, 1264, 1266, 
1267, 1272, 1273, 1275, 1278, 1280, 1281, 1283, 1287, 1291, 1292, 1293, 1294, 
1295, 1297, 1299, 1304, 1305, 1306, 1308, 1309, 1318, 1319, 1320, 1330, 1332, 
1335, 1336, 1338, 1339, 1340, 1342, 1343, 1347, 1348, 1349, 1350, 1353, 1354, 
1356, 1357, 1358, 1359, 1360, 1363, 1364, 1365, 1366, 1367, 1368, 1370, 1371, 
1379, 1380, 1381, 1385, 1386, 1388, 1389, 1391, 1393, 1394, 1396, 1400, 1401, 
1403, 1404, 1405, 1406, 1407, 1408, 1410, 1411, 1412, 1414, 1417, 1418, 1419, 
1421, 1422, 1431, 1432, 1433, 1439, 1443, 1444, 1445, 1448, 1449, 1451, 1452, 
1453, 1455, 1456, 1460, 1461, 1462, 1463, 1466, 1467, 1469, 1470, 1471, 1472, 
1473, 1476, 1477, 1478, 1479, 1480, 1481, 1483, 1484, 1487, 1492, 1493, 1494, 
1498, 1499]{code}
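A quick way to connect a missing key to a queue bucket: with no custom 
PartitionResolver, Geode's default routing computes the bucket as 
abs(key.hashCode() % totalNumBuckets). The logs show bucket IDs 0..112, so I 
assume the dunit test runs with total-num-buckets=113 and Integer keys (both 
are my assumptions, not confirmed by the logs). Under those assumptions, key 
513 lands in bucket 61, matching the log entries for that key:
{code:java}
// Sketch of Geode's default bucket routing (no custom PartitionResolver):
// bucketId = abs(key.hashCode() % totalNumBuckets).
// Assumptions: total-num-buckets=113 (the logs show IDs 0..112) and
// Integer keys (Integer.hashCode() is the int value itself).
public class BucketForKey {
    static final int TOTAL_NUM_BUCKETS = 113;

    static int bucketIdFor(Object key) {
        return Math.abs(key.hashCode() % TOTAL_NUM_BUCKETS);
    }

    public static void main(String[] args) {
        System.out.println(bucketIdFor(513)); // 513 % 113 = 61
    }
}
{code}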
Below are the logs for one key that did not replicate, *key=513*, which was 
intended for bucket *bucketId=61*:

*[vm10]* [warn 2022/04/06 13:51:17.514 CEST server-10 <P2P message reader for 
jakov(server-11:3996)<v4>:41010 unshared ordered sender uid=42 dom #2 local 
port=60262 remote port=59158> tid=0xc1] *Put successfully in the secondary 
queue* : 
GatewaySenderEventImpl[id=EventID[id=25 
bytes;threadID=0x1003d|869317;sequenceID=513;bucketId=61];action=0;operation=CREATE;region=/test1;{*}key=513;value=513{*};valueIsObject=1;numberOfParts=9;callbackArgument=GatewaySenderEventCallbackArgument
 
[originalCallbackArg=null;originatingSenderId=2;recipientGatewayReceivers=\\{1}];possibleDuplicate=false;creationTime=1649245877513;shadowKey=626;timeStamp=1649245877499;acked=false;dispatched=false;bucketId=61;isConcurrencyConflict=false;transactionId=null;isLastEventInTransaction=false]
 and bucket: true was initialized: {}

*[vm11]* [warn 2022/04/06 13:51:17.517 CEST server-11 <P2P message reader for 
jakov(server-7:2520)<v2>:41008 unshared ordered sender uid=38 dom #1 local 
port=56317 remote port=60624> tid=0xa5] *Put successfully in the secondary 
queue* : 
GatewaySenderEventImpl[id=EventID[id=25 
bytes;threadID=0x1003d|869317;sequenceID=513;bucketId=61];action=0;operation=CREATE;region=/test1;{*}key=513;value=513{*};valueIsObject=1;numberOfParts=9;callbackArgument=GatewaySenderEventCallbackArgument
 
[originalCallbackArg=null;originatingSenderId=2;recipientGatewayReceivers=\\{1}];possibleDuplicate=false;creationTime=1649245877515;shadowKey=626;timeStamp=1649245877499;acked=false;dispatched=false;bucketId=61;isConcurrencyConflict=false;transactionId=null;isLastEventInTransaction=false]
 and bucket: true was initialized: {}

The logs above also show that the bucket advisor treats the primary queue 
bucket on the vm11 server as a secondary one. After the region was re-created, 
the vm11 server became the primary for that bucket; before that, the vm7 server 
was the primary. I am not sure yet why this happens and will continue to 
investigate.



> WAN replication not working after re-creating the partitioned region
> --------------------------------------------------------------------
>
>                 Key: GEODE-10215
>                 URL: https://issues.apache.org/jira/browse/GEODE-10215
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: needsTriage
>
> Steps to reproduce the issue:
> Start a multi-site setup with at least 3 servers on each site. If there are 
> fewer than three servers, the issue will not reproduce.
> Configuration site 1:
> {code:java}
> create disk-store --name=queue_disk_store --dir=ds2
> create gateway-sender --id="remote_site_2" --parallel="true" 
> --remote-distributed-system-id="1" --enable-persistence=true 
> --disk-store-name=queue_disk_store
> create disk-store --name=data_disk_store --dir=ds1
> create region --name=example-region --type=PARTITION_PERSISTENT 
> --gateway-sender-id="remote_site_2" --disk-store=data_disk_store 
> --total-num-buckets=1103 --redundant-copies=1 --enable-synchronous-disk=false
> #Configure the remote site 2 with the region and the gateway-receiver  
> #Run some traffic so that all buckets are created and data is replicated to 
> the other site
> alter region --name=/example-region --gateway-sender-id=""
> destroy region --name=/example-region
> create region --name=example-region --type=PARTITION_PERSISTENT 
> --gateway-sender-id="remote_site_2" --disk-store=data_disk_store 
> --total-num-buckets=1103 --redundant-copies=1 --enable-synchronous-disk=false
> #run traffic to see that some data is not replicated to the remote site 2 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
