We are still experiencing replication stalls. What additional information can I provide to identify the underlying issue?
Thanks,
Suresh

On Fri, Jan 3, 2025 at 7:52 AM Suresh Veliveli <[email protected]> wrote:

> The OS is Rock9, AWS EC2 instance.
>
> On Thu, Jan 2, 2025 at 10:32 PM Suresh Veliveli <[email protected]> wrote:
>
>> This is another instance where the replication stops.
>>
>> aaa-prod-aws-12:1636
>> # requesting: contextCSN
>> contextCSN: *20250102015911.702871Z#000000#000#000000*
>>
>> All the relevant logs and info:
>>
>> dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
>> structuralObjectClass: olmSyncReplInstance
>> creatorsName:
>> modifiersName:
>> createTimestamp: 20241209130653Z
>> modifyTimestamp: 20241209130653Z
>> olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/
>> olmSRConnection: IP=172.20.86.12:49880
>> olmSRSyncPhase: Persist
>> olmSRNextConnect: 00000101000000Z
>> olmSRLastConnect: 20241229203510Z
>> olmSRLastContact: 20250102015934Z
>> olmSRLastCookieRcvd: rid=152,csn=*20250102015911.702871Z#000000#000#000000*
>> olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000
>> entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
>> subschemaSubentry: cn=Subschema
>> hasSubordinates: FALSE
>>
>> *Consumer:*
>> netstat -an | grep 49880
>> tcp 0 0 172.20.86.12:49880 172.17.21.52:636 ESTABLISHED
>>
>> *Master:*
>> netstat -an | grep 172.20.86.12
>> tcp 0 0 172.17.21.52:636 172.20.86.12:49880 ESTABLISHED
>>
>> *Master logs:*
>> Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000
>> Jan 1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
>>
>> Nothing about rid=152 is logged after the above
>>
>> *Consumer logs:*
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
>> Jan 1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
>>
>> Nothing about replication is logged after the above.
>>
>> Thanks,
>> Suresh
>>
>> On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník <[email protected]> wrote:
>>
>>> On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
>>> > Another instance:
>>> > Yes, TCP keepalive is enabled.
>>>
>>> So is the TCP connection still open from the point of both servers? See
>>> in netstat or ss.
>>>
>>> > aaa-prod-aws-7:1636
>>> > # requesting: contextCSN
>>> > *contextCSN: 20250101065905.147164Z#000000#000#000000*
>>> >
>>> > aaa-prod-aws-7:2636
>>> > # requesting: contextCSN
>>> > contextCSN: 20250102140005.217756Z#000000#000#000000
>>> >
>>> > dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor
>>> > objectClass: olmSyncReplInstance
>>> > cn: Consumer 147
>>>
>>> All the data in cn=monitor is contained in the operational attributes,
>>> as such, you'll have to request them either by name specifically,
>>> objectClass ('@olmSyncReplInstance') or blanket '+', maybe also '*' if
>>> you want regular attributes as well.
>>>
>>> > *Consumer logs:*
>>> >
>>> > [...]
>>> >
>>> > (Nothing after the above is logged regarding replication)
>>> >
>>> > *Master:*
>>> >
>>> > Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000
>>> > Jan 1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000
>>> > (Nothing after the above for rid=147)
>>>
>>> This gives you the string to search for: searching for "conn=1034 op=1"
>>> here would give you the messages related to the replication session
>>> above. You'll see what happens on the provider and correlate that with
>>> what the consumer. For every new consumer session there will be a new
>>> "conn=xxx op=yyy" to search for.
>>>
>>> Regards,
>>>
>>> --
>>> Ondřej Kuzník
>>> Senior Software Engineer
>>> Symas Corporation http://www.symas.com
>>> Packaged, certified, and supported LDAP solutions powered by OpenLDAP
>>>
>>
>>
>> --
>> Suresh Veliveli
>> Sr. UNIX Systems Engineer
>> Georgetown University
>> University Information Services | Security Infrastructure and Policy-Identity and Collaboration
>> 202-262-6676 (cell) | 202-687-3108 (work)
>>
>
>
> --
> Suresh Veliveli
> Sr. UNIX Systems Engineer
> Georgetown University
> University Information Services | Security Infrastructure and Policy-Identity and Collaboration
> 202-262-6676 (cell) | 202-687-3108 (work)
>

--
Suresh Veliveli
Sr. UNIX Systems Engineer
Georgetown University
University Information Services | Security Infrastructure and Policy-Identity and Collaboration
202-262-6676 (cell) | 202-687-3108 (work)
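In case it helps anyone following the thread, here is a rough Python sketch of the two checks discussed above: pulling out the provider log lines for one replication session by its "conn=... op=..." string, and measuring how far a stalled contextCSN trails a current one. The function names (csn_time, csn_lag, session_lines) are made up for illustration and are not part of OpenLDAP. The timestamp layout is assumed from the CSN values quoted in this thread.

```python
from datetime import datetime, timedelta

# Assumed layout of the leading CSN field, based on the values in this thread,
# e.g. "20250101065905.147164Z#000000#000#000000".
CSN_TIME_FORMAT = "%Y%m%d%H%M%S.%fZ"


def csn_time(csn: str) -> datetime:
    """Return the timestamp portion of a CSN (the text before the first '#')."""
    return datetime.strptime(csn.split("#", 1)[0], CSN_TIME_FORMAT)


def csn_lag(stalled_csn: str, current_csn: str) -> timedelta:
    """Return how far the stalled CSN trails the current one."""
    return csn_time(current_csn) - csn_time(stalled_csn)


def session_lines(log_lines, conn: int, op: int):
    """Return the log lines belonging to one 'conn=N op=M' session.

    The trailing space in the marker keeps op=1 from also matching op=10.
    """
    marker = f"conn={conn} op={op} "
    return [line for line in log_lines if marker in line]


if __name__ == "__main__":
    # The two contextCSN values reported for aaa-prod-aws-7 (ports 1636 and 2636).
    lag = csn_lag(
        "20250101065905.147164Z#000000#000#000000",
        "20250102140005.217756Z#000000#000#000000",
    )
    print("lag between the two contextCSN values:", lag)
```

To use session_lines, read the provider's syslog file into a list of lines and pass the connection and operation numbers from the last syncprov_sendresp message, for example conn=1034 and op=1 for rid=147.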
