We are still experiencing replication stalls. What additional information
can I provide to identify the underlying issue?

Thanks,
Suresh

On Fri, Jan 3, 2025 at 7:52 AM Suresh Veliveli <
[email protected]> wrote:

> The OS is Rocky Linux 9, running on an AWS EC2 instance.
>
> On Thu, Jan 2, 2025 at 10:32 PM Suresh Veliveli <
> [email protected]> wrote:
>
>> This is another instance where the replication stops.
>>
>>  aaa-prod-aws-12:1636
>> # requesting: contextCSN
>> contextCSN: *20250102015911.702871Z#000000#000#000000*
>>
>> All the relevant logs and info:
>>
>> dn: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
>> structuralObjectClass: olmSyncReplInstance
>> creatorsName:
>> modifiersName:
>> createTimestamp: 20241209130653Z
>> modifyTimestamp: 20241209130653Z
>> olmSRProviderURIList: ldaps://aaa-master-1.uis.georgetown.edu:636/
>> olmSRConnection: IP=172.20.86.12:49880
>> olmSRSyncPhase: Persist
>> olmSRNextConnect: 00000101000000Z
>> olmSRLastConnect: 20241229203510Z
>> olmSRLastContact: 20250102015934Z
>> olmSRLastCookieRcvd: rid=152,csn=*20250102015911.702871Z#000000#000#000000*
>> olmSRLastCookieSent: rid=152,csn=20241229202835.459483Z#000000#000#000000
>> entryDN: cn=Consumer 152,cn=Database 1,cn=Databases,cn=Monitor
>> subschemaSubentry: cn=Subschema
>> hasSubordinates: FALSE
>>
>> *Consumer:*
>> netstat -an | grep 49880
>> tcp        0      0 172.20.86.12:49880      172.17.21.52:636        ESTABLISHED
>>
>> *Master:*
>> netstat -an | grep 172.20.86.12
>> tcp        0      0 172.17.21.52:636        172.20.86.12:49880      ESTABLISHED
>>
>> *Master logs:*
>> Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.686467Z#000000#000#000000
>> Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
>>
>> Nothing about rid=152 is logged after the above
>>
>> *Consumer logs:*
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: do_syncrep2: rid=152 cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: syncrepl_entry: rid=152 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) csn=20250102015911.702871Z#000000#000#000000 tid 0x7f7a753fc640
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a687c6190 20250102015911.702871Z#000000#000#000000
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_queue_csn: queueing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
>> Jan  1 20:59:34 aaa-prod-aws-12 slapd[1229307]: slap_graduate_commit_csn: removing 0x7f7a6877d9b0 20250102015911.702871Z#000000#000#000000
>>
>> Nothing about replication is logged after the above.
>>
>> Thanks,
>> Suresh
>>
>> On Thu, Jan 2, 2025 at 10:08 AM Ondřej Kuzník <[email protected]>
>> wrote:
>>
>>> On Thu, Jan 02, 2025 at 09:39:34AM -0500, Suresh Veliveli wrote:
>>> > Another instance:
>>> > Yes, TCP keepalive is enabled.
>>>
>>> So is the TCP connection still open from the point of view of both
>>> servers? Check with netstat or ss.
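
One way to make this check repeatable is to compare the socket as seen from each side. The following is a minimal sketch, not part of the original exchange: it assumes `netstat -an`-style lines (protocol, queues, local address, remote address, state), the helper names `parse_netstat_line` and `same_connection` are hypothetical, and the two sample lines are the netstat output quoted elsewhere in this thread.

```python
from typing import NamedTuple, Optional


class Socket(NamedTuple):
    local: str
    remote: str
    state: str


def parse_netstat_line(line: str) -> Optional[Socket]:
    """Parse one TCP line of `netstat -an` output; return None if it does not fit."""
    fields = line.split()
    # Assumed layout: proto, recv-q, send-q, local address, remote address, state
    if len(fields) < 6 or not fields[0].startswith("tcp"):
        return None
    return Socket(local=fields[3], remote=fields[4], state=fields[5])


def same_connection(a: Socket, b: Socket) -> bool:
    """True when the two sockets are the two ends of one established connection."""
    return (
        a.local == b.remote
        and a.remote == b.local
        and a.state == b.state == "ESTABLISHED"
    )


consumer = parse_netstat_line(
    "tcp        0      0 172.20.86.12:49880      172.17.21.52:636        ESTABLISHED"
)
provider = parse_netstat_line(
    "tcp        0      0 172.17.21.52:636        172.20.86.12:49880      ESTABLISHED"
)
print(same_connection(consumer, provider))
```
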
>>>
>>> > aaa-prod-aws-7:1636
>>> > # requesting: contextCSN
>>> > *contextCSN: 20250101065905.147164Z#000000#000#000000*
>>> >
>>> > aaa-prod-aws-7:2636
>>> > # requesting: contextCSN
>>> > contextCSN: 20250102140005.217756Z#000000#000#000000
>>> >
>>> > dn: cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor
>>> > objectClass: olmSyncReplInstance
>>> > cn: Consumer 147
>>>
>>> All the data in cn=monitor is contained in operational attributes, so
>>> you'll have to request them explicitly: either by name, by objectClass
>>> ('@olmSyncReplInstance'), or with a blanket '+'. Add '*' if you want the
>>> regular attributes as well.
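
As a rough illustration of those options, the sketch below builds an `ldapsearch` command line for a single monitor entry. It is not from the original reply: the helper name `monitor_search_args` is hypothetical, the URI is a placeholder, and whatever authentication options the server needs (simple bind, SASL, TLS settings) are deliberately left out.

```python
import shlex
from typing import List, Sequence


def monitor_search_args(uri: str, base_dn: str, attrs: Sequence[str] = ("+",)) -> List[str]:
    """Build an ldapsearch argument list for a base-scope read of one monitor entry.

    attrs may hold attribute names, '@objectClass' selectors, '+' for all
    operational attributes, and '*' for all user attributes.
    """
    return ["ldapsearch", "-LLL", "-H", uri, "-s", "base", "-b", base_dn, *attrs]


# Placeholder URI (an assumption); substitute the consumer's own listener.
uri = "ldap://consumer.example.org:1636"
entry = "cn=Consumer 147,cn=Database 1,cn=Databases,cn=Monitor"

by_name = monitor_search_args(uri, entry, ["olmSRSyncPhase", "olmSRLastContact"])
by_class = monitor_search_args(uri, entry, ["@olmSyncReplInstance"])
everything = monitor_search_args(uri, entry, ["+", "*"])

# shlex.join quotes the DN and '*' so the line can be pasted into a shell.
print(shlex.join(everything))
```
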
>>>
>>> > *Consumer logs:*
>>> >
>>> > [...]
>>> >
>>> > (Nothing after the above is logged regarding replication)
>>> >
>>> > *Master:*
>>> >
>>> > Jan  1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.124585Z#000000#000#000000
>>> > Jan  1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000
>>> > (Nothing after the above for rid=147)
>>>
>>> This gives you the string to search for: searching for "conn=1034 op=1"
>>> here would give you the messages related to the replication session
>>> above. You'll see what happens on the provider and can correlate that
>>> with what the consumer logs. For every new consumer session there will
>>> be a new "conn=xxx op=yyy" to search for.
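
A minimal sketch of that search, offered as an illustration and not as anything from the original reply: it assumes syslog-style slapd lines like the ones quoted in this thread, and the helper name `session_lines` is hypothetical. It keeps only the lines that carry a given connection and operation number.

```python
import re
from typing import Iterable, List


def session_lines(lines: Iterable[str], conn: int, op: int) -> List[str]:
    """Return the log lines that belong to one 'conn=N op=M' session."""
    # \b keeps conn=1034 from also matching conn=10340, and op=1 from op=10.
    pattern = re.compile(rf"\bconn={conn}\b.*\bop={op}\b")
    return [line for line in lines if pattern.search(line)]


# Two provider lines taken from this thread, one per replication session.
sample = [
    "Jan  1 01:59:05 aaa-prod-master-1 slapd[3281130]: conn=1034 op=1 "
    "syncprov_sendresp: cookie=rid=147,csn=20250101065905.147164Z#000000#000#000000",
    "Jan  1 20:59:18 aaa-prod-master-1 slapd[3281130]: conn=1035 op=1 "
    "syncprov_sendresp: cookie=rid=152,csn=20250102015911.702871Z#000000#000#000000",
]

for line in session_lines(sample, conn=1034, op=1):
    print(line)
```

Against a real log, the same function can be given an open file object in place of the list, since it only iterates over lines.
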
>>>
>>> Regards,
>>>
>>> --
>>> Ondřej Kuzník
>>> Senior Software Engineer
>>> Symas Corporation                       http://www.symas.com
>>> Packaged, certified, and supported LDAP solutions powered by OpenLDAP
>>>
>>
>>
>> --
>> Suresh Veliveli
>> Sr. UNIX Systems Engineer
>> Georgetown University
>> University Information Services | Security Infrastructure and
>> Policy-Identity and Collaboration
>> 202-262-6676 (cell) | 202-687-3108 (work)
>>


-- 
Suresh Veliveli
Sr. UNIX Systems Engineer
Georgetown University
University Information Services | Security Infrastructure and
Policy-Identity and Collaboration
202-262-6676 (cell) | 202-687-3108 (work)
