Hi, I need help understanding what kind of processes happen behind the scenes 
of the 389ds server in my case, so that I can come up with better approaches 
for troubleshooting further problems.

In our environment we run FreeIPA instances in Docker, but neither the FreeIPA 
part nor the Docker part actually matters that much, because the ns-slapd 
service is the main actor in the "problem" I faced.

Prerequisites
-------------
Host has 4 CPUs, 8 GB RAM

FreeIPA v.4.9.8

ns-slapd -v
389 Project
389-Directory/1.4.3.28 B2022.074.1358
-------------

Most of the time the host does not experience any significant load, though RAM 
usage (talking about "used") can grow linearly over time, but that by itself 
makes no odds. After some moment (now I can say with some certainty that it's 
because buff/cache grows) the host begins to experience heavy disk write load, 
which I notice via CPU IOWait alerts.
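For reference, the counters I ended up watching can be pulled straight from 
/proc on any Linux host (a minimal sketch, no extra tooling assumed):

```shell
# Snapshot of writeback-related counters: Dirty and Writeback growing
# while pgpgout climbs matches the IOWait pattern described above.
grep -E '^(MemFree|Buffers|Cached|Dirty|Writeback):' /proc/meminfo
# pgpgout = KB paged out to disk, pgmajfault = major page faults
grep -E '^(pgpgout|pgmajfault) ' /proc/vmstat
```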

cpu iowait
https://imgur.com/ifXPX5N

disk IO and latency
https://imgur.com/HyT0cNR

The increase in load is not due to external causes (an increase in ldap 
queries). Actually, it was "page out" operations that were increasing.

https://imgur.com/OxofSD9

and the process exporter also showed increased major page faults, which served 
as an indicator of memory problems (memory shortage, though? there was plenty 
of RAM)

https://imgur.com/xt2p933
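To cross-check the exporter numbers, the cumulative major-fault count of a 
single process can be read directly from /proc/PID/stat (field 12 is majflt; 
this quick awk assumes the process name contains no spaces, which holds for 
ns-slapd):

```shell
# Print the cumulative major page faults for a PID; sampling this twice
# and diffing gives a fault rate. The current shell's PID is used here
# only so the example runs standalone.
PID=$$
awk '{print "majflt:", $12}' "/proc/$PID/stat"
```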

Memory status during the problem
--------------------------------
Tasks: 175 total,   1 running, 174 sleeping,   0 stopped,   0 zombie
%CPU(s):  0.7 us,  0.7 sy,  0.0 ni, 29.1 id, 69.4 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7957.3 total,    414.3 free,   2332.7 used,   5210.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5300.7 avail Mem 

I tried some other tools hoping they would show that something was wrong. I 
didn't find anything conclusive, but I will still add their output here for 
the full picture.

Here the output of ext4slower 
(https://github.com/iovisor/bcc/blob/master/tools/ext4slower_example.txt)
shows that fsync syscalls issued by ns-slapd are taking too long:

TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
07:00:02 ns-slapd       4017   S 0       0        1455.28 log.0000006661
07:00:10 ns-slapd       4017   S 0       0        1133.62 log.0000006661
07:00:11 ns-slapd       4017   S 0       0        1248.36 log.0000006661
07:01:02 ns-slapd       4017   S 0       0         736.46 log.0000006661
07:01:05 ns-slapd       4017   S 0       0        1541.29 access
07:01:07 ns-slapd       4017   S 0       0        1220.32 log.0000006661
07:01:08 ns-slapd       4017   S 0       0         832.87 log.0000006661
07:01:08 ns-slapd       4017   S 0       0          93.07 entryusn.db
07:01:09 ns-slapd       4017   S 0       0         418.88 description.db
07:01:10 ns-slapd       4017   S 0       0         822.64 nsuniqueid.db
07:01:10 ns-slapd       4017   S 0       0         417.25 displayname.db
07:01:10 ns-slapd       4017   S 0       0        1493.60 log.0000006661
07:01:11 ns-slapd       4017   S 0       0         617.68 memberHost.db
07:01:12 ns-slapd       4017   S 0       0        1684.12 log.0000006661
07:01:12 ns-slapd       4017   S 0       0         434.58 parentid.db
07:01:12 ns-slapd       4017   S 0       0         601.09 accessRuleType.db
07:01:13 ns-slapd       4017   S 0       0        1051.10 memberUser.db
07:01:14 ns-slapd       4017   S 0       0        1056.28 ipakrbprincipalalias.db
07:01:14 ns-slapd       4017   S 0       0        1803.00 log.0000006661
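Summing that table per file makes it clearer where the fsync time goes (a 
rough post-processing sketch; the sample rows below are copied from the 
capture above, whitespace collapsed):

```shell
# Aggregate fsync latency by file: column 7 is LAT(ms), column 8 FILENAME.
cat > ext4slower.txt <<'EOF'
07:00:02 ns-slapd 4017 S 0 0 1455.28 log.0000006661
07:00:10 ns-slapd 4017 S 0 0 1133.62 log.0000006661
07:01:05 ns-slapd 4017 S 0 0 1541.29 access
07:01:08 ns-slapd 4017 S 0 0 93.07 entryusn.db
EOF
awk '{lat[$8] += $7} END {for (f in lat) printf "%-20s %10.2f ms total\n", f, lat[f]}' ext4slower.txt
```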

And here is the output of cachetop 
(https://github.com/iovisor/bcc/blob/master/tools/cachetop_example.txt).
I'm honestly not sure how to interpret these results, but maybe they will be 
helpful:

07:15:30 Buffers MB: 97 / Cached MB: 4212 / Sort: HITS / Order: descending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
4017 UNKNOWN( ns-slapd             7259     7177     7177       0.6%       0.0%
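My rough back-of-the-envelope reading of that row (assuming HITS and MISSES 
can simply be summed into an overall page-cache hit ratio, which may not match 
cachetop's own READ_HIT% formula):

```shell
# 7259 hits vs 7177 misses from the row above: roughly every second
# page-cache access misses, and DIRTIES == MISSES hints the misses are
# newly dirtied (written) pages rather than reads.
awk 'BEGIN { hits = 7259; misses = 7177; printf "hit ratio: %.1f%%\n", 100 * hits / (hits + misses) }'
```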

So, in my attempts to find a solution I tried restarting the FreeIPA 
container, but it had no effect. In the end, I stopped the container, dropped 
all memory caches (sync; echo 3 > /proc/sys/vm/drop_caches), and started the 
container again, and only after that was the excessive paging out (and disk 
IO) gone.
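One mitigation angle I plan to look at (an assumption on my side, not 
something I have verified fixes this) is the kernel writeback thresholds, 
which decide how much dirty data may accumulate before flushing kicks in; they 
are readable via /proc:

```shell
# Writeback tunables: lowering dirty_background_ratio makes background
# flushing start earlier, which can smooth out bursty disk writes.
for f in dirty_ratio dirty_background_ratio dirty_expire_centisecs; do
    printf '%s = %s\n' "$f" "$(cat "/proc/sys/vm/$f")"
done
```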

So my questions are:
1) What processes behind the scenes of the 389ds server could cause such behavior?
2) Is it (kind of) normal?
3) What actions would you suggest to mitigate such behavior? Would the ldap 
metrics in cn=monitor have helped to determine that something was wrong? 
(Unfortunately they are not currently monitored, but they will be.)
4) Generally, what else would you suggest monitoring? Maybe some metrics that 
would be of help but are not very obvious.
_______________________________________________
FreeIPA-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/[email protected]
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue