On 8/9/23 17:15, Harry G Coin wrote:
On 8/9/23 01:00, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
Thanks for your help. Details below. The problem 'moved' in, I hope,
a diagnostically useful way, but the system remains broken.
On 8/8/23 08:54, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
On 8/8/23 02:43, Alexander Bokovoy wrote:
pstack $(pgrep ns-slapd) > ns-slapd.log
Tried an upgrade from 4.9.10 to 4.9.11. The "writeback to ldap
failed" error moved from the primary instance (on which the dns
records were being added) to the replica, which hung in the same
fashion. Here's the log you asked for, taken while attempting 'systemctl
restart dirsrv@...'; it just hangs at 100% cpu for about 10 minutes.
Thank you. Are you using schema compat for some legacy clients?
This is a fresh install of 4.9.10 from about a week ago, upgraded to
4.9.11 yesterday: just two freeipa instances with no appreciable user
load, using the install defaults. The 'in house' system then starts
loading lots of dns records via the python ldap2 interface on the
first of the two systems installed; the replica produced what you see
in this post. There is no 'private' information involved of any sort.
It's supposed to field DNS calls from the public, but it was so
unreliable that I had to implement unbound on other servers, so all
freeipa does is IXFR to unbound for the heavy load. I suppose there
may be <16 other in-house lab systems, maybe 2 or 3 with any
activity, that use it for dns. The only other clue is that these are
running on VMs on older servers and have no software packages
installed other than freeipa, what freeipa needs to run, and the
in-house program that loads the dns.
Just to exclude potential problems with schema compat, it can be
disabled if you are not using it.
How? The installs just use all the defaults, other than enabling
dnssec and PTR records for all a/aaaa.
I'm officially in 'desperation mode' as not being able to populate DNS
in freeipa reduces everyone to pencil and paper and coffee with full
project stoppage until it's fixed or at least 'worked around'. So
anything that 'might help' can be sacrificed so at least 'something'
works 'somewhat'. If old AD needs to be 'broken' or 'off' but mostly
the rest of it 'works sort of' then how do I do it?
Really this can't be hard to reproduce, it's just two instances with a
1G link between them, each with a pair of old rusty hard drives in an
lvm mirror using a COW file system, dnssec on, and one of them loading
lots of dns with reverse pointers for each A/AAAA with maybe 200 to
600 PTR records per *arpa and maybe 10-200 records per subdomain,
maybe 200 domains total. A couple of python for loops and hey presto,
you'll see freeipa lock up without notice in your lab as well. I just
can't imagine that triggering these race conditions would be difficult
when the only significant load is DNS adds/finds/shows.
I appreciate the help, and have become officially fearful about
freeipa. Maybe it's seldom used extensively for DNS and so my use
case is an outlier? Why are so few seeing this? It's a fully
default package install, no custom changes to the OS, freeipa, other
packages. I don't get it.
Thanks for any leads or help!
Hi Harry,
I agree with Mark, nothing suspicious on Thread 30. It is flushing its txn.
The discussion is quite long; would you mind re-explaining the
current symptoms?
Is it hanging during updates? Consuming CPU?
Could you run 'top -H -p <pid> -n 5 -d 3'?
If it is hanging, could you run 'db_stat -CA -h /dev/shm/slapd-<inst>/ -N'?
regards
thierry
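For the "consuming CPU?" question, a per-thread view can also be taken from procfs when top output is awkward to capture. The sketch below is not the top -H command requested above; it is a substitute that reads /proc/<pid>/task/<tid>/stat twice and reports which threads accumulated CPU ticks in between. It assumes Linux, and the helper names (parse_thread_cpu_ticks, snapshot, busiest_threads) are made up for illustration.

```python
import os
import time


def parse_thread_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces,
    # so split on the last ')' rather than on whitespace alone.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); utime is field 14, stime is field 15.
    return int(after_comm[11]) + int(after_comm[12])


def snapshot(pid):
    """Map thread id -> CPU ticks used so far, for every thread of pid."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                ticks[int(tid)] = parse_thread_cpu_ticks(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir() and open()
    return ticks


def busiest_threads(pid, interval=3.0, top_n=5):
    """Return [(tid, ticks_used_during_interval), ...], busiest first."""
    before = snapshot(pid)
    time.sleep(interval)
    after = snapshot(pid)
    # Only compare threads that were present in both snapshots.
    deltas = {tid: after[tid] - before[tid] for tid in after if tid in before}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Calling busiest_threads(pid) with the pid reported by 'pgrep ns-slapd' returns thread ids that are the same LWP numbers pstack prints, so a busy thread can be matched against its stack.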
I don't think it is about named per se; it is a bit of an unfortunate
interop inside ns-slapd between different plugins. bind-dyndb-ldap
relies on the syncrepl extension, whose implementation in ns-slapd
uses the retro changelog content. The retro changelog plugin triggers
some updates that cause the schema compatibility plugin to lock itself
up, depending on the order of updates that the retro changelog
captures. We fixed that in the slapi-nis package some time ago and it
*should* be ignoring the retro changelog changes, but somehow they
still propagate into it. There are a few places in ns-slapd which were
addressed just recently and those updates might help (out later this
year in RHEL).
Disabling schema compat would be the best.
What's worse, every reboot attempt waits the full '9 min 29 secs'
before systemd forcibly terminates ns-slapd to finish the 'stop job'.
That's why I'm so troubled by all this, it's not like there is any
interference from anything other than what freeipa puts out there,
and it just locks with a message that gives no indication of what to
do about it, with nothing in any logs and 'systemctl
is-system-running' reports 'running'.
You could easily replicate this: imagine a simple validation test
that sets up two freeipa nodes, turns on dnssec, creates some
domains, then adds A, AAAA and *.arpa records using the ldap2 api on
one of the nodes. Maybe limit the net speed between the nodes to a
typical 1G link, with at most 4 processor cores of some older
vintage and 5GB of memory. It takes less than 2 minutes after the dns
load starts to lock up.
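The data side of such a test can be sketched in a few lines of Python. This only generates the A/AAAA records and the matching PTR owner names with the standard ipaddress module; how each tuple is pushed into FreeIPA (the ldap2 interface mentioned above) is deliberately left out. The function names, the lab<N>.example.test domains and the documentation address ranges are placeholders, not anything from the actual setup.

```python
import ipaddress


def ptr_record(address, fqdn):
    """Return (owner_name, target) of the PTR record matching one A/AAAA address."""
    ip = ipaddress.ip_address(address)
    # reverse_pointer gives the in-addr.arpa name for IPv4 and the
    # nibble-format ip6.arpa name for IPv6, without a trailing dot.
    return ip.reverse_pointer + ".", fqdn.rstrip(".") + "."


def generate_load(domains, hosts_per_domain, v4_net, v6_net):
    """Yield (owner_name, rrtype, value) for every forward and reverse record."""
    v4_pool = iter(ipaddress.ip_network(v4_net).hosts())
    v6_pool = iter(ipaddress.ip_network(v6_net).hosts())
    for d in range(domains):
        for h in range(hosts_per_domain):
            # Placeholder naming scheme, purely illustrative.
            fqdn = f"host{h}.lab{d}.example.test."
            for rrtype, pool in (("A", v4_pool), ("AAAA", v6_pool)):
                address = str(next(pool))
                yield fqdn, rrtype, address
                owner, target = ptr_record(address, fqdn)
                yield owner, "PTR", target
```

For example, list(generate_load(200, 50, "10.0.0.0/16", "2001:db8::/64")) would produce roughly the volume described in this thread; the numbers are only a guess at a comparable load.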
What's really odd is bind9 / named keeps blasting out change
notifications for some of the updated domains, then a few lines
later, with no intervening activity in any log or by any program
affecting the zone, will publish further change notifications with a
new serial number for the same zone. This happens for all the zones
that get modifications. I'm thinking 'rr' computations? I wonder
if those entries-- being auto-generated internally -- are creating a
'flow control' issue between the primary and replica.
This is something that retro changelog is responsible for as it is the
data store used by the syncrepl protocol implementation. If these
'changes' appear again and again, it means retro changelog plugin marks
them as new for this particular syncrepl client (bind-dyndb-ldap).
All threads other than thread 30 are normal (idle) threads, but
this one blocks the database backend in the log flush sequence while
writing the retro changelog entry for this updated DNS record:
Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0 0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#1 0x00007f0e91cbe6b5 in __os_fsync () at target:/lib64/libdb-5.3.so
#2 0x00007f0e91ca598c in __log_flush_int () at
target:/lib64/libdb-5.3.so
#3 0x00007f0e91ca7dd0 in __log_flush () at target:/lib64/libdb-5.3.so
#4 0x00007f0e91ca7f73 in __log_flush_pp () at
target:/lib64/libdb-5.3.so
#5 0x00007f0e8afe1304 in bdb_txn_commit (li=<optimized out>,
txn=0x7f0e583fd028, use_lock=1) at
ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:2772
#6 0x00007f0e8af95515 in dblayer_txn_commit (be=0x7f0e88424f00,
txn=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:736
#7 0x00007f0e8afa7ebe in ldbm_back_add (pb=0x7f0e85748860) at
ldap/servers/slapd/back-ldbm/ldbm_add.c:1242
#8 0x00007f0e9d7d7728 in op_shared_add (pb=pb@entry=0x7f0e85748860)
at ldap/servers/slapd/add.c:692
#9 0x00007f0e9d7d7bbe in add_internal_pb
(pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:407
#10 0x00007f0e9d7d8975 in slapi_add_internal_pb
(pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:331
#11 0x00007f0e8960f8bf in write_replog_db (newsuperior=0x0,
modrdn_mods=0x0, newrdn=0x0, post_entry=<optimized out>,
log_e=0x7f0e4df5b9c0, curtime=1691511446, flag=0,
log_m=0x7f0e5ec0d440, dn=0x7f0e57a40740
"idnsname=8.0.f.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.,cn=dns,dc=1,dc=quietfountain,dc=com",
optype=<optimized out>, pb=0x7f0e66a09580) at
ldap/servers/plugins/retrocl/retrocl_po.c:369
#12 0x00007f0e8960f8bf in retrocl_postob (pb=0x7f0e66a09580,
optype=<optimized out>) at
ldap/servers/plugins/retrocl/retrocl_po.c:697
#13 0x00007f0e9d83cc79 in plugin_call_func (list=0x7f0e924aae00,
operation=operation@entry=561, pb=pb@entry=0x7f0e66a09580,
call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2032
#14 0x00007f0e9d83cec4 in plugin_call_list (pb=0x7f0e66a09580,
operation=561, list=<optimized out>) at
ldap/servers/slapd/plugin.c:1973
#15 0x00007f0e9d83cec4 in plugin_call_plugins
(pb=pb@entry=0x7f0e66a09580, whichfunction=whichfunction@entry=561)
at ldap/servers/slapd/plugin.c:442
#16 0x00007f0e8afc3658 in ldbm_back_modify (pb=<optimized out>) at
ldap/servers/slapd/back-ldbm/ldbm_modify.c:1002
#17 0x00007f0e9d828300 in op_shared_modify
(pb=pb@entry=0x7f0e66a09580, pw_change=pw_change@entry=0,
old_pw=0x0) at ldap/servers/slapd/modify.c:1025
#18 0x00007f0e9d829a00 in do_modify (pb=pb@entry=0x7f0e66a09580) at
ldap/servers/slapd/modify.c:380
#19 0x0000564ed703475b in connection_dispatch_operation
(pb=0x7f0e66a09580, op=<optimized out>, conn=<optimized out>) at
ldap/servers/slapd/connection.c:651
#20 0x0000564ed703475b in connection_threadmain (arg=<optimized
out>) at ldap/servers/slapd/connection.c:1803
#21 0x00007f0e9a24b968 in _pt_root () at target:/lib64/libnspr4.so
#22 0x00007f0e99be61ca in start_thread () at
target:/lib64/libpthread.so.0
#23 0x00007f0e9be90e73 in clone () at target:/lib64/libc.so.6
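When a pstack dump has dozens of threads, picking out the one that is not idle by eye gets tedious. The sketch below is one way to summarise such a dump: it records the innermost frame (#0) of each thread and groups threads by that function, so a lone thread sitting in something like fdatasync stands out. It assumes the "Thread N (Thread 0x... (LWP n)):" and "#0 0x... in func ()" layout seen in the trace above; the function names are invented for illustration.

```python
import re

THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-f]+ in )?(\S+)")


def top_frames(pstack_text):
    """Return {thread_number: (lwp, innermost_function)} from pstack output."""
    result = {}
    current = None
    for raw in pstack_text.splitlines():
        line = raw.strip()
        match = THREAD_RE.match(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)))
            continue
        match = FRAME0_RE.match(line)
        if match and current is not None:
            result[current[0]] = (current[1], match.group(1))
            current = None  # ignore anything else until the next thread header
    return result


def group_by_function(frames):
    """Return {innermost_function: [thread_numbers]} to spot the odd one out."""
    groups = {}
    for thread, (_lwp, func) in frames.items():
        groups.setdefault(func, []).append(thread)
    return groups
```

Feeding it the contents of the file written by the pstack command earlier in this thread, group_by_function(top_frames(text)) would typically show most threads under one or two wait functions and the blocked one on its own.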
Mark, Thierry, any hints here? (For full trace see thread
https://lists.fedorahosted.org/archives/list/[email protected]/thread/TMRXHCORFU3QRQL6FSZTS4OIHYOAVXWF/)
_______________________________________________
FreeIPA-users mailing list -- [email protected]