~nikolay.vinogradov and I found the root cause after a long debugging
session.
What we have:

* a bionic-ussuri deployment with the HWE (focal 5.4) kernel;

* 2 processes communicating over a type=STREAM unix socket (blocking);
* Unix sockets have max buffer limits which means the amount of data that can 
be written there before a blocking occurs is far from infinite: 
https://elixir.bootlin.com/linux/v5.4.67/source/include/net/sock.h#L2557

* the privsep daemon process is the server which responds to a client
(neutron-openvswitch-agent) request "get_devices_with_ip"

   * File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 
1364, in get_devices_with_ip
   * File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in 
remote_call

* when debug logging is enabled, the privsep daemon returns a log
message to be printed by the caller to its log which includes the full
reply;


* the client has a separate reader thread which reads messages from the unix 
socket in a loop and sets results for previously created Futures associated 
with message ids. The original thread that initiates a request waits for a 
result associated with the Future to appear;

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L120-L124
 (thread creation)
https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L130-L155
 (the thread's main function)

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L172
(the code that waits for the result of a future to appear).

* msgpack is used for serialization of data sent of the unix socket;

* the node has hardware offloading enabled and there are many network
interfaces registered with the kernel. The `ip addr show` output alone
has 764 lines:

root@node2:~# ip a s | wc -l
764

* The log message that the privsep daemon attempts to send is 1483120
bytes = 1.414 MiB (not counting the overhead).

https://paste.ubuntu.com/p/Sf2wrcpWvB/
(Pdb) where
  /usr/bin/privsep-helper(10)<module>()
-> sys.exit(helper_main())
  /usr/lib/python3/dist-packag
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit()
-> self.channel.send((None, (Message.LOG, data)))
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(202)send()
-> self.writer.send(msg)

(Pdb) len(str(msg))
1483120


* The client reader thread tries to receive this message over the unix socket 
but receives an error from msgpack which does not like the message size:

(Pdb) l
131             import rpdb ; rpdb.set_trace()
132             with self.lock:
133                 self.running = True
134             for msg in reader:
135                 msgid, data = msg
136  ->             if msgid is None:
137                     self.out_of_band(data)
138                 else:
139                     with self.lock:
140                         if msgid not in self.outstanding_msgs:
141                             raise AssertionError("msgid should in "
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main()
-> self.out_of_band(data)
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main()
-> for msg in reader:
(Pdb) n
ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576)


https://git.launchpad.net/ubuntu/+source/python-msgpack/tree/msgpack/unpack.h?h=ubuntu/focal#n229
        PyErr_Format(PyExc_ValueError, "%u exceeds max_str_len(%zd)", l, 
u->max_str_len);
https://github.com/msgpack/msgpack-python/blob/v0.6.2/msgpack/unpack.h#L226-L229


* The above exception never gets logged by neutron-openvswitch-agent and the 
requesting thread is stuck while waiting for a result from a Future object.

** Also affects: python-oslo.privsep (Ubuntu)
   Importance: Undecided
       Status: New

** Summary changed:

- A privsep daemon hangs when debug logging is enabled
+ A privsep daemon spawned by neutron-openvswitch-agent hangs when debug 
logging is enabled (large number of registered NICs) - an RPC response is too 
large for msgpack

** Changed in: charm-neutron-openvswitch
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
  logging is enabled (large number of registered NICs) - an RPC response
  is too large for msgpack

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to