On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai <ktk...@virtuozzo.com> writes:
>
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktk...@virtuozzo.com> writes:
>>>
>>>> Currently a mutex is used to protect the pernet operations list. It makes
>>>> cleanup_net() execute the ->exit methods of the same set of operations
>>>> that was in place at the time of ->init, even after the net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> But the problem is that we need to synchronize_rcu() after the net is
>>>> removed from net_namespace_list():
>>>>
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>>   mutex_lock(&net_mutex)
>>>>   list_del_rcu(&net->list)
>>>>   synchronize_rcu()                              <--- Sleep there for ages
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_exit_list(ops, &net_exit_list)
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_free_list(ops, &net_exit_list)
>>>>   mutex_unlock(&net_mutex)
>>>>
>>>> This primitive is not fast, especially on systems with many processors
>>>> and/or when preemptible RCU is enabled in the config. So, all the time
>>>> cleanup_net() is waiting for an RCU grace period, creation of new net
>>>> namespaces is not possible; the tasks that try to do it sleep on the
>>>> same mutex:
>>>>
>>>> Create net_ns:
>>>> copy_net_ns()
>>>>   mutex_lock_killable(&net_mutex)                <--- Sleep there for ages
>>>>
>>>> The solution is to convert net_mutex to an rw_semaphore. Then the
>>>> pernet_operations::init/::exit methods, which modify the net-related
>>>> data, will require down_read() locking only, while down_write() will be
>>>> used for changing pernet_list.
>>>>
>>>> This gives a significant performance increase, as you may see below.
>>>> Measured is sequential net namespace creation in a loop, in a single
>>>> thread, without other tasks (single user mode):
>>>>
>>>> 1) int main(int argc, char *argv[])
>>>>    {
>>>>            unsigned nr;
>>>>
>>>>            if (argc < 2) {
>>>>                    fprintf(stderr, "Provide nr iterations arg\n");
>>>>                    return 1;
>>>>            }
>>>>            nr = atoi(argv[1]);
>>>>            while (nr-- > 0) {
>>>>                    if (unshare(CLONE_NEWNET)) {
>>>>                            perror("Can't unshare");
>>>>                            return 1;
>>>>                    }
>>>>            }
>>>>            return 0;
>>>>    }
>>>>
>>>>    Origin, 100000 unshare():
>>>>    0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>
>>>>    Patched, 100000 unshare():
>>>>    0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>
>>>> 2) for i in {1..10000}; do unshare -n bash -c exit; done
>>>>
>>>>    Origin:
>>>>    real 1m24,190s
>>>>    user 0m6,225s
>>>>    sys  0m15,132s
>>>>
>>>>    Patched:
>>>>    real 0m18,235s   (4.6 times faster)
>>>>    user 0m4,544s
>>>>    sys  0m13,796s
>>>>
>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add
>>>> down_read_killable()" from Linus' tree (not in net-next yet).
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the sing
>>>
>>> I am not wild about taking the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier. I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking to add.
>>>
>>> There are about 204 instances of struct pernet_operations. That is a
>>> lot of code to have carefully audited to ensure it can all run in
>>> parallel at once. The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and the taking of net_mutex in rtnl_link_unregister all
>>> testify to the fact that there are data structures accessed by multiple
>>> network namespaces.
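To make concrete what would have to be audited: each of those ~204 instances
boils down to something like the toy module below (an illustrative sketch
only, not taken from my patch; all names are made up). After the conversion
its ->init/->exit could run for different net namespaces in parallel, while
register/unregister would remain globally serialized by down_write().

/* example_pernet.c - illustrative pernet_operations user (sketch only) */
#include <linux/module.h>
#include <linux/printk.h>
#include <net/net_namespace.h>

static int __net_init example_net_init(struct net *net)
{
	/* allocate and initialize per-netns state, hang it off 'net' */
	pr_info("example: init for netns %p\n", net);
	return 0;
}

static void __net_exit example_net_exit(struct net *net)
{
	/* tear down per-netns state; with the rwsem this may run
	 * concurrently with init/exit of *other* net namespaces */
	pr_info("example: exit for netns %p\n", net);
}

static struct pernet_operations example_net_ops = {
	.init = example_net_init,
	.exit = example_net_exit,
};

static int __init example_init(void)
{
	/* registration becomes the down_write() side after the conversion */
	return register_pernet_subsys(&example_net_ops);
}

static void __exit example_cleanup(void)
{
	unregister_pernet_subsys(&example_net_ops);
}

module_init(example_init);
module_exit(example_cleanup);
MODULE_LICENSE("GPL");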
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor struct pernet_ops out of struct pernet_operations. With
>>>   struct pernet_ops not having the exit_batch method. With pernet_ops
>>>   being embedded as an anonymous member of the old struct
>>>   pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops and that don't take net_mutex. Have them order the
>>>   pernet_list as:
>>>
>>>     pernet_sys
>>>     pernet_subsys
>>>     pernet_device
>>>     pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts on the plan below.
>>
>> The only difficult thing there will be choosing the right order in which
>> to move ops from pernet_subsys to pernet_sys and from pernet_device to
>> pernet_dev, one by one.
>>
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important ordering is the dependencies between
>> them. So it's possible to start from a module that has no dependencies,
>> move it to pernet_sys, and then continue with modules that have no more
>> dependencies left in pernet_subsys. For pernet_device it's vice versa.
>>
>> In case of bool drivers, the situation may be worse, because the order is
>> not so clear there. Initcalls of the same priority (for example, initcalls
>> registered via core_initcall()) may require a certain order, driven by
>> the linking order. I know one example from the device mapper code, which
>> lives here: drivers/md/Makefile. This problem is also solvable, even if
>> such places do not contain comments about the linking order. We just need
>> to respect the Makefile order when choosing a new candidate to move.
>>
>>> I think I would enforce the ordering with a failure to register
>>> if a subsystem or a device tried to register out of order.
>>>
>>> - Disable use of the single threaded workqueue if nothing needs the
>>>   net_mutex.
>>
>> We may use per-cpu workqueues in the future. The idea of refusing to use
>> a workqueue doesn't seem good to me, because asynchronous net destruction
>> looks very useful.
>
> per-cpu workqueues are fine, and definitely what I am expecting. If we
> are doing this I want to get us off the single threaded workqueue that
> serializes all of the cleanup. That has a huge potential for
> simplifying things and reducing maintenance if running everything in
> parallel is actually safe.
>
> I forget how the modern per-cpu workqueues work with respect to sleeps
> and locking (I don't remember whether, when a piece of work sleeps for a
> long time, we spin up another per-cpu workqueue thread, and thus avoid
> priority inversion problems).
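For reference, the single threaded workqueue serializing the cleanup is this
one; if I read the current net/core/net_namespace.c correctly, the relevant
lines are (abridged, quoting from memory):

	static struct workqueue_struct *netns_wq;
	static DECLARE_WORK(net_cleanup_work, cleanup_net);

	void __put_net(struct net *net)
	{
		/* Cleanup the network namespace in process context */
		unsigned long flags;

		spin_lock_irqsave(&cleanup_list_lock, flags);
		list_add(&net->cleanup_list, &cleanup_list);
		spin_unlock_irqrestore(&cleanup_list_lock, flags);

		queue_work(netns_wq, &net_cleanup_work);
	}

	/* in net_ns_init(): */
	netns_wq = create_singlethread_workqueue("netns");

To run cleanups in parallel we would have to drop both the ordered workqueue
and the single global work item: each net (or batch of nets) would need its
own work item, queued to an ordinary concurrency-managed workqueue such as
system_unbound_wq.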
As far as I know, there is no problem for a worker thread to sleep, as there
is a pool: right after one leaves the cpu runqueue, another worker wakes up.

> If locks between workqueues are not a problem we could start the
> transition off of the single-threaded serializing workqueue sooner
> rather than later.
>
>>> - Add a test mode that deliberately spawns threads on multiple
>>>   processors and deliberately creates multiple network namespaces
>>>   at the same time.
>>>
>>> - Add a test mode that deliberately spawns threads on multiple
>>>   processors and deliberately destroys multiple network namespaces
>>>   at the same time.
>>>
>>> - Convert the code to unlocked operation one pernet_operations at a
>>>   time, being careful with the loopback device: its order in the list
>>>   strongly matters.
>>>
>>> - Finally remove the unnecessary code.
>>>
>>> At the end of the day, because all of the operations for one network
>>> namespace will run in parallel with all of the operations for another
>>> network namespace, all of the sophistication that goes into batching
>>> the cleanup of multiple network namespaces can be removed, as different
>>> tasks (not sharing a lock) can wait in synchronize_rcu at the same time
>>> without slowing each other down.
>>>
>>> I think we can remove the batching but I am afraid that will lead us
>>> into rtnl_lock contention.
>>
>> I've looked into this lock. It is used in many places for many reasons.
>> It's a little strange that nobody has tried to break it up into several
>> smaller locks.
>
> It gets tricky. At this point getting net_mutex is enough to start
> with. Mostly the rtnl_lock covers the slow path, which tends to keep it
> from rising to the top of the priority list.
>
>>> I am definitely not comfortable with changing the locking on so much
>>> code without any explanation of why it is safe in the commit comments
>>> of all 204 instances, which combined make up most of the well maintained
>>> and regularly used part of the networking stack.
>>
>> David,
>>
>> could you please check the plan above and say whether 1) it's OK for you,
>> and 2) if so, whether you expect all the changes to be made in one big
>> 200+ patch set, or we may go sequentially (50 patches at a time, for
>> example)?
>
> I think I would start with something covering the core networking
> pieces so the improvements can be seen.
>
> Eric
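Regarding the test modes you suggest above: I can extend the benchmark from
the patch description into a threaded version along these lines (a rough
sketch; the thread and iteration counts are arbitrary). Every thread
repeatedly unshares its netns, so namespaces are created and (asynchronously)
destroyed concurrently on several cpus:

/* parallel netns create/destroy stress test (sketch); build with -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_THREADS	8
#define NR_ITER		10000

static void *worker(void *arg)
{
	unsigned nr = NR_ITER;

	while (nr-- > 0) {
		/* each unshare() creates a new netns and drops the previous
		 * one, so creation and destruction overlap across threads */
		if (unshare(CLONE_NEWNET)) {
			perror("Can't unshare");
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&th[i], NULL, worker, NULL)) {
			fprintf(stderr, "Can't create thread\n");
			return 1;
		}
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(th[i], NULL);

	return 0;
}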