Rick Jones <rick.jon...@hpe.com> writes:

> On 07/07/2016 09:34 AM, Eric W. Biederman wrote:
>> Rick Jones <rick.jon...@hpe.com> writes:
>>> 300 routers is far from the upper limit/goal.  Back in HP Public
>>> Cloud, we were running as many as 700 routers per network node (*),
>>> and more than four network nodes.  (back then it was just the one
>>> namespace per router and network).  Mileage will of course vary based
>>> on the "oomph" of one's network node(s).
>>
>> To clarify, processes for these routers and dhcp servers are created
>> with "ip netns exec"?
>
> I believe so, but it would be good to have someone else confirm that,
> and speak to your paragraph below.
>> If that is the case and you are using this feature as effectively a
>> lightweight container and not lots of vrfs in a single network stack,
>> then I suspect much larger gains can be had by creating a variant of
>> "ip netns exec" that avoids the mount propagation.
>
> ...
>
>>> * Didn't want to go much higher than that because each router had a
>>> port on a common linux bridge and getting to > 1024 would be an
>>> unpleasant day.
>>
>> * I would have thought all you have to do is bump up the size
>> of the linux neighbour cache.  echo $BIGNUM >
>> /proc/sys/net/ipv4/neigh/default/gc_thresh3
>
> We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

Silly linux bridge.  I haven't run into that one.

> Having a bit of deja vu but I suspect things like commit
> 0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same
> wavelength, just my brain seeing "namespaces" and "performance" and
> lighting up :)

Actually that could still be relevant.  100,000 or so mount entries is
larger than the 16384 mount hash entries on the machine I am looking
at, which gives an expected average hash chain length of about 6.  So
it might be worth playing with the mhash= and mphash= kernel command
line parameters and seeing if upping the count helps.  For upstream it
is probably very much worth looking at making the mount hash an
rhashtable so it grows to the size that is needed.

I looked a little more and I see where the double mounts are coming
from.  Because "ip netns" creates /var/run/netns as a local bind mount
of itself, we get one copy of the mounts below the bind mount and
another copy above.  Ugh.

Unfortunately I think the way the first patch solves this (by breaking
mount propagation with the parent) will fail to do the right thing in
cases where "ip netns add" is called from a mount namespace with just a
private /tmp, like the ones systemd creates to run services in.  If
mount propagation is broken by making the bind mount private, I can't
see how the network namespace file descriptor mounts would propagate
to the rest of the ordinary mount namespaces in the system.

Unfortunately the semantics of the mount propagation directives were
not designed for easy use.  It seems extremely easy to do the wrong
thing.

So I think the correct way to avoid double mounts, and to safely and
reliably do what patch 1 is trying to do, is to read
/proc/self/mountinfo and see if /var/run/netns is under a shared mount
point (possibly itself).  If so, go on to create the mount point for
the netns file descriptor.  Otherwise make /var/run/netns a bind mount
to itself and ensure it is marked MS_SHARED.

Effectively that is runtime detection of systemd.  But since it keys
off of what is actually happening on the system, it will work in
whatever strange environment "ip netns" happens to be run in.

Eric
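
For concreteness, here is a rough, untested sketch of the detection
described above, written in the style of iproute2's C sources.  The
helper names are invented for illustration, the mountinfo parsing is
deliberately simplistic, and the /var/run/netns directory is assumed to
already exist -- this is not the actual patch, just the shape of the
check:

/*
 * Sketch only: detect whether /var/run/netns is already under a shared
 * mount, and if not, turn it into a self bind mount marked MS_SHARED.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

#define NETNS_RUN_DIR "/var/run/netns"

/* Is NETNS_RUN_DIR under a shared mount (possibly itself)?  1/0/-1. */
static int netns_dir_is_shared(void)
{
	FILE *f = fopen("/proc/self/mountinfo", "r");
	char line[4096], mnt[4096];
	size_t best = 0;
	int shared = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* fields: id parent maj:min root MOUNTPOINT opts [optional]... */
		if (sscanf(line, "%*d %*d %*d:%*d %*s %4095s", mnt) != 1)
			continue;
		size_t len = strlen(mnt);
		/* keep the longest mount point that is a prefix of NETNS_RUN_DIR */
		if (strncmp(NETNS_RUN_DIR, mnt, len) != 0)
			continue;
		if (len > 1 && NETNS_RUN_DIR[len] != '/' && NETNS_RUN_DIR[len] != '\0')
			continue;
		if (len < best)
			continue;
		best = len;
		/* the "shared:N" peer group tag lives in the optional fields */
		shared = strstr(line, " shared:") != NULL;
	}
	fclose(f);
	return shared;
}

/* Make NETNS_RUN_DIR a bind mount of itself and mark it MS_SHARED. */
static int netns_dir_make_shared(void)
{
	if (mount(NETNS_RUN_DIR, NETNS_RUN_DIR, "none", MS_BIND, NULL))
		return -errno;
	if (mount("", NETNS_RUN_DIR, "none", MS_SHARED | MS_REC, NULL))
		return -errno;
	return 0;
}

/* What "ip netns add" could call before mounting the netns fd. */
int ensure_netns_dir_shared(void)
{
	int shared = netns_dir_is_shared();

	if (shared < 0)
		return -errno;
	if (shared)
		return 0;	/* propagation already set up, e.g. by systemd */
	return netns_dir_make_shared();
}

The point of the sketch is only that the decision keys off the
propagation state actually visible in /proc/self/mountinfo, rather than
guessing whether systemd (or any particular init) is present.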