On 16.03.2018 02:46, Christian Brauner wrote: > On Thu, Mar 15, 2018 at 05:14:13PM +0300, Kirill Tkhai wrote: >> On 15.03.2018 16:39, Christian Brauner wrote: >>> On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote: >>>> CC Andrey Vagin >>> >>> Hey Kirill, >>> >>> Thanks for CCing Andrey. >>> >>>> >>>> On 15.03.2018 03:12, Christian Brauner wrote: >>>>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets >>>>> to allow sending uevent messages into the network namespace the socket >>>>> belongs to. >>>>> >>>>> Currently non-initial network namespaces are already isolated and don't >>>>> receive uevents. There are a number of cases where it is beneficial for a >>>>> sufficiently privileged userspace process to send a uevent into a network >>>>> namespace. >>>>> >>>>> One such use case would be debugging and fuzzing of a piece of software >>>>> which listens and reacts to uevents. By running a copy of that software >>>>> inside a network namespace, specific uevents could then be presented to >>>>> it. >>>>> More concretely, this would allow for easy testing of udevd/ueventd. >>>>> >>>>> This will also allow some piece of software to run components inside a >>>>> separate network namespace and then effectively filter what that software >>>>> can receive. Some examples of software that do directly listen to uevents >>>>> and that we have in the past attempted to run inside a network namespace >>>>> are rbd (CEPH client) or the X server. >>>>> >>>>> Implementation: >>>>> The implementation has been kept as simple as possible from the kernel's >>>>> perspective. Specifically, a simple input method uevent_net_rcv() is added >>>>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing >>>>> af_netlink infrastructure and does neither add an additional netlink >>>>> family >>>>> nor requires any user-visible changes. >>>>> >>>>> For example, by using netlink_rcv_skb() we can make use of existing >>>>> netlink >>>>> infrastructure to report back informative error messages to userspace. >>>>> >>>>> Furthermore, this implementation does not introduce any overhead for >>>>> existing uevent generating codepaths. The struct netns gets a new uevent >>>>> socket member that records the uevent socket associated with that network >>>>> namespace. Since we record the uevent socket for each network namespace in >>>>> struct net we don't have to walk the whole uevent socket list. >>>>> Instead we can directly retrieve the relevant uevent socket and send the >>>>> message. This keeps the codepath very performant without introducing >>>>> needless overhead. >>>>> >>>>> Uevent sequence numbers are kept global. When a uevent message is sent to >>>>> another network namespace the implementation will simply increment the >>>>> global uevent sequence number and append it to the received uevent. This >>>>> has the advantage that the kernel will never need to parse the received >>>>> uevent message to replace any existing uevent sequence numbers. Instead it >>>>> is up to the userspace process to remove any existing uevent sequence >>>>> numbers in case the uevent message to be sent contains any. >>>>> >>>>> Security: >>>>> In order for a caller to send uevent messages to a target network >>>>> namespace >>>>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the >>>>> target network namespace. Additionally, any received uevent message is >>>>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space >>>>> needed to append the uevent sequence number. >>>>> >>>>> Testing: >>>>> This patch has been tested and verified to work with the following udev >>>>> implementations: >>>>> 1. CentOS 6 with udevd version 147 >>>>> 2. Debian Sid with systemd-udevd version 237 >>>>> 3. Android 7.1.1 with ueventd >>>>> >>>>> Signed-off-by: Christian Brauner <christian.brau...@ubuntu.com> >>>>> --- >>>>> include/net/net_namespace.h | 1 + >>>>> lib/kobject_uevent.c | 88 >>>>> ++++++++++++++++++++++++++++++++++++++++++++- >>>>> 2 files changed, 88 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h >>>>> index f306b2aa15a4..467bde763a9b 100644 >>>>> --- a/include/net/net_namespace.h >>>>> +++ b/include/net/net_namespace.h >>>>> @@ -78,6 +78,7 @@ struct net { >>>>> >>>>> struct sock *rtnl; /* rtnetlink socket */ >>>>> struct sock *genl_sock; >>>>> + struct sock *uevent_sock; /* uevent socket */ >>>> >>>> Since you add this per-net uevent_sock pointer, currently existing >>>> iterations in uevent_net_exit() >>>> become to look confusing. There are: >>>> >>>> mutex_lock(&uevent_sock_mutex); >>>> list_for_each_entry(ue_sk, &uevent_sock_list, list) { >>>> if (sock_net(ue_sk->sk) == net) >>>> goto found; >>>> } >>>> >>>> Can't we make a small cleanup in lib/kobject_uevent.c after this change >>>> and before the main part of the patch goes? >>> >>> Hm, not sure. It seems it makes sense to maintain them in a separate >>> list. Looks like this lets us keep the locking simpler. Otherwise we >>> would have to do something like for_each_net() and it seems that this >>> would force us to use rntl_{un}lock(). >> >> I'm about: >> >> mutex_lock(); >> list_del(net->ue_sk->list); >> mutex_unlock(); >> kfree(); >> >> Thus we avoid iterations in uevent_net_exit(). > > Ah right, but I only added struct sock *uevent_sock to struct net, not > struct uevent_sock. Thus there's no list member in there. I mainly only > added struct sock to keep it clean an aligned with the other sock > member. We can revisit this later though.
In my opinion, we shouldn't leave the code in inconsistent state just because of it's not interesting for this patch. There are two logical changes: 1)add fast-access pointer to struct net 2)implement single-net uevent It's easy to do 1), and this may help further reader not to be confused by the inconsistency, so I don't see a reason why we shouldn't do that. >> >>>> >>>>> >>>>> struct list_head dev_base_head; >>>>> struct hlist_head *dev_name_head; >>>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c >>>>> index 9fe6ec8fda28..10b2144b9fc3 100644 >>>>> --- a/lib/kobject_uevent.c >>>>> +++ b/lib/kobject_uevent.c >>>>> @@ -25,6 +25,7 @@ >>>>> #include <linux/uuid.h> >>>>> #include <linux/ctype.h> >>>>> #include <net/sock.h> >>>>> +#include <net/netlink.h> >>>>> #include <net/net_namespace.h> >>>>> >>>>> >>>>> @@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, >>>>> const char *format, ...) >>>>> EXPORT_SYMBOL_GPL(add_uevent_var); >>>>> >>>>> #if defined(CONFIG_NET) >>>>> +static int uevent_net_send(struct sock *usk, struct sk_buff *skb, >>>>> + struct netlink_ext_ack *extack) >>>>> +{ >>>>> + int ret; >>>>> + u64 seqnum; >>>>> + /* u64 to chars: 2^64 - 1 = 21 chars */ >>>> >>>> 18446744073709551615 -- 20 chars (+1 '\0'). Comment makes me think >>>> we forgot +1 in buf declaration. >>> >>> sizeof("SEQNUM=") will include the '\0' pointer in contrast to >>> strlen("SEQNUM=") so this is correct if I'm not completely mistaken. >> >> The code is OK, I'm worrying about comment. But I've missed this sizeof(). >> So, there is only 1 bytes excess allocated as >> 0xFFFFFFFFFFFFFFFF=18446744073709551615 >> Not so important... >> >>>> >>>>> + char buf[sizeof("SEQNUM=") + 21]; >>>>> + struct sk_buff *skbc; >>>>> + >>>>> + /* bump sequence number */ >>>>> + mutex_lock(&uevent_sock_mutex); >>>>> + seqnum = ++uevent_seqnum; >>>>> + mutex_unlock(&uevent_sock_mutex); >>>> >>>> Commit 7b60a18da393 from Andrew Vagin says: >>>> >>>> uevent: send events in correct order according to seqnum (v3) >>>> >>>> The queue handling in the udev daemon assumes that the events are >>>> ordered. >>>> >>>> Before this patch uevent_seqnum is incremented under sequence_lock, >>>> than an event is send uner uevent_sock_mutex. I want to say that code >>>> contained a window between incrementing seqnum and sending an event. >>>> >>>> This patch locks uevent_sock_mutex before incrementing uevent_seqnum. >>>> Signed-off-by: Andrew Vagin <ava...@openvz.org> >>>> >>>> After this change the order will be lost. Also the rest of namespaces >>>> will have holes in uevent numbers, as they won't receive a number sent >>>> to specific namespace. >>> >>> Afaict from looking into udevd when I wrote the patch it only cares >>> about numbers being ordered (which is also what Andrey's patch states) >>> not that they are sequential so holes should be fine. udevd will use >>> the DEVPATH to determine whether the sequence number of the current >>> uevent should be used as "even->delaying_seqnum" number. All that >>> matters is that it is greater than the previous one. About the ordering, >>> if that's an issue then we should simply do what Andrey has been doing >>> for kobject_uevent_env() and extend the lock being held until after we >>> sent out the uevent. Since we're not serving all listeners but only >>> ever one this is also way more lightweight then kobject_uevent_env(). >> >> Yes, extending the lock to netlink_broadcast() should fix the problem. > > Yep, send another version of the patch but forgot to CC you, sorry: > https://lkml.org/lkml/2018/3/15/1098 > > Thanks! > Christian > >> >>>> >>>> It seems we should made uevent_seqnum per-net to solve this problem. >>> >>> Yes, Eric and I have been discussing this already. The idea was to do >>> this in a follow-up patch to keep this patch as simple as possible. If >>> we agree that it would make sense right away then I will dig out the >>> corresponding patch. >>> It would basically just involve giving each struct net a uevent_seqnum >>> member. >> >> pernet seqnum may have a sense if we introduce per-net locks to protect it. >> If there is single mutex, it does not matter either they are pernet or not. >> Per-net mutex may be useful only if we have many single-net messages like >> you introduce in this patch, or if we are going to filter net in existing >> kobject_uevent_net_broadcast() (by user_ns?!) in the future. >> >> Kirill