On 9/13/19 5:59 PM, Eric W. Biederman wrote: > Yonghong Song <y...@fb.com> writes: > >> On 9/11/19 9:16 AM, Eric W. Biederman wrote: >>> Al Viro <v...@zeniv.linux.org.uk> writes: >>> >>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >>>>> >>>>> Carlos, >>>>> >>>>> Discussed with Eric today for what is the best way to get >>>>> the device number for a namespace. The following patch seems >>>>> a reasonable start although Eric would like to see >>>>> how the helper is used in order to decide whether the >>>>> interface looks right. >>>>> >>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >>>>> Author: Yonghong Song <y...@fb.com> >>>>> Date: Mon Sep 9 21:50:51 2019 -0700 >>>>> >>>>> nsfs: add an interface function ns_get_inum_dev() >>>>> >>>>> This patch added an interface function >>>>> ns_get_inum_dev(). Given a ns_common structure, >>>>> the function returns the inode and device >>>>> numbers. The function will be used later >>>>> by a newly added bpf helper. >>>>> >>>>> Signed-off-by: Yonghong Song <y...@fb.com> >>>>> >>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>>>> index a0431642c6b5..a603c6fc3f54 100644 >>>>> --- a/fs/nsfs.c >>>>> +++ b/fs/nsfs.c >>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >>>>> return ERR_PTR(-EINVAL); >>>>> } >>>>> >>>>> +/* Get the device number for the current task pidns. >>>>> + */ >>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >>>>> +{ >>>>> + *inum = ns->inum; >>>>> + *dev = nsfs_mnt->mnt_sb->s_dev; >>>>> +} >>>> >>>> Umm... Where would it get the device number once we get (hell knows >>>> what for) multiple nsfs instances? I still don't understand what >>>> would that be about, TBH... Is it really per-userns? Or something >>>> else entirely? Eric, could you give some context? >>> >>> My goal is not to paint things into a corner, with future changes. >>> Right now it is possible to stat a namespace file descriptor and >>> get a device and inode number. Then compare that. >>> >>> I don't want people using the inode number in nsfd as some magic >>> namespace id. >>> >>> We have had times in the past where there was more than one superblock >>> and thus more than one device number. Further if userspace ever uses >>> this heavily there may be times in the future where for >>> checkpoint/restart purposes we will want multiple nsfd's so we can >>> preserve the inode number accross a migration. >>> >>> Realistically there will probably just some kind of hotplug notification >>> to userspace to say we have hotplugged your operatining system as >>> a migration notification. >>> >>> Now the halway discussion did not quite capture everything I was trying >>> to say but it at least got to the right ballpark. >>> >>> The helper in fs/nsfs.c should be: >>> >>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) >>> { >>> return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); >>> } >>> >>> That way if/when there are multiple inodes identifying the same >>> namespace the bpf programs don't need to change. >> >> Thanks, Eric. This is indeed better. The bpf helper should focus >> on comparing dev/ino, instead of return the dev/ino to bpf program. >> >> So overall, nsfs related change will look like: >> >> diff --git a/fs/nsfs.c b/fs/nsfs.c >> index a0431642c6b5..7e78d89c2172 100644 >> --- a/fs/nsfs.c >> +++ b/fs/nsfs.c >> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd) >> return ERR_PTR(-EINVAL); >> } >> >> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) >> +{ >> + return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); >> +} >> + >> static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) >> { >> struct inode *inode = d_inode(dentry); >> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h >> index d31cb6215905..79639807e960 100644 >> --- a/include/linux/proc_ns.h >> +++ b/include/linux/proc_ns.h >> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct >> task_struct *task, >> typedef struct ns_common *ns_get_path_helper_t(void *); >> extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t >> ns_get_cb, >> void *private_data); >> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino); >> >> extern int ns_get_name(char *buf, size_t size, struct task_struct *task, >> const struct proc_ns_operations *ns_ops); >> >>> >>> Up farther in the stack it should be something like: >>> >>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) >>>> { >>>> return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); >>>> } >>>> >>>> const struct bpf_func_proto bpf_current_pidns_match_proto = { >>>> .func = bpf_current_pins_match, >>>> .gpl_only = true, >>>> .ret_type = RET_INTEGER >>>> .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, >>>> .arg2_type = ARG_PTR_TO_INODE_NUMBER, >>>> }; >>> >>> That allows comparing what the bpf came up with with whatever value >>> userspace generated by stating the file descriptor. >>> >>> >>> That is the least bad suggestion I currently have for that >>> functionality. It really would be better to not have that filter in the >>> bpf program itself but in the infrastructure that binds a program to a >>> set of tasks. >>> >>> The problem with this approach is whatever device/inode you have when >>> the namespace they refer to exits there is the possibility that the >>> inode will be reused. So your filter will eventually start matching on >>> the wrong thing. >> >> I come up with a differeent helper definition, which is much more >> similar to existing bpf_get_current_pid_tgid() and helper definition >> much more conforms to bpf convention. > > There is a problem with your bpf_get_ns_current_pid_tgid below. > The inode number is a 64bit number. To be nice to old userspace > we try and not use 64bit inode numbers where they are not required > but in this case we should not use an interface that assumes inode > numbers are 32bit. They just aren't. > > I didn't know how to express that in the bpf proto so I did what > I could.
We can change inum to u64. Just change the prototype like `u64, inum` should be good. > > The alternative to this would be to simply restrict this > helper to bpf programs registered in the initial pid namespace. > At which point you could just ensure all the numbers are in > the global pid namespace. > > Hmm. Looing at the comment below I am confused. > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index 5e28718928ca..bc26903c80c7 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -11,6 +11,8 @@ >> #include <linux/uidgid.h> >> #include <linux/filter.h> >> #include <linux/ctype.h> >> +#include <linux/pid_namespace.h> >> +#include <linux/proc_ns.h> >> >> #include "../../lib/kstrtox.h" >> >> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = { >> .arg4_type = ARG_PTR_TO_LONG, >> }; >> #endif >> + >> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum) >> +{ >> + struct task_struct *task = current; >> + struct pid_namespace *pidns; >> + pid_t pid, tgid; >> + >> + if (unlikely(!task)) >> + return -EINVAL; >> + >> + >> + pidns = task_active_pid_ns(task); >> + if (unlikely(!pidns)) >> + return -ENOENT; >> + >> + if (!ns_match(&pidns->ns, (dev_t)dev, inum)) >> + return -EINVAL; >> + >> + pid = task_pid_nr_ns(task, pidns); >> + tgid = task_tgid_nr_ns(task, pidns); >> + >> + return (u64) tgid << 32 | pid; >> +} >> + >> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { >> + .func = bpf_get_ns_current_pid_tgid, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_ANYTHING, >> + .arg2_type = ARG_ANYTHING, >> +}; >> >> Existing usage of bpf_get_current_pid_tgid() can be converted >> to bpf_get_ns_current_pid_tgid() if ns dev/inode number >> is supplied. For bpf_get_ns_current_pid_tgid(), checking >> return value ( < 0 or not) is needed. > > Ok. I missed something. > > What is the problem bpf_get_ns_current_pid_tgid trying to solve > that bpf_get_current_pid_tgid does not solve. > > I would think since much of tracing ebpf is fundamentally restricted > to the global root user. Limiting the ebpf programs to the initial > pid namespace should not be a problem. > > So I don't understand why you need to specify the namespace in > the ebpf call. > > Can someone give me a clue what problem is being sovled by this > new call? We want to run the bpf program inside the namespace. There are parallel work to introduce CAP_BPF and CAP_TRACING (https://lore.kernel.org/bpf/20190906231053.1276792-1-...@kernel.org/T/#t) to facilitate this. We have users requesting to use bcc tools inside the containers. https://github.com/iovisor/bcc/issues/1875 https://github.com/iovisor/bcc/issues/1366 https://github.com/iovisor/bcc/issues/1329 https://github.com/iovisor/bcc/issues/1532 ... Yes, this may require granting `root` privilege to containers. This can be done outside container as well. But it is just a big usability improvement if people can do inside the containers. In addition, we have requests below and internal requests as well to filter based on containers. https://github.com/iovisor/bcc/issues/1119 The new helper permits at root that you can filter based on a particular container (not container id, but dev/inode should identifying one). Hope this clarify why this helper is useful for tracing community. > > Eric >