On Fri, Oct 05, 2018 at 05:46:59AM +0100, Al Viro wrote: > On Wed, Oct 03, 2018 at 07:57:45PM -0700, Alexei Starovoitov wrote: > > > @@ -15,6 +15,7 @@ > > #include <linux/bpf.h> > > #include <linux/bpf-cgroup.h> > > #include <net/sock.h> > > +#include <../fs/mount.h> > > No.
I've considered splitting cgroup_file_filter_ctx_access() into separate .c inside fs/ directory, but it felt odd to move just that function whereas the rest of the bpf bits are in kernel/bpf/ only to avoid doing this "../fs/" hack. How about calling this new file fs/bpf_file_filter.c ? > > + struct file *file = NULL; > > + struct inode *inode; > > + struct super_block *sb; > > + struct mount *mnt; > > Fuck, no. > > > + case offsetof(struct bpf_file_info, mnt_id): > > + /* dst = real_mount(file->f_path.mnt)->mnt_id */ > > + mnt = real_mount(LD_1(file->f_path.mnt)); > > + LD_n(mnt->mnt_id); > > NAK. Anything in struct mount is private to just a couple of > files in fs/*.c. Don't do that. And keep in mind that internal > details can and will be changed at zero notice, so be careful > with adding such stuff. yes. The internal details of file and inode structs can and will change. Just like all the sk_buff internals that are exposed to bpf via the same context rewriting mechanism. See commit 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields") that exposed first 4 fields out of sk_buff to bpf progs via 'struct __sk_buff'. Other fields were exposed later. Some of them are not even from sk_buff. For networking programs 'struct __sk_buff' is the context. For this new file_filter programs 'struct bpf_file_info' is the context. That's the only thing bpf side can see. > Another problem is your direct poking in ->i_ino. It's not > something directly exposed to userland at the moment and it should > not become such. The patch is not making i_ino directly exposed. Only 'struct bpf_file_info' is exposed to user space / bpf programs. In 'struct bpf_file_info' the field is called '__u64 inode'. That is the abi. The kernel side stays with 'unsigned long i_ino;' field. Its size is different on 32/64 architectures, but to bpf progs it's always seen as '__u64 inode'. If in the future the kernel decides to change 'i_ino' to be u64, nothing will change on bpf side. > Filesystem has every right to have ->getattr() > set ->ino (== ->st_ino value) in whichever way it likes; Essentially what cgroup_file_filter_ctx_access() is doing is the same as generic_fillattr() + cp_statx() work, but via on the fly rewriting of bpf instructions during loading of bpf program. See my other reply to Andy where I argued that certain fields like uid/gid probably don't make sense to expose to bpf via ctx rewriter mechanism and instead they can be done via new bpf_get_file_statx() helper. That's future work. > the same > goes for ->dev. Notice how single kernel field file->f_inode->i_sb->s_dev is exposed to bpf via two fields dev_major and dev_minor inside 'struct bpf_file_info'. For dev_minor the ctx rewriting is done as: + case offsetof(struct bpf_file_info, dev_minor): + /* dst = file->f_inode->i_sb->s_dev */ + inode = LD_1(file->f_inode); + sb = LD_n(inode->i_sb); + LD_n(sb->s_dev); + *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, MINORMASK); + break; If the last AND instruction was not there, the whole 's_dev' field would have been exposed to bpf side and that would be questionable design choice, since s_dev has kernel internal encoding of major/minor. Instead the AND instruction is masking minor bits. So above ctx rewriting for 'dev_minor' field is equivalent to line stat->dev = inode->i_sb->s_dev; from generic_fillattr() plus another line tmp.stx_dev_minor = MINOR(stat->dev); from cp_statx() This way the kernel internal field 's_dev' is split into two dev_major/dev_minor bpf visible fields. The end result is that no new kernel information is exposed. The same information is seen via statx.