On Mon, Aug 8, 2016 at 5:00 PM, Sargun Dhillon <sar...@sargun.me> wrote:
> On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
>> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>> > I distributed this patchset to linux-security-mod...@vger.kernel.org
>> > earlier, but based on the fact that the archive is down, and this is a
>> > fairly broad-sweeping proposal, I figured I'd grow the audience a little
>> > bit. Sorry if you received this multiple times.
>> >
>> > I've begun building out the skeleton of a Linux Security Module, and I'd
>> > like to get feedback on it. It's a skeleton, and I've only populated a
>> > few hooks, so I'm mostly looking for input on the general proposal,
>> > interest, and design. It's a minor LSM. My particular use case is one in
>> > which containers are being dynamically deployed to machines by internal
>> > developers in a different group. The point of Checmate is to act as an
>> > extensible bed for _safe_, complex security policies. It's nice to
>> > enable dynamic security policies that can be defined in C, and change as
>> > necessary, without ever having to patch, or rebuild the kernel.
>> >
>> > For many of these containers, the security policies can be fairly
>> > nuanced. One particular one to take into account is network security.
>> > Oftentimes, administrators want to prevent ingress, and egress
>> > connectivity except from a few select IPs. Egress filtering can be
>> > managed using net_cls, but without modifying running software, it's
>> > non-trivial to attach a filter to all sockets being created within a
>> > container. The inet_conn_request, socket_recvmsg, socket_sock_rcv_skb
>> > hooks make this trivial to implement.
>> >
>> > Other times, containers need to be throttled in places where there's not
>> > really a good place to impose that policy for software which isn't built
>> > in-house.
>> > If one wants to limit file creations/sec, or reject I/O under certain
>> > characteristics, there's not a great place to do it now. This gives
>> > engineers a mechanism to write those policies.
>> >
>> > This same flexibility can be used to take existing programs and enable
>> > safe BPF helpers to modify memory to allow rules to pass. One example
>> > that I prototyped was Docker's port mapping, which has an overhead
>> > (DNAT), and there's some loss of fidelity in the BSD Socket API to
>> > identify what's going on. Instead, we can just rewrite the port in a
>> > bind, based upon some data in a BPF map, and a cgroup match.
>> >
>> > I can actually see other minor security modules being implemented in
>> > Checmate, for example, Yama, or the recently proposed Hardchroot could
>> > be reimplemented in BPF. Potentially, they could even be API compatible.
>> >
>> > Although, at first, much of this sounds like seccomp, it's quite
>> > different. For one, what we can do in the security hooks is more complex
>> > (access to kernel pointers). The other side of this is we can have
>> > effects on a system-wide, or cgroup level. This also circumvents the
>> > need for CRIU-friendly policies.
>> >
>> > Lastly, the flexibility of this mechanism allows for prevention of
>> > security vulnerabilities which are often complex in nature and require
>> > the interaction of multiple hooks (CVE-2014-9717 is a good example), and
>> > although ksplice, and livepatch exist, they're not always easy to use,
>> > as compared to loading a single bpf program across all kernels.
>> >
>> > The user-facing API is exposed via prctl as it's meant to be very simple
>> > (at least the kernel components). It only has three operations. For a
>> > given security hook, you can attach a BPF program to it, which will add
>> > it to the set of programs that are executed when the hook is hit.
>> > You can reset a hook, which removes all programs associated with a given
>> > hook, and you can set a deny_reset flag on a hook to prevent anyone from
>> > resetting it. It's likely that an individual would want to set this in
>> > any production use case.
>>
>> One fairly serious problem that seccomp had to overcome was dealing
>> with exec+setuid in the face of an attacker. The main example is "what
>> if we refuse to allow a program to drop privileges via a filter rule?"
>> For seccomp, no-new-privs was introduced for non-root users of
>> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
>> and it's a bit ungainly. :)
>>
> Couldn't someone do the same with SELinux, or AppArmor?

The "big" LSMs aren't defined programmatically by non-root users, so
there is no risk of elevating privileges (they are already root).

>> Also, if you have a prctl API that already has 3 operations, you might
>> want to use a new syscall anyway. :)
>>
> Looking at other LSMs, they appear to expose their API via a virtual
> filesystem, or prctl. I followed the model of YAMA. I think there may be
> two more operations (detach program, and mark a hook as append-only /
> read-only / disabled). It seems like overkill to implement my own syscall.
>
>> > On the BPF side of it, all that's involved in the work in progress is to
>> > move some of the tracing helpers into the shared helpers. For example,
>> > it's very valuable to have access to current when enforcing a hook.
>> > BPF programs also have access to maps, which somewhat works around
>> > the need for security blobs in some cases.
>>
>> Just from a compatibility perspective, doesn't this end up exposing
>> kernel structures to userspace? What happens when the structures
>> change?
>>
> I wouldn't consider BPF userspace. Although it executes in the kernel, I
> wouldn't really consider it kernel space either as it's restricted to safe
> operations.
>
> As far as addressing this issue -- A significant part of the LSM hooks API
> is tied to the syscall, giving stability to those datastructures.

Just for the sake of clarity: they're tied to internal callers, usually
near syscall entry points; LSMs can't filter syscalls.

> If you look at the API itself a significant part of it has been untouched
> for 3+ years, and it's been even longer since there has been an API
> breaking change. On the other hand, the developer has the ability to
> perform arbitrary reads of kernel space using bpf_probe_read.

What's hilarious is that the syscall API is unchanged, but the LSM API
keeps shifting around a little at a time. So, same issues as with
kprobes, etc, as you mention.

FWIW, I'd much rather have an LSM that reacts to seccomp filters and
maps syscall arguments to in-kernel data structures that can be
examined during an LSM hook. Then we'd have both a stable API and a
programmatic filtering of data structures.

> This is addressed in the 4th patch, which requires that the BPF program be
> compiled against the current kernel version. The userspace policy
> orchestration code should recompile the BPF program on the fly matching
> the current kernel's datastructures. There's a certain level of rope here
> given to the operator, and it's expected that they use it carefully.
> Similarly, folks could load kprobes, kmods, and other programs that have
> the same issues.

Right, perhaps I misunderstood the privilege level you were targeting. :)
Did you intend for unprivileged users to use this, or just the init-ns
root user?

>
>> And from a security perspective, programmatic examination of kernel
>> structures means you can trivially leak kernel memory locations and
>> contents. Resisting these sorts of leaks needs to be addressed too.
>>
> I'm unsure that unintentional exfiltration of kernel memory locations is
> possible. You may be able to via a BPF map or similar (logging). What kinds
> of attacks are you thinking about specifically?

Well, I was looking at the example you sent, and it seemed like it had
raw access to kernel pointers, which means it could be programmed to
leak the values.

>> This looks like a subset of kprobes but available to non-root users,
>> which looks rather scary to me at first glance. :)
> You need CAP_SYS_ADMIN to touch this. These folks are the same ones that
> control SELinux, and AppArmor.

Ah-ha, missed that. Still, we want to keep a bright line between uid-0
and ring-0, and to make sure this is just init-ns CAP_SYS_ADMIN.

-Kees

>
>>
>> -Kees
>>
>> >
>> > I would love to know what y'all think.
>> >
>> > Sargun Dhillon (4):
>> >   bpf: move tracing helpers to shared helpers
>> >   bpf, security: Add Checmate
>> >   security/checmate: Add Checmate sample
>> >   bpf: Restrict Checmate bpf programs to current kernel ABI
>> >
>> >  include/linux/bpf.h              |   2 +
>> >  include/linux/checmate.h         |  38 +++++
>> >  include/uapi/linux/Kbuild        |   1 +
>> >  include/uapi/linux/bpf.h         |   1 +
>> >  include/uapi/linux/checmate.h    |  65 +++++++++
>> >  include/uapi/linux/prctl.h       |   3 +
>> >  kernel/bpf/helpers.c             |  34 +++++
>> >  kernel/bpf/syscall.c             |   2 +-
>> >  kernel/trace/bpf_trace.c         |  33 -----
>> >  samples/bpf/Makefile             |   4 +
>> >  samples/bpf/bpf_load.c           |  11 +-
>> >  samples/bpf/checmate1_kern.c     |  28 ++++
>> >  samples/bpf/checmate1_user.c     |  54 +++++++
>> >  security/Kconfig                 |   1 +
>> >  security/Makefile                |   2 +
>> >  security/checmate/Kconfig        |   6 +
>> >  security/checmate/Makefile       |   3 +
>> >  security/checmate/checmate_bpf.c |  67 +++++++++
>> >  security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
>> >  19 files changed, 622 insertions(+), 37 deletions(-)
>> >  create mode 100644 include/linux/checmate.h
>> >  create mode 100644 include/uapi/linux/checmate.h
>> >  create mode 100644 samples/bpf/checmate1_kern.c
>> >  create mode 100644 samples/bpf/checmate1_user.c
>> >  create mode 100644 security/checmate/Kconfig
>> >  create mode 100644 security/checmate/Makefile
>> >  create mode 100644 security/checmate/checmate_bpf.c
>> >  create mode 100644 security/checmate/checmate_lsm.c
>> >
>> > --
>> > 2.7.4
>> >
>>
>>
>>
>> --
>> Kees Cook
>> Nexus Security

--
Kees Cook
Nexus Security
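
For readers following the thread, the three per-hook operations described in the proposal (attach a program, reset the hook, set deny_reset) can be modelled in a few lines of plain userspace C. This is only a sketch of the semantics as the cover letter states them, not code from the patchset: every name here (hook_state, hook_attach, hook_reset, hook_deny_reset, MAX_PROGS_PER_HOOK) is hypothetical, the fixed-size array and the error codes are assumptions, and plain integers stand in for BPF program file descriptors.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; the patchset may size this differently or not at all. */
#define MAX_PROGS_PER_HOOK 8

/* Hypothetical per-hook state: the attached programs plus the deny_reset flag. */
struct hook_state {
	int progs[MAX_PROGS_PER_HOOK];	/* stand-ins for BPF program fds */
	size_t nprogs;
	bool deny_reset;
};

/* Attach: append a program to the set that runs when the hook is hit. */
static int hook_attach(struct hook_state *h, int prog_fd)
{
	if (h->nprogs == MAX_PROGS_PER_HOOK)
		return -ENOSPC;		/* assumed error code */
	h->progs[h->nprogs++] = prog_fd;
	return 0;
}

/* Reset: drop every program on the hook, unless deny_reset was set. */
static int hook_reset(struct hook_state *h)
{
	if (h->deny_reset)
		return -EPERM;		/* assumed error code */
	h->nprogs = 0;
	return 0;
}

/* deny_reset: one-way flag; this model offers no way to clear it. */
static void hook_deny_reset(struct hook_state *h)
{
	h->deny_reset = true;
}
```

In this model, a zero-initialised struct hook_state accepts attaches and resets freely until hook_deny_reset() is called; after that hook_reset() fails and the attached programs stay in place, which is the behaviour the cover letter suggests a production deployment would want.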