> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <[email protected]> wrote:
>
>
>
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov
>> <[email protected]> wrote:
>
>
>>>
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>>
>> hmm. are you saying you want kernel community to work towards
>> making containers (namespaces) being able to run arbitrary code
>> downloaded from the internet?
>
> Yes.
>
> As an example, Sandstorm uses a combination of namespaces (user, network,
> mount, ipc) and a moderately permissive seccomp policy to run arbitrary code.
> Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other
> fairly heavyweight stacks can all run under Sandstorm, with the whole stack
> (database engine binaries, etc) supplied by entirely untrusted customers.
> During the time Sandstorm was under active development, I can recall *one*
> bug that would have allowed a sandbox escape. That’s a pretty good track
> record. (Also, Meltdown and Spectre, sigh.)
>
> To be clear, Sandstorm did not allow creation of a userns by the untrusted
> code, and Sandstorm would have heavily restricted bpf(), but that should only
> be necessary because of the possibility of kernel bugs, not because of the
> overall design.
>
> Alexei, I’m trying to encourage you to aim for something even better than you
> have now. Right now, if you grant a user various very strong capabilities,
> that user’s systemd can use bpf network filters. Your proposal would allow
> this with a different, but still very strong, set of capabilities. There’s
> nothing wrong with this per se, but I think you can aim much higher:
>
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over
> the system, *by design*. I’m suggesting that you engage the security
> community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for
> something better: make it so that a normal Linux distro would be willing to
> relax its settings enough so that normal users can use bpf filtering in the
> systemd units and maybe eventually use even more bpf() capabilities. And
> let’s make it so that mainstream container managers (that use userns!) will
> be willing (as an option) to delegate bpf() to their containers. We’re happy
> to help design, review, and even write code, but we need you to be willing to
> work with us to make a design that seems like it will work and then to wait
> long enough to merge it for us to think about it, try to poke holes in it,
> and convince ourselves and each other that it has a good chance of being
> sound.
>
> Obviously there will be many cases where an unprivileged program should *not*
> be able to use bpf() IP filtering, but let’s make it so that enabling these
> advanced features does not automatically give away the keys to the kingdom.
>
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand the different perspectives here.

Disclaimer: Alexei and I both work for Facebook. But he may disagree
with everything I am about to say below, because we haven't synced
about this for a while. :)

I think there are two types of use cases here:

1. CAP_BPF_ADMIN: one big key to all of sys_bpf().
2. CAP_BPF: a subset of sys_bpf() that is safe for containers.

IIUC, currently, CAP_BPF_ADMIN is (almost) the same as CAP_SYS_ADMIN.
And there aren't many real-world use cases for CAP_BPF.

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN.
On the other hand, Andy would like to introduce CAP_BPF and build
amazing use cases around it (chicken-egg problem).

Did I misunderstand anything?

If not, I think these two use cases do not really conflict with each
other, and we probably need both of them. Then the next question is
whether we really need both, or either, of them. Maybe having two
separate discussions would make it easier?

The following are some questions I am trying to understand for
the two cases.

For CAP_BPF_ADMIN (or /dev/bpf):

Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and
reusing an existing CAP_ should be easier than introducing a new one.

For CAP_BPF:

Do we really need it for containers? Is it possible to implement
all container use cases with SUID? At this moment, I think SUID is
the right way to go for this use case, because this is likely to
start with a small set of functionality. We can introduce CAP_BPF
when the container use case becomes too complicated for SUID.

I hope some of these questions/thoughts make sense.

Thanks,
Song