Hello,

Somewhere along the way I got the impression that it generally takes hours before affected systems lock up. I'm usually able to reproduce this issue much faster than that. Regardless, I can help test.

Are there any patches that need testing, or is this all still pending discussion about the best way to resolve the issue?

Thanks!

On 6/23/20 3:21 PM, Roman Gushchin wrote:
On Fri, Jun 19, 2020 at 08:31:14PM -0700, Cong Wang wrote:
On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lize...@huawei.com> wrote:
在 2020/6/20 8:45, Zefan Li 写道:
On 2020/6/20 3:51, Cong Wang wrote:
On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lize...@huawei.com> wrote:
On 2020/6/19 5:09, Cong Wang wrote:
On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <g...@fb.com> wrote:
On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lize...@huawei.com> wrote:
Cc: Roman Gushchin <g...@fb.com>

Thanks for fixing this.

On 2020/6/17 2:03, Cong Wang wrote:
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway; the in_interrupt() check
would terminate this function if it were called there. And in the
sk_alloc() path skcd->val is always zero, so it is safe to factor
out the code to make it more readable.
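
(Illustration only, not taken from the actual patch; the helper name
below is made up.) For the clone path, the idea is roughly:

/* Sketch of the idea, not the real diff: on the clone path the cgroup
 * pointer in skcd was just copied from the parent socket, so the
 * reference must be taken unconditionally, even if cgroup_sk_alloc
 * has been disabled in the meantime.
 */
static void cgroup_sk_clone_get(struct sock_cgroup_data *skcd)
{
	if (skcd->val) {
		struct cgroup *cgrp = sock_cgroup_ptr(skcd);

		/* pairs with cgroup_put()/cgroup_bpf_put() in cgroup_sk_free() */
		cgroup_get(cgrp);
		cgroup_bpf_get(cgrp);
	}
}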

Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
but I don't think the bug was introduced by this commit, because there
are already calls to cgroup_sk_alloc_disable() in write_priomap() and
write_classid(), which can be triggered by writing to ifpriomap or
classid in cgroupfs. This commit just made it much easier to trigger
with systemd involved.

I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf
from cgroup itself"), which added cgroup_bpf_get() in cgroup_sk_alloc().
Good point.

I took a deeper look, and it looks like commit d979a39d7242e06
is the one to blame, because it is the first commit that began to
hold the cgroup refcnt in cgroup_sk_alloc().
I agree, it seems that the issue is not related to bpf and probably
can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
seems closer to the origin.
Yeah, I will update the Fixes tag and send V2.

Commit d979a39d7242e06 looks innocent to me. With this commit, when
cgroup_sk_alloc is disabled and a socket is then cloned, the cgroup
refcnt will not be incremented, but this is fine, because when the
socket is to be freed:

  sk_prot_free()
    cgroup_sk_free()
      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)

cgroup_put() does nothing for the default root cgroup, so nothing bad
will happen.
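
(For reference -- as far as I can tell the default root's ->self css has
CSS_NO_REF set, so the put is skipped entirely; paraphrasing the helpers
from include/linux/cgroup.h:)

static inline void css_put(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))	/* set for the default root's css */
		percpu_ref_put(&css->refcnt);
}

static inline void cgroup_put(struct cgroup *cgrp)
{
	css_put(&cgrp->self);
}
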
But skcd->val can be a pointer to a non-root cgroup:
It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The
bug happens when cgroup_sk_alloc is disabled.

And please read those recent bug reports; they all happened when bpf
cgroup was in use, and there was no bpf cgroup when d979a39d7242e06
was merged into mainline.
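
To make the ordering concrete, here is a userspace toy model (plain C,
nothing kernel about it, names only loosely mirror the real ones) of the
sequence being discussed: a socket picks up a non-root cgroup pointer
while cgroup_sk_alloc is still enabled, write_classid()/write_priomap()
then disables it, and a later clone copies the pointer without taking a
reference, so the two frees drop one reference too many:

#include <stdio.h>

struct cgroup { int refcnt; };
struct sock { struct cgroup *cgrp; };

static int cgroup_sk_alloc_disabled;

static void sk_alloc_model(struct sock *sk, struct cgroup *cgrp)
{
	sk->cgrp = cgrp;
	if (!cgroup_sk_alloc_disabled)
		cgrp->refcnt++;			/* cgroup_sk_alloc() takes a ref */
}

static void sk_clone_model(struct sock *child, const struct sock *parent)
{
	*child = *parent;			/* sk_clone_lock() copies sk_cgrp_data */
	if (!cgroup_sk_alloc_disabled)		/* pre-fix: the get is skipped when disabled */
		child->cgrp->refcnt++;
}

static void sk_free_model(struct sock *sk)
{
	sk->cgrp->refcnt--;			/* cgroup_sk_free() always puts */
}

int main(void)
{
	struct cgroup cgrp = { .refcnt = 1 };	/* the hierarchy's own reference */
	struct sock parent, child;

	sk_alloc_model(&parent, &cgrp);		/* refcnt = 2 */
	cgroup_sk_alloc_disabled = 1;		/* write_classid()/write_priomap() */
	sk_clone_model(&child, &parent);	/* no get taken, still 2 */
	sk_free_model(&child);			/* 1 */
	sk_free_model(&parent);			/* 0: cgroup released while still in use */

	printf("final refcnt = %d (expected 1)\n", cgrp.refcnt);
	return 0;
}
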
I am totally aware of this. My concern is whether cgroup has the same
refcnt bug, as it always pairs with the bpf refcnt.

But, after a second look, the non-root cgroup refcnt is immediately
overwritten by sock_update_classid() or sock_update_netprioidx(),
which effectively turns into a root cgroup again. :-/

(It seems we leak a refcnt here, but this is not related to my patch).
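
(The "turns into a root cgroup again" part is because
sock_cgroup_set_classid()/sock_cgroup_set_prioidx() switch the skcd union
over to its classid/prioidx encoding, and for that encoding
sock_cgroup_ptr() falls back to the default root, so the reference taken
on the non-root cgroup is no longer reachable through skcd. Roughly, from
memory of cgroup-defs.h, modulo the config #ifdefs:)

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
	unsigned long v = READ_ONCE(skcd->val);

	/* low bit set: skcd carries prioidx/classid, not a cgroup pointer */
	if (v & 1)
		return &cgrp_dfl_root.cgrp;

	return (struct cgroup *)v ?: &cgrp_dfl_root.cgrp;
}
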
Yeah, I looked over this code, and I have the same suspicion.
Especially in sk_alloc(), where cgroup_sk_alloc() is followed by
sock_update_classid() and sock_update_netprioidx().

I also think your original patch is good, but there are probably
some other problems which it doesn't fix.

I looked over the cgroup bpf code again, and the only difference from
cgroup refcounting I see (besides the root cgroup, which is a non-issue)
is here:

void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
        ...
        while (true) {
                struct css_set *cset;

                cset = task_css_set(current);
                if (likely(cgroup_tryget(cset->dfl_cgrp))) {
                          ^^^^^^^^^^^^^^^
                        skcd->val = (unsigned long)cset->dfl_cgrp;
                        cgroup_bpf_get(cset->dfl_cgrp);
                        ^^^^^^^^^^^^^^
                        break;
                        ...

So, in theory, cgroup_bpf_get() can be called here after cgroup_bpf_release().
We might wanna introduce something like cgroup_bpf_tryget_live().
Idk if it can happen in reality, because it would require opening a new socket
in a deleted cgroup (without any other associated sockets).
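
(Just to sketch what that could look like -- this helper does not exist,
it's the hypothetical mentioned above, mirroring how cgroup_bpf_get()
uses the percpu ref:)

/* Hypothetical, not in the tree: a "live" variant of cgroup_bpf_get().
 * It fails once cgrp->bpf.refcnt has been killed, so the caller in
 * cgroup_sk_alloc() would have to handle that (e.g. retry or fall back
 * to the default root).
 */
static inline bool cgroup_bpf_tryget_live(struct cgroup *cgrp)
{
	return percpu_ref_tryget_live(&cgrp->bpf.refcnt);
}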

Other than that I don't see any differences between cgroup and cgroup bpf
reference counting.

Thanks!

PS I'll be completely offline till the end of the week. I'll answer all
e-mails on Monday (Jun 29th).
