Hello,

On Sun, Jun 14, 2020 at 5:39 AM Daniël Sonck <dsonc...@gmail.com> wrote:
>
> Hello,
>
> I found on the archive that this bug I encountered also happened to
> others. I too have a very similar stacktrace. The issue I'm
> experiencing is:
>
> Whenever I fully boot my cluster, in some time, the host crashes with
> the __cgroup_bpf_run_filter_skb NULL pointer dereference. This has
> been sporadic enough before not to cause real issues. However, as of
> lately, the bug is triggered much more frequently. I've changed my
> server hardware so I could capture serial output in order to get the
> trace. This trace looked very similar as reported by Lu Fengqi. As it
> currently stands, I cannot run the cluster as it's almost instantly
> crashing the host.

This has been reported multiple times. Are you able to test the
attached patch? Please let me know whether everything works with it
applied.

I suspect we may still leak some cgroup refcnt even with the patch,
but it might be much harder to trigger with this patch applied.

Thanks.

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6c9c6ac83936..c01245a19ea2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6438,9 +6438,6 @@ void cgroup_sk_alloc_disable(void)
 
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled)
-		return;
-
 	/* Socket clone path */
 	if (skcd->val) {
 		/*
@@ -6453,6 +6450,9 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 		return;
 	}
 
+	if (cgroup_sk_alloc_disabled)
+		return;
+
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;