On Wed, 2026-06-24 at 08:54 +0000, Miao, Jun wrote:
> Hi Kai,
>
> > (Reminder: you forgot the [email protected]).
> >
> Ok, + CC linux-sgx in this reply.
>
> > Could you move some context from your v1 and refine together with the above
> > two paragraphs?
>
> Okay, what about this commit description in v5?
>
> Subject: [PATCH v5] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop
>
> During early boot, ksgxd (Intel Software Guard Extensions Kernel Thread)

IMHO there's no need to be so verbose. The patch is titled "x86/sgx: ...", so I
think people who are interested in this patch should already have some basic
idea of what SGX is.

> iterates over all post-kexec dirty EPC pages in a tight loop calling
> cond_resched() after each page. But, on isolated CPUs
> (a common configuration in cloud VMs), cond_resched() never triggers a
> real context switch because TIF_NEED_RESCHED is not set when no competing
> runnable task exists on that CPU.

On second thought, IIUC, the "isolated CPUs (a common configuration in cloud
VMs)" part is confusing and, IMHO, not actually relevant: "isolated CPUs" is
from the host kernel's perspective, but the issue is inside the guest.

Am I missing anything?

>
> BPF LSM subsystem can invoke synchronize_rcu_tasks() at kernel boot time.
> ksgxd() can never be rescheduled() when doing sanitizing all EPC pages.
> As a result, a VM may take a long time to boot:
>
> [ 134.806157] rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since
> boot) is 130631 jiffies old.
> [ 248.086158] INFO: task systemd:1 blocked for more than 122 seconds.
> [ 248.086491] Not tainted 6.8.0-90-generic #91-Ubuntu
> [ 248.086739] 'echo 0 > /proc/sys/kernel/hung_task_timeout_secs' disables
> this message.
> [ 248.086993] task:systemd state:D stack:0 pid:1 tpid:1 ppid:0
> flags:0x00000002
> [ 248.087274] Call Trace:
> ...
> [ 248.087939] schedule_timeout+0x157/0x170
> [ 248.088120] wait_for_completion+0x88/0x150
> [ 248.088304] __wait_rcu_gp+0x17e/0x190
> [ 248.088481] synchronize_rcu_tasks_generic+0x64/0x60
> ...
> [ 248.089047] synchronize_rcu_tasks+0x15/0x20
> [ 248.089260] register_ftrace_direct+0x31f/0x350
> ...
> [ 248.090339] bpf_trampoline_link_prog+0x33/0x60
> [ 248.090518] bpf_tracing_prog_attach+0x3c5/0x5f0
> ...

These [ 248....] timestamps are not needed.

>
> After this patch test result:
> Tests showed using cond_resched_tasks_rcu_qs() reduced the boot time from
> ~50s to ~10.7s (systemd-analyze: 724ms kernel + 1.575s initrd + 8.481s
> userspace = 10.782s)

Thinking more, the ~50s boot time isn't quite clear to me either. The call
trace above shows that systemd was blocked for "more than 122 seconds".

Where did the ~50s come from? I suppose it was the kernel boot time (similar to
the "724ms kernel" you mentioned)?

If that is the kernel boot time, I think we just need to mention ~50s vs ~700ms.

>
> [ kai: completely trim down/rewrite changelog ]

No need to have this part. The obvious reason is that it doesn't carry my SoB :-)

I don't quite want to completely rewrite the changelog, but to save time, how
about below?

The kernel resets all EPC pages to a clean state in a loop before using them
for enclaves. The number of EPC pages can be large (e.g., GBs worth), so
resetting them can take a fair amount of time. Because of that, during
early boot, the kernel resets EPC pages in a kernel thread, ksgxd(), which
calls cond_resched() after resetting each EPC page.

This is fine in most cases, but becomes a problem when other kernel code is
waiting for an RCU-Tasks grace period and the cond_resched() in ksgxd()
never triggers rescheduling. Because cond_resched() doesn't report a
quiescent state when it doesn't trigger rescheduling, the thread waiting for
the RCU-Tasks grace period has to wait until all EPC pages are reset.

For instance, the BPF LSM subsystem can invoke synchronize_rcu_tasks() at
kernel boot time. A VM with a large EPC assigned and BPF LSM enabled can
take a long time to boot, with a call trace triggered:

  rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
  INFO: task systemd:1 blocked for more than 122 seconds.
  ...
  task:systemd state:D stack:0 pid:1 tpid:1 ppid:0 flags:0x00000002
  Call Trace:
  ...
  schedule_timeout+0x157/0x170
  wait_for_completion+0x88/0x150
  __wait_rcu_gp+0x17e/0x190
  synchronize_rcu_tasks_generic+0x64/0x60
  ...
  synchronize_rcu_tasks+0x15/0x20
  register_ftrace_direct+0x31f/0x350
  ...
  bpf_trampoline_link_prog+0x33/0x60
  bpf_tracing_prog_attach+0x3c5/0x5f0

Replace cond_resched() with cond_resched_tasks_rcu_qs(), which explicitly
reports a quiescent state regardless of whether actual rescheduling is
triggered. Resetting all EPC pages in ksgxd() isn't performance critical, so
the extra cost of cond_resched_tasks_rcu_qs() isn't a problem.

Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.

(This assumes the ~50s is the kernel boot time -- please double check.)
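
FWIW, the reasoning in the suggested changelog can be illustrated with a small
userspace toy model. This is only a sketch, not kernel code and not the actual
patch: struct model_cpu and all model_*() functions are made-up names that
merely mimic the behaviour described above (cond_resched() reports a quiescent
state only when it actually reschedules, while cond_resched_tasks_rcu_qs()
reports one on every call).

```c
/* Userspace toy model, NOT kernel code. All names here are hypothetical. */
#include <stdbool.h>

struct model_cpu {
	bool need_resched;       /* stands in for TIF_NEED_RESCHED */
	unsigned long qs_count;  /* quiescent states seen by RCU-Tasks */
};

/* Models cond_resched(): only a real reschedule counts as a quiescent state. */
static void model_cond_resched(struct model_cpu *cpu)
{
	if (cpu->need_resched) {
		cpu->need_resched = false;
		cpu->qs_count++;
	}
}

/*
 * Models cond_resched_tasks_rcu_qs(): a quiescent state is reported on
 * every call, whether or not a reschedule was pending.
 */
static void model_cond_resched_tasks_rcu_qs(struct model_cpu *cpu)
{
	cpu->qs_count++;
	cpu->need_resched = false;
}

/*
 * Models the sanitization loop: one resched point per EPC page.
 * Returns how many quiescent states were reported during the loop.
 */
static unsigned long model_sanitize(struct model_cpu *cpu,
				    unsigned long nr_pages,
				    void (*resched)(struct model_cpu *))
{
	unsigned long before = cpu->qs_count;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		/* resetting one EPC page would happen here */
		resched(cpu);
	}

	return cpu->qs_count - before;
}
```

With need_resched never set (no competing runnable task), looping over 1000
pages with model_cond_resched() reports no quiescent state at all, so a waiter
would be stuck until the loop finishes. The same loop with
model_cond_resched_tasks_rcu_qs() reports one per page.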