Some spelling/grammar nits below.  Sorry I didn't run a spelling/grammar check
last night, since it was already very late for me.

Btw, v4 arrived quicker than I expected.  We are still in the merge window, so
maintainers are not expected to review this.  For more info, see the "Merge
window" section in Documentation/process/maintainer-tip.rst.

On Wed, 2026-06-24 at 22:20 +0800, Jun Miao wrote:
> The kernel resets all EPC pages to a clean state in a loop before using them
> for enclaves.  The number of EPC pages could be large (e.g., GBs) thus
> resetting them could take a fair amount of time.  Because of that, during

s/fair/significant

> early boot, the kernel resets EPC pages through a kernel thread ksgxd() and
> there's a cond_resched() after resetting each EPC page.
> 
> This is fine in most cases, but becomes a problem when there's other kernel
> code waiting for RCU-Tasks grace period but the cond_resched() in ksgxd()
> never triggers rescheduling.  Because cond_resched() doesn't report quiescent
> state when it doesn't trigger rescheduling, the thread that is waiting for
> RCU-Tasks grace period will need to wait until all EPC pages are reset.
> 
> For instance, BPF LSM subsystem can invoke synchronize_rcu_tasks() at kernel
> boot time.  A VM with a large EPC assigned and have BPF LSM enabled can take
                                                 ^
Remove the 'have'.

> a long time to boot, with a call trace triggered:
> 
>     rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
>     INFO: task systemd:1 blocked for more than 122 seconds.
>     ...
>     task:systemd  state:D stack:0  pid:1  tpid:1  ppid:0  flags:0x00000002
>     Call Trace:
>     ...
>     schedule_timeout+0x157/0x170
>     wait_for_completion+0x88/0x150
>     __wait_rcu_gp+0x17e/0x190
>     synchronize_rcu_tasks_generic+0x64/0x60
>     ...
>     synchronize_rcu_tasks+0x15/0x20
>     register_ftrace_direct+0x31f/0x350
>     ...
>     bpf_trampoline_link_prog+0x33/0x60
>     bpf_tracing_prog_attach+0x3c5/0x5f0
> 
> Replace cond_resched() with cond_resched_tasks_rcu_qs() which explicitly 
> report quiescent

s/report quiescent/reports quiescent state

> regardless whether actual rescheduling is triggered.  Resetting all EPC pages 
> in ksgxd()

s/regardless/regardless of

> isn't performance critical so the extra cost of cond_resched_tasks_rcu_qs() 
> isn't a problem.
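
To spell out the reasoning for anyone following along, here is a toy userspace
model of the difference (not kernel code and not the actual patch).  All the
model_* names are made up for illustration, and the model assumes that
cond_resched_tasks_rcu_qs() notes a quiescent state unconditionally before
doing a normal cond_resched(), which is my understanding of the helper:

```c
#include <stdbool.h>

/* Toy userspace model; none of these are the real kernel functions. */
static bool model_need_resched;        /* stands in for "a reschedule is pending" */
static unsigned long model_qs_count;   /* quiescent states seen by the modelled RCU-Tasks */

/* Model of cond_resched(): a quiescent state is noted only when a
 * reschedule actually happens (a voluntary context switch). */
static void model_cond_resched(void)
{
	if (model_need_resched) {
		model_need_resched = false;
		model_qs_count++;
	}
}

/* Model of cond_resched_tasks_rcu_qs(): note a quiescent state
 * unconditionally, then behave like the plain variant. */
static void model_cond_resched_tasks_rcu_qs(void)
{
	model_qs_count++;
	model_cond_resched();
}

/* Model of the page-reset loop: "reset" nr_pages pages on an otherwise
 * idle system (no reschedule ever pending) and return how many
 * quiescent states were reported along the way. */
static unsigned long model_reset_pages(unsigned long nr_pages,
				       void (*resched)(void))
{
	unsigned long i;

	model_qs_count = 0;
	model_need_resched = false;

	for (i = 0; i < nr_pages; i++) {
		/* resetting page i is omitted in this model */
		resched();
	}

	return model_qs_count;
}
```

Calling model_reset_pages(n, model_cond_resched) reports no quiescent state
for the whole loop, which corresponds to the grace period stalling until all
pages are reset.  Passing model_cond_resched_tasks_rcu_qs instead reports one
per page.
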
> 
> Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.
> 
> Reported-by: Challvy Tee <[email protected]>
> Link: https://github.com/systemd/systemd/issues/40423
> Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
> Tested-by: Challvy Tee <[email protected]>
> Suggested-by: Kai Huang <[email protected]>
> Co-developed-by: Fan Du <[email protected]>
> Signed-off-by: Fan Du <[email protected]>
> Signed-off-by: Jun Miao <[email protected]>


Btw, English isn't my first language, so feel free to use AI to improve the
above changelog if you want (I just did, but I'm not sure all the changes are
better).

Anyway, feel free to add:

Reviewed-by: Kai Huang <[email protected]>
