On Wed, 2026-06-24 at 08:54 +0000, Miao, Jun wrote:
> Hi Kai,
> 
> > (Reminder: you forgot the [email protected]).
> > 
> Ok, + CC linux-sgx in this reply.
> 
> > Could you move some context from your v1 and refine together with the above
> > two paragraphs?
> 
> Okay, what about this commit description in v5?
> 
> Subject: [PATCH v5] x86/sgx: Fix RCU Tasks stall in EPC sanitization loop
> 
> During early boot, ksgxd (Intel Software Guard Extensions Kernel Thread)

IMHO there's no need to be so verbose.  The patch is titled "x86/sgx: ...", so
I think people who are interested in this patch should already have some basic
idea of what SGX is.

> iterates over all post-kexec dirty EPC pages in a tight loop calling
> cond_resched() after each page.  But, on isolated CPUs
> (a common configuration in cloud VMs), cond_resched() never triggers a
> real context switch because TIF_NEED_RESCHED is not set when no competing
> runnable task exists on that CPU.

On second thought, IIUC, the "isolated CPUs (a common configuration in cloud
VMs)" part is confusing, and actually not relevant IMHO: "isolated CPUs" is from
the host kernel's perspective, but the issue is inside the guest.

Am I missing anything?

> 
> BPF LSM subsystem can invoke synchronize_rcu_tasks() at kernel boot time.
> ksgxd() can never be rescheduled() when doing sanitizing all EPC pages.
> As a result, a VM may take a long time to boot:
> 
> [  134.806157] rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since 
> boot) is 130631 jiffies old.
> [  248.086158] INFO: task systemd:1 blocked for more than 122 seconds.
> [  248.086491] Not tainted 6.8.0-90-generic #91-Ubuntu
> [  248.086739] 'echo 0 > /proc/sys/kernel/hung_task_timeout_secs' disables 
> this message.
> [  248.086993] task:systemd    state:D stack:0    pid:1    tpid:1    ppid:0   
>  flags:0x00000002
> [  248.087274] Call Trace:
> ...
> [  248.087939] schedule_timeout+0x157/0x170
> [  248.088120] wait_for_completion+0x88/0x150
> [  248.088304] __wait_rcu_gp+0x17e/0x190
> [  248.088481] synchronize_rcu_tasks_generic+0x64/0x60
> ...
> [  248.089047] synchronize_rcu_tasks+0x15/0x20
> [  248.089260] register_ftrace_direct+0x31f/0x350
> ...
> [  248.090339] bpf_trampoline_link_prog+0x33/0x60
> [  248.090518] bpf_tracing_prog_attach+0x3c5/0x5f0
> ...

These "[  248....]" timestamps are not needed.

> 
> After this patch test result:
> Tests showed using cond_resched_tasks_rcu_qs() reduced the boot time from
> ~50s to ~10.7s (systemd-analyze: 724ms kernel + 1.575s initrd + 8.481s 
> userspace = 10.782s)

Thinking more, the ~50s boot time isn't quite clear to me either.  The call
trace above shows systemd was blocked for "more than 122 seconds".

Where did the ~50s come from?  I suppose it was the kernel boot time (similar
to the "724ms kernel" you mentioned)?

If that is the kernel boot time, I think we just need to mention ~50s vs ~700ms.
> 
> [ kai: completely trim down/rewrite changelog ]

No need to have this part.  The obvious reason is that my SoB isn't there :-)

I don't quite want to completely re-write the changelog, but to save time, how
about below?

  The kernel resets all EPC pages to a clean state in a loop before using them
  for enclaves.  The number of EPC pages could be large (e.g., GBs), so
  resetting them could take a fair amount of time.  Because of that, during
  early boot, the kernel resets EPC pages in a kernel thread, ksgxd(), and
  calls cond_resched() after resetting each EPC page.

  This is fine in most cases, but becomes a problem when other kernel code is
  waiting for an RCU-Tasks grace period while the cond_resched() in ksgxd()
  never triggers rescheduling.  Because cond_resched() doesn't report a
  quiescent state when it doesn't trigger rescheduling, the thread waiting for
  the RCU-Tasks grace period has to wait until all EPC pages are reset.

  For instance, the BPF LSM subsystem can invoke synchronize_rcu_tasks() at
  kernel boot time.  A VM with a large EPC assigned and BPF LSM enabled can
  take a long time to boot, with a call trace triggered:

    rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
    INFO: task systemd:1 blocked for more than 122 seconds.
    ...
    task:systemd    state:D stack:0    pid:1    tpid:1    ppid:0    flags:0x00000002
    Call Trace:
    ...
    schedule_timeout+0x157/0x170
    wait_for_completion+0x88/0x150
    __wait_rcu_gp+0x17e/0x190
    synchronize_rcu_tasks_generic+0x64/0x60
    ...
    synchronize_rcu_tasks+0x15/0x20
    register_ftrace_direct+0x31f/0x350
    ...
    bpf_trampoline_link_prog+0x33/0x60
    bpf_tracing_prog_attach+0x3c5/0x5f0

  Replace cond_resched() with cond_resched_tasks_rcu_qs(), which explicitly
  reports a quiescent state regardless of whether actual rescheduling is
  triggered.  Resetting all EPC pages in ksgxd() isn't performance critical, so
  the extra cost of cond_resched_tasks_rcu_qs() isn't a problem.

  Tests showed this reduced the VM kernel boot time from ~50s to ~700ms.

(This assumes the ~50s is the kernel boot time -- please double check.)
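
For anyone skimming the thread, the difference between the two helpers can be
illustrated with a toy user-space model in C.  This is only a sketch under my
reading of the changelog above, not kernel code: every toy_* name is
hypothetical, the need_resched argument merely stands in for TIF_NEED_RESCHED,
and "reporting a quiescent state" is reduced to a boolean return value.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy user-space model, NOT kernel code.  All toy_* names are hypothetical
 * and only mimic the behaviour described in the changelog.
 */

/* Models cond_resched(): only an actual reschedule counts as a quiescent state. */
bool toy_cond_resched(bool need_resched)
{
	return need_resched;
}

/* Models cond_resched_tasks_rcu_qs(): a quiescent state is always reported. */
bool toy_cond_resched_tasks_rcu_qs(bool need_resched)
{
	(void)toy_cond_resched(need_resched);	/* still yield if needed */
	return true;
}

/*
 * Walk npages "EPC pages" and return how many were reset before a waiter
 * for an RCU-Tasks grace period first saw a quiescent state.  If none is
 * seen inside the loop, the waiter is stuck until the loop ends, so npages
 * is returned.
 */
size_t toy_pages_before_qs(size_t npages, bool use_rcu_qs, bool competing_task)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		/* (page i would be reset here) */
		bool qs = use_rcu_qs ?
			toy_cond_resched_tasks_rcu_qs(competing_task) :
			toy_cond_resched(competing_task);

		if (qs)
			return i + 1;
	}

	return npages;
}
```

In this model, toy_pages_before_qs(1000, false, false) returns 1000 (the
waiter is stuck for the whole loop), while passing use_rcu_qs = true, or having
a competing task, makes it return 1.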
