On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> This RFC adds spawn_template, a userspace-controlled exec acceleration
> mechanism for runtimes that repeatedly start the same executable with
> different argv, envp, and per-spawn file descriptor setup.
> 
> The main target is agent runtimes. Modern coding agents repeatedly start
> short-lived helper tools such as rg, git, sed, awk, python, node, and
> shell wrappers while they inspect and edit a workspace. Those runtimes
> already know which tools are hot, and they are also the right place to
> decide policy. The kernel does not choose names such as rg, git, or sed.
> Userspace opts in by creating a template fd for one executable, then uses
> that fd for later spawns. Launchers, shells, and build systems have a
> similar repeated-startup shape and could use the same primitive, but the
> agent runtime case is the main motivation for this RFC.
> 
[..]
> A typical agent runtime would keep one template per hot executable and
> still build argv, envp, cwd, and pipe wiring for each tool call:
> 
>     rg_tmpl = spawn_template_create("/usr/bin/rg");
> 
>     for each search request:
>         out_r, out_w = pipe_cloexec();
>         err_r, err_w = pipe_cloexec();
>         actions = [
>             FCHDIR(worktree_fd),
>             DUP2(out_w, STDOUT_FILENO),
>             DUP2(err_w, STDERR_FILENO),
>         ];
>         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
>         close(out_w);
>         close(err_w);
>         read out_r and err_r;
>         waitid(P_PIDFD, child.pidfd, ...);
> 
> 
[..]
> The cached state is intentionally small. The template fd keeps the opened
> main executable file, an optional absolute path string, the creator
> credential pointer, and the deny-write state. The executable identity key
> records device, inode, size, mode, owner, ctime, and mtime, and is
> rechecked before cached metadata is used. The ELF cache keeps only the
> main executable's ELF header, program header table, and program header
> count.
> 
>     cached in this RFC          not cached in this RFC
>     ------------------          ----------------------
>     opened main executable      PT_INTERP metadata
>     executable identity key     shared-library graph
>     main ELF header             VMA layout metadata
>     main ELF program headers    cross-process metadata sharing
>     creator cred pointer
>     deny-write state
> 
> This RFC does not cache ELF interpreter metadata, shared-library
> dependency state, or derived mapping-layout state. Shared-library
> resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> state. It also does not share cached executable metadata between template
> fds created by different processes. Each template owns its small cached
> metadata object in this RFC.
> 
> Performance
> ===========
> 
[..]
> Workload     Calls  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
> 

This problem is dear to my heart and I have been pondering it on and off
for some time now. The entire fork + exec idiom is terrible and needs to
be retired.

Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
some time ago and it generated remarkably similar code. But that's a
tangent.

I'm rather confused by the angle in the patchset. Most of this shaves
off a tiny amount of work, while retaining the primary avoidable reason
for bad performance: the very fact that fork is part of the picture,
especially the part mucking with mm. Creating a pristine process is the
way to go.

Additionally there is a known problem where transiently copied file
descriptors on fork + exec cause a headache in multithreaded programs
doing something like this in parallel. I only did a cursory reading, but
it seems your patchset keeps the same problem in place.

There are numerous impactful ways to speed up execs both in terms of
single-threaded cost and their multicore scalability, most of which
would be immediately usable by all programs without an opt-in. imo these
need to be exhausted before something like a "template" can be
considered.

Per the above, the primary win would stem from *NOT* messing with mm.

As in, whatever the interface, it needs to create an "empty" target
process (for lack of a better term).

In terms of userspace-visible APIs, a clean solution escapes me.

Some time ago I proposed returning a handle which is populated over time
by the parent-to-be. One of the problems with it I failed to consider at
the time is NUMA locality -- what if the process to be created is going
to run on another domain? For example, opening and installing a file for
its later use will result in avoidable loss of locality for some of the
in-kernel data. That's on top of the fd vs fork problem.

From a perf standpoint, the final goal of whatever mechanism should be a
state where the target process avoids copying any state it does not need
and allocates any memory it needs from its local NUMA node (whatever it
may happen to be). Of course if no affinity is assigned it may happen to
move again and lose such locality, nothing can be done about that. But
pretend the process is to run in a specific node the parent is NOT
running in.

So I think the pragmatic way forward is to implement something close to
posix_spawn in the kernel. It may make sense for the thing to take the
PATH argument for repeated exec attempts. I understand this is of no use
in your particular case, but it very much IS of use for most of the
real world. The initial implementation might even start with doing vfork
just to get it off the ground.
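
For reference, a minimal sketch of what the quoted rg loop boils down to
with today's userspace posix_spawnp(3), which already does the PATH
search and takes file actions. The helper name spawn_capture() is made
up for illustration; stderr wiring and the fchdir step are left out for
brevity, and output beyond the buffer is simply dropped.

```c
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper: run argv[0] (looked up via PATH) with stdout
 * redirected into a pipe, copy up to bufsz - 1 bytes of its output into
 * buf, and return the child's exit status, or -1 on failure.
 */
int spawn_capture(char *const argv[], char *buf, size_t bufsz)
{
    posix_spawn_file_actions_t fa;
    int out[2], status, err;
    size_t len = 0;
    ssize_t n;
    pid_t pid;

    assert(argv && argv[0] && buf && bufsz > 0);
    buf[0] = '\0';

    if (pipe(out) < 0)
        return -1;
    if (posix_spawn_file_actions_init(&fa) != 0) {
        close(out[0]);
        close(out[1]);
        return -1;
    }

    /* Executed on the child side: drop the read end, wire up stdout. */
    posix_spawn_file_actions_addclose(&fa, out[0]);
    posix_spawn_file_actions_adddup2(&fa, out[1], STDOUT_FILENO);
    posix_spawn_file_actions_addclose(&fa, out[1]);

    err = posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(out[1]); /* parent keeps only the read end */
    if (err != 0) {
        close(out[0]);
        return -1;
    }

    /* Drain the pipe until EOF or until the buffer is full. */
    while (len < bufsz - 1 &&
           (n = read(out[0], buf + len, bufsz - 1 - len)) > 0)
        len += (size_t)n;
    buf[len] = '\0';
    close(out[0]);

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling it with argv = { "echo", "hello", NULL } and a 64-byte buffer
should leave "hello\n" in the buffer. An in-kernel variant could keep
roughly this shape for callers while changing what happens underneath.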

The next step would be to extend the interface with means to AVOID
copying any file descriptors. There could be a dedicated file action
which tells the kernel to avoid such copies or something like a
close_range file action (or close_from) -- with a range like <0, INT_MAX>
you know no fds are copied.
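
To illustrate the close_range idea, here is a rough userspace
approximation, assuming Linux with close_range(2) (5.9 and later) and a
/proc fallback otherwise. The names close_from() and
fd_survives_in_child() are made up for this sketch. Note that fork
still copies the whole fd table here and the child then throws it away,
which is exactly the work a dedicated file action would let the kernel
skip.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Close every fd >= lowfd in the calling process. Returns 0 or -1. */
int close_from(int lowfd)
{
    struct dirent *e;
    DIR *d;

    assert(lowfd >= 0);
#ifdef SYS_close_range
    if (syscall(SYS_close_range, (unsigned int)lowfd, ~0U, 0U) == 0)
        return 0;
#endif
    /*
     * Fallback for kernels without close_range(2): walk /proc/self/fd.
     * Not async-signal-safe, so only suitable for single-threaded use.
     */
    d = opendir("/proc/self/fd");
    if (!d)
        return -1;
    while ((e = readdir(d)) != NULL) {
        int fd = atoi(e->d_name); /* "." and ".." parse as 0 */

        if (e->d_name[0] != '.' && fd >= lowfd && fd != dirfd(d))
            close(fd);
    }
    closedir(d);
    return 0;
}

/*
 * Hypothetical check: fork a child, have it call close_from(3), and
 * report whether probe_fd is still open there. Returns 1 if probe_fd
 * survived, 0 if it was closed, -1 on error.
 */
int fd_survives_in_child(int probe_fd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        close_from(3);
        _exit(fcntl(probe_fd, F_GETFD) != -1 ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With a range starting at 3 only stdin, stdout and stderr remain in the
child; the parent's descriptors are untouched.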

For the NUMA angle to be sorted out, any file action which opens a file
or dups from the parent needs to execute in the child. And frankly
something would be needed to ask the scheduler where it thinks the
child is going to run, so that the task_struct itself can also be
allocated with the right backing.

I have not looked into what's needed to create a new process and NOT
mess with mm, but I don't think there are unsolvable problems there, at
worst some churn.

There are of course other parameters which need to be sorted out, that's
covered by the posix_spawn thing.

This e-mail is long enough, so I'm not going to go into issues
concerning exec itself right now.

tl;dr I would suggest redoing the patchset as posix_spawn and then doing
the actual optimization of not cloning mm itself.
