On Wed, Apr 08, 2026 at 11:45:23PM +0200, Mark Kettenis wrote:
> > Date: Wed, 8 Apr 2026 19:24:56 +0200
> > From: Jeremie Courreges-Anglas <[email protected]>

> > We have proof that the system doesn't necessarily crash after that
> > message is printed.  kmos tested the db_enter removal yesterday
> > and confirmed that he got the message on the console without the
> > system crashing.  Using the diff below, I got this today on my LDOM's
> > console:

> > Apr  8 11:37:26 ports /bsd: ctx_free: context 1641 still active in dmmu
> > Apr  8 12:21:12 ports /bsd: ctx_free: context 7896 still active in dmmu
> > Apr  8 12:24:29 ports /bsd: ctx_free: context 3150 still active in dmmu
> > Apr  8 13:43:56 ports /bsd: ctx_free: context 4221 still active in dmmu
> > Apr  8 15:55:50 ports /bsd: ctx_free: context 1264 still active in dmmu
> > Apr  8 18:55:48 ports /bsd: ctx_free: context 5664 still active in dmmu

> Sorry, but this is really bad.  It means stale TSB entries have been
> left behind and may be re-used when the context is re-used.  And that
> could lead to some serious memory corruption.

> If we want to paper over this issue, we should at least invalidate the
> stale TSB entry.  So something like:

>       for (i = 0; i < TSBENTS; i++) {
>               tag = READ_ONCE(&tsb_dmmu[i].tag);
>               if (TSB_TAG_CTX(tag) == oldctx) {
>                       atomic_cas_ulong(&tsb_dmmu[i].tag, tag,
>                           TSB_TAG_INVALID);
>                       printf("ctx_free: context %d still active in dmmu\n",
>                           oldctx);
>               }
>               tag = READ_ONCE(&tsb_immu[i].tag);
>               if (TSB_TAG_CTX(tag) == oldctx) {
>                       atomic_cas_ulong(&tsb_immu[i].tag, tag,
>                           TSB_TAG_INVALID);
>                       printf("ctx_free: context %d still active in immu\n",
>                           oldctx);
>               }
>       }
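
For anyone who wants to poke at the idea outside the kernel, here is a
minimal user-space sketch of the same scan-and-invalidate pattern using C11
atomics.  Everything in it is an assumption made for illustration: the tag
layout, TAG_INVALID, make_tag(), tag_ctx() and tsb_invalidate_ctx() are
hypothetical stand-ins, not the sparc64 pmap definitions, and
atomic_compare_exchange_strong() is swapped in for atomic_cas_ulong().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag layout for this sketch only: context in bits 60:48. */
#define CTX_SHIFT	48
#define CTX_MASK	UINT64_C(0x1fff)
/* Hypothetical invalid marker; it decodes to context 0, assumed reserved. */
#define TAG_INVALID	(UINT64_C(1) << 63)

/* Build a tag from a context number and some low-order address bits. */
uint64_t
make_tag(unsigned int ctx, uint64_t vabits)
{
	return (((uint64_t)ctx & CTX_MASK) << CTX_SHIFT) |
	    (vabits & UINT64_C(0x3fffff));
}

/* Extract the context number from a tag. */
unsigned int
tag_ctx(uint64_t tag)
{
	return (unsigned int)((tag >> CTX_SHIFT) & CTX_MASK);
}

/*
 * Invalidate every entry in the table that still carries context ctx.
 * The compare-and-swap only replaces a tag that is unchanged since it
 * was loaded, so an entry rewritten concurrently is left alone.
 * Returns the number of entries that were invalidated.
 */
int
tsb_invalidate_ctx(_Atomic uint64_t *tsb, size_t nents, unsigned int ctx)
{
	int hits = 0;
	size_t i;

	for (i = 0; i < nents; i++) {
		uint64_t tag = atomic_load(&tsb[i]);

		if (tag == TAG_INVALID || tag_ctx(tag) != ctx)
			continue;
		if (atomic_compare_exchange_strong(&tsb[i], &tag,
		    TAG_INVALID))
			hits++;
	}
	return hits;
}
```

Calling tsb_invalidate_ctx() once per table (a data-side and an
instruction-side array in this sketch) mirrors the two halves of the loop
quoted above; a non-zero return is the case where the kernel would print the
"still active" message.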

I'd definitely prefer something other than the existing db_enter
"solution".  During my last full build I had to restart LDOMs at least 8
times over the course of the 5-day build.  Every few restarts, I have to
drop to single-user mode and run fsck manually to repair filesystems.

The current build is being run with jca's proposed patch and within 4
hours of starting the build, one of the LDOMs already had this on its
console:

ctx_free: context 4575 still active in dmmu

That's before getting to the really memory-intensive parts of the
package builds, as the heavy C++ and Rust builds have to wait for
ports-gcc to finish building some 8-10 hours in.

--Kurt
