I'm not posting this on tech@ because, besides the upcoming release, I'm still not fully convinced with the explanations and solutions (diffs below) I've come to so far with this bug:
https://marc.info/?l=openbsd-tech&m=176956435717456&w=2 Perhaps those who hacked vi(1) in the past can lend me a hand. First of all, a concise explanation: vi(1) crashes when quitting from a temporary buffer in visual mode while running global from ex(1). Steps to reproduce: $ printf 'foo\nfoo\nfoo\n' > foo $ ex foo foo: unmodified: line 3 :g/foo/vi :q :q ex(33568) in free(): write to free mem 0xd2bda37b780[0..7]@192 Abort trap (core dumped) This is not reproducible every time. You need to repeat the above steps many times to get the crash. It's very tricky to know where the real cause is, gdb shows some write-to-free-mem error like the above always in a different place, in many different part of the code. Some times you get a segfault, I realized this segfault is a second bug, *always* reproducible, right after the first attempt, using MALLOC_OPTIONS=J: $ MALLOC_OPTIONS=J ex foo ... (same steps than above) $ egdb ex ex.core Reading symbols from ex... [New process 442438] Core was generated by `ex'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00000a944a0a386c in rcv_sync (sp=0xa96582ec780, flags=0) at /usr/src/usr.bin/vi/build/../common/recover.c:257 257 if (ep->db->sync(ep->db, R_RECNOSYNC)) { There, *ep is poisoned: (gdb) p *ep $1 = {refcnt = -538976289, db = 0xdfdfdfdfdfdfdfdf, c_lp = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 0xdfdfdfdfdfdfdfdf>, c_len = 16131858542891098079, c_lno = 3755991007, c_nlines = 3755991007, log = 0xdfdfdfdfdfdfdfdf, l_lp = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 0xdfdfdfdfdfdfdfdf>, l_len = 16131858542891098079, l_high = 3755991007, l_cur = 3755991007, l_cursor = {lno = 3755991007, cno = 16131858542891098079}, lundo = (FORWARD | BACKWARD | unknown: 0xdfdfdfdc), marks = { lh_first = 0xdfdfdfdfdfdfdfdf}, mdev = -538976289, minode = 16131858542891098079, mtim = {tv_sec = -2314885530818453537, tv_nsec = -2314885530818453537}, fcntl_fd = -538976289, rcv_path = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 0xdfdfdfdfdfdfdfdf>, rcv_mpath = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 0xdfdfdfdfdfdfdfdf>, rcv_fd = -538976289, flags = 57055} The cause of this segfault is that after quitting the last screen, the code makes an extra loop, then the last time it reaches rcv_sync() in ex() (ex/ex.c:137), *sp->ep is already poisoned. /* Sync recovery if changes were made. */ if (F_ISSET(sp->ep, F_RCV_SYNC)) rcv_sync(sp, 0); It's all about timing errors in when and where sp->ep->refcnt and ep->refcnt are incremented and decremented. I've found several ways of managing this counter to avoid the segfault, but not all of those ways do it without breaking things. The following are the best options I got so far: Index: ex/ex.c =================================================================== RCS file: /cvs/src/usr.bin/vi/ex/ex.c,v diff -u -p -u -p -r1.23 ex.c --- ex/ex.c 23 Jun 2023 15:06:45 -0000 1.23 +++ ex/ex.c 18 Apr 2026 06:51:53 -0000 @@ -1601,7 +1601,10 @@ rsuccess: tmp = 0; gp->if_name = NULL; /* Turn off the global bit. */ - F_CLR(sp, SC_EX_GLOBAL); + if (F_ISSET(sp, SC_EX_GLOBAL)) { + F_CLR(sp, SC_EX_GLOBAL); + --sp->ep->refcnt; + } return (tmp); } Another option for the same issue: Index: common/exf.c =================================================================== RCS file: /cvs/src/usr.bin/vi/common/exf.c,v diff -u -p -u -p -r1.50 exf.c --- common/exf.c 15 Feb 2024 00:55:01 -0000 1.50 +++ common/exf.c 15 Apr 2026 09:39:56 -0000 @@ -407,7 +407,8 @@ file_init(SCR *sp, FREF *frp, char *rcv_ O_CLR(sp, O_READONLY); /* Switch... */ - ++ep->refcnt; + if (!F_ISSET(sp, SC_EX_GLOBAL)) + ++ep->refcnt; sp->ep = ep; sp->frp = frp; Now comes what was extremely difficult to find. Once the segfault above is solved, the same steps still reproduce write-after-free errors core dumps like the one in the first example. (Do NOT use MALLOC_OPTIONS=J with this case, it'll hide these errors!) This happens at the point ex commands are called from ex_cmd() loop (ex/ex.c:1374): /* * Call the underlying function for the ex command. * * XXX * Interrupts behave like errors, for now. */ if (ecp->cmd->fn(sp, ecp) || INTERRUPTED(sp)) { if (F_ISSET(gp, G_SCRIPTED)) F_SET(sp, SC_EXIT_FORCE); goto err; } As with the other bug, when you quit the last screen (the last time ex_quit is run in the above conditional) *ecp contains garbage. If you comment out free(ecp) in ex_discard() the problem disapears, obviously leaking memory. The best workaround I came so far is to reload *ecp right after the above conditional: Index: ex/ex.c =================================================================== RCS file: /cvs/src/usr.bin/vi/ex/ex.c,v diff -u -p -u -p -r1.23 ex.c --- ex/ex.c 23 Jun 2023 15:06:45 -0000 1.23 +++ ex/ex.c 16 Apr 2026 14:48:02 -0000 @@ -1373,6 +1373,7 @@ addr_verify: F_SET(sp, SC_EXIT_FORCE); goto err; } + ecp = &gp->excmd; #ifdef DEBUG /* Make sure no function left global temporary space locked. */ So far I'm not aware of regressions produced by the diffs I've posted but, as I said, I'm not convinced, specially with this last one. Any help will be appreciated. -- Walter

