I'm not posting this on tech@ because, besides the upcoming release,
I'm still not fully convinced with the explanations and solutions (diffs
below) I've come to so far with this bug:

  https://marc.info/?l=openbsd-tech&m=176956435717456&w=2

Perhaps those who hacked vi(1) in the past can lend me a hand.  First of
all, a concise explanation:

vi(1) crashes when quitting from a temporary buffer in visual mode while
running global from ex(1).  Steps to reproduce:

  $ printf 'foo\nfoo\nfoo\n' > foo
  $ ex foo
  foo: unmodified: line 3
  :g/foo/vi
  :q
  :q
  ex(33568) in free(): write to free mem 0xd2bda37b780[0..7]@192
  Abort trap (core dumped)

This is not reproducible every time.  You need to repeat the above steps
many times to get the crash.  It's very tricky to know where the real
cause is, gdb shows some write-to-free-mem error like the above always
in a different place, in many different part of the code.  Some times
you get a segfault, I realized this segfault is a second bug, *always*
reproducible, right after the first attempt, using MALLOC_OPTIONS=J:

  $ MALLOC_OPTIONS=J ex foo
  ... (same steps than above)

  $ egdb ex ex.core
  Reading symbols from ex...
  [New process 442438]
  Core was generated by `ex'.
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  0x00000a944a0a386c in rcv_sync (sp=0xa96582ec780, flags=0)
      at /usr/src/usr.bin/vi/build/../common/recover.c:257
  257                   if (ep->db->sync(ep->db, R_RECNOSYNC)) {

There, *ep is poisoned:

  (gdb) p *ep
  $1 = {refcnt = -538976289, db = 0xdfdfdfdfdfdfdfdf, 
    c_lp = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 
0xdfdfdfdfdfdfdfdf>, c_len = 16131858542891098079, c_lno = 3755991007, 
    c_nlines = 3755991007, log = 0xdfdfdfdfdfdfdfdf, 
    l_lp = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 
0xdfdfdfdfdfdfdfdf>, l_len = 16131858542891098079, l_high = 3755991007, 
    l_cur = 3755991007, l_cursor = {lno = 3755991007, 
      cno = 16131858542891098079}, 
    lundo = (FORWARD | BACKWARD | unknown: 0xdfdfdfdc), marks = {
      lh_first = 0xdfdfdfdfdfdfdfdf}, mdev = -538976289, 
    minode = 16131858542891098079, mtim = {tv_sec = -2314885530818453537, 
      tv_nsec = -2314885530818453537}, fcntl_fd = -538976289, 
    rcv_path = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 
0xdfdfdfdfdfdfdfdf>, 
    rcv_mpath = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address 
0xdfdfdfdfdfdfdfdf>, rcv_fd = -538976289, flags = 57055}


The cause of this segfault is that after quitting the last screen, the
code makes an extra loop, then the last time it reaches rcv_sync() in
ex() (ex/ex.c:137), *sp->ep is already poisoned.

        /* Sync recovery if changes were made. */
        if (F_ISSET(sp->ep, F_RCV_SYNC))
                rcv_sync(sp, 0);

It's all about timing errors in when and where sp->ep->refcnt and
ep->refcnt are incremented and decremented.  I've found several ways of
managing this counter to avoid the segfault, but not all of those ways
do it without breaking things.  The following are the best options I got
so far:


Index: ex/ex.c
===================================================================
RCS file: /cvs/src/usr.bin/vi/ex/ex.c,v
diff -u -p -u -p -r1.23 ex.c
--- ex/ex.c     23 Jun 2023 15:06:45 -0000      1.23
+++ ex/ex.c     18 Apr 2026 06:51:53 -0000
@@ -1601,7 +1601,10 @@ rsuccess:        tmp = 0;
        gp->if_name = NULL;
 
        /* Turn off the global bit. */
-       F_CLR(sp, SC_EX_GLOBAL);
+       if (F_ISSET(sp, SC_EX_GLOBAL)) {
+               F_CLR(sp, SC_EX_GLOBAL);
+               --sp->ep->refcnt;
+       }
 
        return (tmp);
 }


Another option for the same issue:

Index: common/exf.c
===================================================================
RCS file: /cvs/src/usr.bin/vi/common/exf.c,v
diff -u -p -u -p -r1.50 exf.c
--- common/exf.c        15 Feb 2024 00:55:01 -0000      1.50
+++ common/exf.c        15 Apr 2026 09:39:56 -0000
@@ -407,7 +407,8 @@ file_init(SCR *sp, FREF *frp, char *rcv_
                O_CLR(sp, O_READONLY);
 
        /* Switch... */
-       ++ep->refcnt;
+       if (!F_ISSET(sp, SC_EX_GLOBAL))
+               ++ep->refcnt;
        sp->ep = ep;
        sp->frp = frp;
 


Now comes what was extremely difficult to find.  Once the segfault above
is solved, the same steps still reproduce write-after-free errors core
dumps like the one in the first example.  (Do NOT use MALLOC_OPTIONS=J
with this case, it'll hide these errors!)  This happens at the point ex
commands are called from ex_cmd() loop (ex/ex.c:1374):

        /*
         * Call the underlying function for the ex command.
         *
         * XXX
         * Interrupts behave like errors, for now.
         */
        if (ecp->cmd->fn(sp, ecp) || INTERRUPTED(sp)) {
                if (F_ISSET(gp, G_SCRIPTED))
                        F_SET(sp, SC_EXIT_FORCE);
                goto err;
        }

As with the other bug, when you quit the last screen (the last time
ex_quit is run in the above conditional) *ecp contains garbage.  If you
comment out free(ecp) in ex_discard() the problem disapears, obviously
leaking memory.  The best workaround I came so far is to reload *ecp
right after the above conditional:


Index: ex/ex.c
===================================================================
RCS file: /cvs/src/usr.bin/vi/ex/ex.c,v
diff -u -p -u -p -r1.23 ex.c
--- ex/ex.c     23 Jun 2023 15:06:45 -0000      1.23
+++ ex/ex.c     16 Apr 2026 14:48:02 -0000
@@ -1373,6 +1373,7 @@ addr_verify:
                        F_SET(sp, SC_EXIT_FORCE);
                goto err;
        }
+       ecp = &gp->excmd;
 
 #ifdef DEBUG
        /* Make sure no function left global temporary space locked. */



So far I'm not aware of regressions produced by the diffs I've posted
but, as I said, I'm not convinced, specially with this last one.  Any
help will be appreciated.


-- 
Walter

Reply via email to