s390x: Add reverse debugging test for s390x

Alex Bennée Mon, 01 Dec 2025 04:44:35 -0800

Ilya Leoshkevich <[email protected]> writes:

> On Mon, 2025-12-01 at 10:36 +0000, Alex Bennée wrote:
>> Ilya Leoshkevich <[email protected]> writes:
>> 
>> > On Sun, 2025-11-30 at 20:03 +0100, Ilya Leoshkevich wrote:
>> > > On Sun, 2025-11-30 at 19:32 +0100, Ilya Leoshkevich wrote:
>> > > > On Sun, 2025-11-30 at 16:47 +0000, Alex Bennée wrote:
>> > > > > Ilya Leoshkevich <[email protected]> writes:
>> > > > > 
>> > > > > > On Fri, 2025-11-28 at 18:25 +0100, Ilya Leoshkevich wrote:
>> > > > > > > On Fri, 2025-11-28 at 14:39 +0100, Thomas Huth wrote:
>> > > > > > > > From: Thomas Huth <[email protected]>
<snip>
>> > > > > If the async_run_on_cpu is called from the vcpu thread in
>> > > > > response to a deterministic event at a known point in time it
>> > > > > should be fine. If it came from another thread that is not
>> > > > > synchronised via replay_lock then things will go wrong.
>> > > > > 
>> > > > > But this is a VM load save helper?
>> > > > 
>> > > > Yes, and it's called from the main thread. Either during
>> > > > initialization, or as a reaction to GDB packets.
>> > > > 
>> > > > Here is the call stack:
>> > > > 
>> > > >   qemu_loadvm_state()
>> > > >     qemu_loadvm_state_main()
>> > > >       qemu_loadvm_section_start_full()
>> > > >         vmstate_load()
>> > > >           vmstate_load_state()
>> > > >             cpu_post_load()
>> > > >               tcg_s390_tod_updated()
>> > > >                 update_ckc_timer()
>> > > >                   timer_mod()
>> > > >           s390_tod_load()
>> > > >             qemu_s390_tod_set()  # via tdc->set()
>> > > >               async_run_on_cpu(tcg_s390_tod_updated)
>> > > > 
>> > > > So you think we may have to take the replay lock around
>> > > > load_snapshot()? So that all async_run_on_cpu() calls it makes
>> > > > end up being handled by the vCPU thread deterministically.
<snip>
>> > 
>> > I believe now I at least understand what the race is about:
>> > 
>> > - cpu_post_load() fires the TOD timer immediately.
>> > 
>> > - s390_tod_load() schedules work for firing the TOD timer.
>> 
>> Is this a duplicate of work then? Could we just rely on one or the
>> other? If you drop the cpu_post_load() tweak then the vmstate load
>> helper should still ensure everything works right?
>
> Getting rid of it fixes the problem and makes sense anyway.
>
>> > - If rr loop sees work and then timer, we get one timer callback.
>> > 
>> > - If rr loop sees timer and then work, we get two timer callbacks.
>> 
>> If the timer is armed we should expect at least to execute a few
>> instructions before triggering the timer, unless it was armed already
>> expired.
>
> Yes, it is armed expired.
>
> Isn't it a deficiency in record-replay that work and timers are not
> ordered relative to each other? Can't it bite us somewhere else?


They normally should be, although I notice that:

  void icount_handle_deadline(void)
  {
      assert(qemu_in_vcpu_thread());
      int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
                                                    QEMU_TIMER_ATTR_ALL);

      /*
       * Instructions, interrupts, and exceptions are processed in cpu-exec.
       * Don't interrupt cpu thread, when these events are waiting
       * (i.e., there is no checkpoint)
       */
      if (deadline == 0) {
          icount_notify_aio_contexts();
      }
  }

should run the pre-expired timers before we exec the current TB. But the
comment suggests it is not expecting any checkpoint-related activity. I
wonder if we can assert that this is the case to catch future issues.

>> > - Record and replay may diverge due to this race.
>> > 
>> > - In this particular case divergence makes rr loop spin: it sees
>> >   that TOD timer has expired, but cannot invoke its callback,
>> >   because there is no recorded CHECKPOINT_CLOCK_VIRTUAL.
>> > 
>> > - The order in which rr loop sees work and timer depends on whether
>> >   and when rr loop wakes up during load_snapshot().
>> > 
>> > - rr loop may wake up after the main thread kicks the CPU and drops
>> >   the BQL, which may happen if it calls, e.g.,
>> >   qemu_cond_wait_bql().
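
To make the two orderings in the list above concrete, here is a toy model
in plain C. It is not QEMU code: toy_state, toy_run_work(),
toy_run_timers() and toy_count_callbacks() are hypothetical names. The
model assumes that re-arming an already armed timer is a no-op, and that
the queued work does nothing except re-arm the timer as already expired.

```c
#include <stdbool.h>

/* Toy model only; every name below is hypothetical, not a QEMU API. */
struct toy_state {
    bool timer_armed;   /* timer is armed and already expired */
    bool work_pending;  /* queued work that will re-arm the timer */
    int callbacks;      /* timer callbacks delivered so far */
};

/* Models the queued work item: it re-arms the timer as expired. */
static void toy_run_work(struct toy_state *s)
{
    if (s->work_pending) {
        s->work_pending = false;
        s->timer_armed = true;  /* no-op if the timer is still armed */
    }
}

/* Models timer processing: an armed, expired timer fires exactly once. */
static void toy_run_timers(struct toy_state *s)
{
    if (s->timer_armed) {
        s->timer_armed = false;
        s->callbacks++;
    }
}

/*
 * Start from the state after snapshot load: the timer is armed expired
 * and the work item is queued. Process them in the requested order,
 * then drain whatever is left, and report how many callbacks ran.
 */
int toy_count_callbacks(bool work_first)
{
    struct toy_state s = {
        .timer_armed = true,
        .work_pending = true,
        .callbacks = 0,
    };

    if (work_first) {
        toy_run_work(&s);
        toy_run_timers(&s);
    } else {
        toy_run_timers(&s);
        toy_run_work(&s);
    }

    /* Drain: work first, then any timer it re-armed. */
    toy_run_work(&s);
    toy_run_timers(&s);
    return s.callbacks;
}
```

In this model toy_count_callbacks(true) gives one callback and
toy_count_callbacks(false) gives two. That is the kind of difference
between record and replay that leaves replay waiting for a checkpoint
that was never recorded.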

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [RFC PATCH] tests/functional/s390x: Add reverse debugging test for s390x
