On 7/21/25 10:14 AM, Michael Tokarev wrote:
On 21.07.2025 19:29, Pierrick Bouvier wrote:
On 7/21/25 9:23 AM, Pierrick Bouvier wrote:
..
looks like a good target for TSAN, which might expose the race without
really having to trigger it.
https://www.qemu.org/docs/master/devel/testing/main.html#building-and-testing-with-tsan

I think I tried with TSAN and it even gave something useful.
The problem now is to get it reproduced by someone more familiar
with this stuff than me :)

Else, you can reproduce your run using rr record -h (chaos mode) [1],
which randomly schedules threads, until it catches the segfault, and
then you'll have a reproducible case to debug.

In case you have never had the opportunity to use rr, it is quite
convenient, because you can set a hardware watchpoint on your faulty
pointer (watch -l), do a reverse-continue, and in most cases, you'll
directly reach where the bug happened. Feels like cheating.

rr is the first thing I tried. Nope, it's absolutely hopeless. It
tried to boot just the kernel for over 30 minutes, after which I just
gave up.


I had a similar thing to debug recently, and with a simple loop, I couldn't expose it easily. The bug I had was triggered with 3% probability, which seems close to yours. As rr record -h is single-threaded, I found it useful to write a wrapper script [1] to run one instance, and then run it in parallel using:
./run_one.sh | head -n 10000 | parallel --bar -j$(nproc)

With that, I could expose the bug in 2 minutes reliably (vs trying for more than one hour before). With your 64 cores, I'm sure it will quickly expose it.

Might be worth a try, as you only need to catch the bug once to be able to reproduce it.
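
For illustration only, a wrapper along these lines could look like the sketch below. It is not the script linked as [1]. The names RECORDER, WORKLOAD, TRACE_ROOT, record_once and emit_jobs are made up for this example, and ./my_test.sh is a placeholder for whatever command triggers the bug. It also assumes that a failing run shows up as a non-zero exit status of the recorder.

```shell
#!/bin/sh
# Hypothetical run_one.sh-style wrapper (a sketch, not the script from [1]).
# All variable and function names here are illustrative assumptions.

RECORDER=${RECORDER:-"rr record -h"}    # -h turns on rr's chaos mode
WORKLOAD=${WORKLOAD:-"./my_test.sh"}    # placeholder for the failing command
TRACE_ROOT=${TRACE_ROOT:-"$PWD/traces"} # where kept recordings end up

# Record a single run in its own directory, so that parallel jobs never
# share a trace directory. The directory is kept only if the run failed.
record_once() {
    mkdir -p "$TRACE_ROOT"
    dir=$(mktemp -d "$TRACE_ROOT/run.XXXXXX") || return 1
    status=0
    # _RR_TRACE_DIR tells rr where to store the recording.
    _RR_TRACE_DIR="$dir" $RECORDER $WORKLOAD >"$dir/log.txt" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        # Assumption: any non-zero status is a failure worth keeping.
        echo "failure (status $status), trace kept in $dir"
    else
        rm -rf "$dir"
    fi
}

# Print N command lines, one job per line, for GNU parallel to consume.
emit_jobs() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "sh $0 once"
        i=$((i + 1))
    done
}

# Dispatch: "once" records one run, "emit N" prints N job lines.
case "${1:-}" in
    once) record_once ;;
    emit) emit_jobs "${2:-1}" ;;
esac
```

With such a script, the pipeline becomes ./run_one.sh emit 10000 | parallel --bar -j$(nproc), and any directory left under TRACE_ROOT holds a recording that can be passed to rr replay.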

[1] https://github.com/pbo-linaro/qemu/blob/master/try_rme.sh

Thanks,

/mjt

