On Wed, May 14, 2025 at 02:07:04PM +0200, Natanael Copa wrote:
> [Adding Rich Felker to CC]
> 
> On Tue, 13 May 2025 18:05:50 +0200
> Bruno Haible <br...@clisp.org> wrote:
> 
> > Natanael Copa wrote:
> > > > So, you could try to install a different scheduler by default and repeat
> > > > the test.  
> > > 
> > > It passed with chrt --fifo (I had to do it from outside the LXC 
> > > container):
> > > 
> > > # time chrt --fifo 10 ./test-pthread-rwlock
> > > Starting test_rwlock ... OK
> > > real      0m 33.00s
> > > user      6m 50.63s
> > > sys       0m 16.23s
> > > 
> > > I also verified that it still times out from outside the LXC container 
> > > with the default:
> > > 
> > > # time ./test-pthread-rwlock
> > > Starting test_rwlock ...Command terminated by signal 14
> > > real      10m 0.01s
> > > user      1h 46m 24s
> > > sys       2m 59.39s
> > > 
> > > 
> > > # time chrt --rr 10 ./test-pthread-rwlock
> > > Starting test_rwlock ... OK
> > > real      0m 30.00s
> > > user      6m 2.07s
> > > sys       0m 19.16s
> > > 
> > > # time chrt --rr 99 ./test-pthread-rwlock
> > > Starting test_rwlock ... OK
> > > real      0m 30.00s
> > > user      6m 9.40s
> > > sys       0m 13.37s
> > > 
> > > So even if the CPU cores are slow, the test appears to finish in ~30 sec.
> > > 
> > > chrt --other and chrt --idle appear to trigger the deadlock.
> > 
> > For comparison, some other data (below the details):
> > 
> > * On x86_64 (glibc), I see essentially no influence of the scheduling
> >   policy on 'time ./test-pthread-rwlock'.
> > 
> > * On x86_64 (Alpine Linux), the test performs about 25% faster
> >   under SCHED_FIFO and SCHED_RR.
> > 
> > * On three other riscv64 systems, the test needs less than 4 seconds
> >   real time. Even on my QEMU-emulated riscv64 VM, it needs less
> >   than 4 seconds.
> > 
> > So, it seems that
> >   1) Your riscv64 system is generally slower than the cfarm* ones.
> >   2) The performance penalty of SCHED_OTHER compared to SCHED_FIFO and
> >      SCHED_RR exists also on x86_64, but not to such an extreme extent.
> > 
> > AFAICS, there are three differences in your setup compared to what I
> > see in stock Ubuntu:
> >   - Linux is of a PREEMPT_DYNAMIC flavour.
> >   - musl libc.
> >   - the LXC container.
> 
> Note that this machine has 64 CPU cores. The only other architecture where
> I have tested with that many cores is aarch64.
> 
> I don't think the LXC container should matter, nor should apps deadlock
> when running on PREEMPT_DYNAMIC.
> 
> I'm not sure what the difference is in codepaths compared to GNU libc.
> 
> I also don't get a timeout on an hifive premier p550 system:

Do you even know yet whether this is a deadlock or just a timeout from
the test taking an inordinately long time?

Watching whether the hung process is still executing (e.g. even with
strace) would be a good start.
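
For example, something like this (the process name here is an assumption;
also note strace -p takes a single PID, so adjust if pgrep matches more
than one process):

# strace -f -p $(pgrep test-pthread-rwlock)

If every thread just sits in a futex() wait and no further output ever
appears, that looks like a deadlock; an endless stream of futex calls
would instead suggest the test is making very slow progress.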

If it is deadlocked we really need to look at the deadlocked state in
the debugger (what all the threads are blocked waiting on, so at least
a backtrace for each thread) to determine where the fault lies.
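
Something along these lines should do it (assuming gdb is installed and
usable inside the container):

# gdb -p $(pidof test-pthread-rwlock)
(gdb) thread apply all bt

That shows what each thread is blocked in, e.g. whether they are all
waiting on the same futex or condition variable.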

> I suspect the deadlock happens when:
> 
> - musl libc is in use
> - there are more than 10 cores(?)
> - the CPU cores are slow(?)
> 
> I'm not sure which codepath it takes on GNU libc systems. Is it the
> same as with musl libc?
> 
> > Note: Most Gnulib applications don't use pthread_rwlock directly, but
> > the glthread_rwlock facility. On musl systems, it works around the
> > possible writer starvation by reimplementing read-write locks based
> > on condition variables. This may be slower for a single operation,
> > but it is guaranteed to avoid writer starvation and therefore is
> > preferable globally. This is why you don't see a timeout in
> > './test-lock', only in './test-pthread-rwlock'.
> 
> Wait a second. The test does not run the gnulib locking? It just tests
> the system (musl libc) pthread rwlock, while the app (gettext) would
> use the gnulib implementation?
> 
> I thought the test verified that production code (gettext in this case)
> works as intended. Does this test expose a deadlock that could happen
> in gettext in production?
> 
> I'm confused.

AFAICT the code is in tests/test-lock.c from the gnulib repo and calls
the gl_rwlock_* functions, which should be using the gnulib condvar-based
implementation.
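
For reference, here is a minimal sketch of the general technique Bruno
describes: a read-write lock built from a mutex plus condition variables
that gives waiting writers priority. All names are made up for
illustration; this is not gnulib's actual code:

#include <pthread.h>

typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t readers_ok;   /* signalled when blocked readers may proceed */
  pthread_cond_t writer_ok;    /* signalled when one blocked writer may proceed */
  int active_readers;          /* readers currently inside the lock */
  int waiting_writers;         /* writers currently queued */
  int writer_active;           /* nonzero while a writer is inside the lock */
} sketch_rwlock_t;

#define SKETCH_RWLOCK_INIT { PTHREAD_MUTEX_INITIALIZER, \
  PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

void sketch_rdlock (sketch_rwlock_t *rw)
{
  pthread_mutex_lock (&rw->lock);
  /* The anti-starvation part: new readers also wait while any writer is
     queued, so a steady stream of readers cannot lock writers out.  */
  while (rw->writer_active || rw->waiting_writers > 0)
    pthread_cond_wait (&rw->readers_ok, &rw->lock);
  rw->active_readers++;
  pthread_mutex_unlock (&rw->lock);
}

void sketch_rdunlock (sketch_rwlock_t *rw)
{
  pthread_mutex_lock (&rw->lock);
  if (--rw->active_readers == 0 && rw->waiting_writers > 0)
    pthread_cond_signal (&rw->writer_ok);
  pthread_mutex_unlock (&rw->lock);
}

void sketch_wrlock (sketch_rwlock_t *rw)
{
  pthread_mutex_lock (&rw->lock);
  rw->waiting_writers++;
  while (rw->writer_active || rw->active_readers > 0)
    pthread_cond_wait (&rw->writer_ok, &rw->lock);
  rw->waiting_writers--;
  rw->writer_active = 1;
  pthread_mutex_unlock (&rw->lock);
}

void sketch_wrunlock (sketch_rwlock_t *rw)
{
  pthread_mutex_lock (&rw->lock);
  rw->writer_active = 0;
  if (rw->waiting_writers > 0)
    pthread_cond_signal (&rw->writer_ok);      /* hand off to next writer */
  else
    pthread_cond_broadcast (&rw->readers_ok);  /* let all readers in */
  pthread_mutex_unlock (&rw->lock);
}

The while loop in sketch_rdlock is the point: new readers queue up behind
any waiting writer, so a continuous stream of readers cannot starve a
writer, at the cost of somewhat higher overhead per operation (as Bruno
notes).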

Rich
