Package: libc6
Version: 2.25-3
Severity: important
Tags: upstream

TL;DR: pthread_cond_broadcast() on a process shared condition variable will block indefinitely when another process that pthread_cond_wait()'ed on this condition gets killed and restarted.

starting with libc6:amd64 version 2.25-3 from debian testing/buster, our in-house robotics communication middleware (links_and_nodes) no longer behaves as expected: this middleware uses process shared mutexes and condition variables in shared memory (pthread_mutex_t, pthread_cond_t, shm_open) for synchronization between processes.

attached is a simple self-contained test-case where a "publisher"-process repeatedly increments a counter in shm (while holding a pshared mutex) and then broadcasts the condition. another process (let's call it "subscriber") blocks waiting on that same condition variable (with the same mutex).

this works as expected, until the subscriber is killed/terminated by any signal while it is waiting. when the subscriber is then started a 2nd time, the publisher gets blocked in its call to pthread_cond_broadcast()!

with glibc <= 2.24 this caused no problems. i saw that there is a new condvar impl in glibc 2.25 -- so this is probably something for upstream.

with the attached test case i can reproduce this problem with debian's libc6 version 2.25-3 (buster), 2.25-5 (sid) and 2.26-0experimental2 (experimental). on stable with 2.24-11+deb9u1 this problem does not occur!

attached is a Makefile, main.cpp and howto-reproduce.txt which should be all that's needed to reproduce it (tested with gcc version 7.2.1 20171205 (Debian 7.2.0-17) and others...):

 $ make
 $ ./condvar-test publisher

in another terminal:

 $ ./condvar-test subscriber

all works fine. now kill the subscriber via a signal, e.g. SIGINT with Ctrl-C. publisher is still happy. now restart the subscriber:

 $ ./condvar-test subscriber

this will cause the publisher to get blocked in pthread_cond_broadcast()!

known workarounds: when the subscriber gets killed/terminated anywhere outside of the critical section / not while blocked in _wait(), the problem does not occur! e.g. capturing the signal, and then doing a clean shutdown after pthread_cond_wait() has returned.
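
the workaround described above could be sketched roughly like this. this is not part of the attached test case: the names stop_requested, on_signal and wait_for_change are made up for illustration, the 100 ms poll interval is arbitrary, and it assumes a condition variable that uses CLOCK_REALTIME (as the test case sets up). it uses a plain pthread_mutex_lock() without the EOWNERDEAD handling of the test case, and it can only help for signals that can be caught (so not SIGKILL).

```cpp
#include <csignal>
#include <ctime>
#include <pthread.h>

// Hypothetical flag, set from the signal handler (not in main.cpp).
static volatile std::sig_atomic_t stop_requested = 0;

// Hypothetical handler: only records that a shutdown was requested.
extern "C" void on_signal(int) { stop_requested = 1; }

// Hypothetical helper: wait until *value differs from last, or until a
// stop was requested. Returns true if a new value was seen. The caller
// can then exit cleanly, outside of pthread_cond_timedwait().
bool wait_for_change(pthread_mutex_t* mutex, pthread_cond_t* cond,
                     const unsigned int* value, unsigned int last)
{
	pthread_mutex_lock(mutex);
	while(*value == last && !stop_requested) {
		// Wake up every 100 ms (arbitrary) to re-check the stop flag.
		timespec deadline;
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_nsec += 100 * 1000 * 1000;
		if(deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec += 1;
			deadline.tv_nsec -= 1000000000L;
		}
		// ETIMEDOUT is expected here and simply re-enters the loop.
		pthread_cond_timedwait(cond, mutex, &deadline);
	}
	bool changed = (*value != last);
	pthread_mutex_unlock(mutex);
	return changed;
}
```

a subscriber would install on_signal for the signals it cares about (e.g. with signal() or sigaction()), call wait_for_change() in its loop and return from main() once it yields false. the price is a periodic wakeup in every waiter.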

relying on such a workaround will be a major problem for us, because this synchronization is provided by means of a shared library, and we can hardly control how processes terminate. (telling the average user how to do signal handling is also not very convincing, and letting the library catch any/all signals to be able to return cleanly from _wait() is not a good option either...)

or is this usage of pthread_mutex,_cond considered to be bad/wrong? how? still, it's an unexpected change in behaviour and i currently don't see a clean way to solve this.

(i could also successfully reproduce this issue on different machines with glibc >= 2.25, also with older 3.x and newer 4.9 kernels)

-- System Information:
Debian Release: buster/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 4.8.0-rt1+ (SMP w/4 CPU cores; PREEMPT)
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=C.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: unable to detect

Versions of packages libc6 depends on:
ii  libgcc1  1:7.2.0-17

libc6 recommends no packages.

Versions of packages libc6 suggests:
ii  debconf [debconf-2.0]  1.5.65
pn  glibc-doc              <none>
pn  libc-l10n              <none>
pn  locales                <none>

-- debconf information excluded

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>

#define SHM_NAME "test_shm"

#define check_error_ret_null(func, ...) do { \
		int ret = func(__VA_ARGS__); \
		if(ret != 0) { \
			fprintf(stderr, "error calling " #func "(): %d %s!\n", ret, strerror(ret)); \
			return NULL; \
		} \
	} while(0)

unsigned int t = 10;

typedef struct {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	unsigned int triggered;
} shm_t;

shm_t* mmap_fd(int fd) {
	shm_t* ret = (shm_t*)mmap(NULL, sizeof(shm_t), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE | MAP_LOCKED, fd, 0);
	close(fd);
	if((void*)ret == MAP_FAILED) {
		fprintf(stderr, "could not mmap: %d %s!\n", errno, strerror(errno));
		return NULL;
	}
	return ret;
}

shm_t* create_shm() {
	int fd = shm_open(SHM_NAME, O_RDWR | O_CREAT, 0777);
	if(fd == -1) {
		fprintf(stderr, "could not shm_open(" SHM_NAME ", O_CREAT): %d, %s\n", errno, strerror(errno));
		return NULL;
	}
	if(ftruncate(fd, sizeof(shm_t))) {
		fprintf(stderr, "could not ftruncate shm %d to %d\n", fd, (unsigned int)sizeof(shm_t));
		return NULL;
	}
	shm_t* shm = mmap_fd(fd);
	if(!shm)
		return NULL;
	memset(shm, 0, sizeof(shm_t));

	{ // init mutex
		pthread_mutexattr_t attr;
		check_error_ret_null(pthread_mutexattr_init, &attr);
		check_error_ret_null(pthread_mutexattr_setpshared, &attr, PTHREAD_PROCESS_SHARED); // pshared!
		check_error_ret_null(pthread_mutexattr_setrobust_np, &attr, PTHREAD_MUTEX_ROBUST); // if owner dies _lock() returns EOWNERDEAD
		check_error_ret_null(pthread_mutexattr_setprotocol, &attr, PTHREAD_PRIO_INHERIT);
		check_error_ret_null(pthread_mutex_init, &shm->mutex, &attr);
		check_error_ret_null(pthread_mutexattr_destroy, &attr);
	}

	{ // init condition variable
		pthread_condattr_t attr;
		check_error_ret_null(pthread_condattr_init, &attr);
		check_error_ret_null(pthread_condattr_setpshared, &attr, PTHREAD_PROCESS_SHARED); // pshared!
		check_error_ret_null(pthread_condattr_setclock, &attr, CLOCK_REALTIME);
		check_error_ret_null(pthread_cond_init, &shm->cond, &attr);
		check_error_ret_null(pthread_condattr_destroy, &attr);
	}
	return shm;
}

shm_t* open_shm() {
	int fd = shm_open(SHM_NAME, O_RDWR, 0777);
	if(fd == -1) {
		fprintf(stderr, "could not shm_open(" SHM_NAME "'): %d, %s\n", errno, strerror(errno));
		return NULL;
	}
	return mmap_fd(fd);
}

int publisher() {
	shm_t* shm = create_shm();
	if(!shm)
		return 1;

	while(true) {
		int ret = pthread_mutex_lock(&shm->mutex);
		if(ret != 0) {
			fprintf(stderr, "pthread_mutex_lock() returned %d: %s\n", ret, strerror(ret));
			if(ret == EOWNERDEAD)
				pthread_mutex_consistent_np(&shm->mutex);
			else
				return 1;
		}

		shm->triggered ++; // do something
		printf("tick %d!\n", shm->triggered);

		pthread_mutex_unlock(&shm->mutex);
		pthread_cond_broadcast(&shm->cond);

		sleep(1);
	}
	return 0;
}

int subscriber() {
	shm_t* shm = open_shm();
	if(!shm)
		return 1;

	unsigned int last_triggered = shm->triggered;
	while(true) {
		int ret = pthread_mutex_lock(&shm->mutex);
		if(ret != 0) {
			fprintf(stderr, "pthread_mutex_lock() returned %d: %s\n", ret, strerror(ret));
			if(ret == EOWNERDEAD)
				pthread_mutex_consistent_np(&shm->mutex);
			else
				return 1;
		}

		while(shm->triggered == last_triggered) {
			pthread_cond_wait(&shm->cond, &shm->mutex);
			if(shm->triggered == last_triggered)
				printf(" ...spurious wakeup\n");
		}
		last_triggered = shm->triggered;

		pthread_mutex_unlock(&shm->mutex);

		printf("tock %d\n", last_triggered);
	}
	return 0;
}

int do_start(const char* what) {
	if(!strcmp(what, "publisher"))
		return publisher();
	if(!strcmp(what, "subscriber"))
		return subscriber();
	fprintf(stderr, "invalid arg: '%s'\n", what);
	return 1;
}

int main(int argc, char* argv[]) {
	if(argc != 2 || do_start(argv[1])) {
		printf("usage:\n"
		       "first start publisher:\n"
		       " %s publisher\n"
		       "then, in another shell start subscriber:\n"
		       " %s subscriber\n"
		       "... this should work as expected.\n"
		       "now kill the subscriber and restart it:\n"
		       " %s subscriber\n"
		       "on glibc-2.25 it will block itself AND the publisher!\n",
		       argv[0], argv[0], argv[0]);
		return 1;
	}
	return 0;
}

i can reproduce the following problem on a debian testing(buster) with

 $ dpkg -s libc6:amd64 | grep Version
 Version: 2.25-3

the same test executed on a debian stable(stretch) with

 $ dpkg -s libc6:amd64 | grep Version
 Version: 2.24-11+deb9u1

does not show any problems and works as expected.

kernel is 4.8.0 on an intel core2 quad cpu with ht disabled.
gcc version 7.2.1 20171205 (Debian 7.2.0-17).

## how to reproduce

build the executable by calling make:

 $ make

start the publisher, it will create a shared-memory segment with a process shared mutex and condition variable:

 $ ./condvar-test publisher

this publisher will broadcast the condition every second with a new counter value.

in another terminal start the subscriber. it will grab the mutex and, if there is no new counter value, do a blocking wait on the condition variable until it sees a new counter value. this value is then stored, the mutex unlocked and then the stored value is printed:

 $ ./condvar-test subscriber

this is the normal case and it should work as expected.

now stop the subscriber by sending an appropriate signal, you could for example press Ctrl-C on your terminal to send SIGINT:

 ...
 tock XY
 ^C
 $

now the publisher keeps running, which is fine. my expectation would now be that i can restart the subscriber and it would again print counter values. but with glibc-2.25 it does not:

 $ ./condvar-test subscriber

it does not output anything, but what's even worse is that now the publisher blocks in the call to pthread_cond_broadcast()!

a gdb backtrace of the publisher process looks like this:

 (gdb) bt
 #0  0x00007f23ce759847 in futex_wait (private=<optimized out>, expected=3, futex_word=0x7f23cf638038)
     at ../sysdeps/unix/sysv/linux/futex-internal.h:61
 #1  futex_wait_simple (private=<optimized out>, expected=3, futex_word=0x7f23cf638038)
     at ../sysdeps/nptl/futex-internal.h:135
 #2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7f23cf638028)
     at pthread_cond_common.c:413
 #3  __pthread_cond_broadcast (cond=0x7f23cf638028) at pthread_cond_broadcast.c:73
 #4  0x0000563d0df3f755 in publisher () at main.cpp:109
 #5  0x0000563d0df3f895 in do_start (what=0x7ffe636bdc07 "publisher") at main.cpp:146
 #6  0x0000563d0df3f97e in main (argc=2, argv=0x7ffe636bd1c8) at main.cpp:167

while the subscriber process hangs (as somewhat expected) in pthread_cond_wait():

 (gdb) bt
 #0  0x00007ffff7118b26 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffff7ff4054)
     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffff7ff4000, cond=0x7ffff7ff4028)
     at pthread_cond_wait.c:502
 #2  __pthread_cond_wait (cond=0x7ffff7ff4028, mutex=0x7ffff7ff4000) at pthread_cond_wait.c:655
 #3  0x0000555555555820 in subscriber () at main.cpp:131
 #4  0x00005555555558b3 in do_start (what=0x7fffffffebfc "subscriber") at main.cpp:148
 #5  0x000055555555597e in main (argc=2, argv=0x7fffffffe978) at main.cpp:167

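since dpkg only shows the installed package, it can help to print the glibc version a process actually runs against when comparing machines. a minimal sketch, using the glibc-specific gnu_get_libc_version(); the helper names is_glibc_2_25_or_newer and report_glibc are made up for illustration, and the 2.25 threshold is only the boundary observed in this report (2.24 fine, 2.25 and 2.26 hang), not a statement about every later version:

```cpp
#include <cstdio>
#include <gnu/libc-version.h> // glibc-specific: gnu_get_libc_version()

// Hypothetical helper: true if a "major.minor" version string is >= 2.25,
// i.e. at or above the first version where this report saw the hang.
bool is_glibc_2_25_or_newer(const char* version)
{
	int major = 0, minor = 0;
	if(std::sscanf(version, "%d.%d", &major, &minor) != 2)
		return false; // unparsable string: make no claim
	return major > 2 || (major == 2 && minor >= 25);
}

// Hypothetical helper: print the runtime glibc version and on which side
// of the observed 2.24 / 2.25 boundary it falls.
void report_glibc()
{
	const char* version = gnu_get_libc_version();
	std::printf("running against glibc %s (%s)\n", version,
	            is_glibc_2_25_or_newer(version) ? "2.25 or newer" : "older than 2.25");
}
```

calling report_glibc() at the start of publisher() and subscriber() would put this information right next to the tick/tock output.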
CXXFLAGS ?= -Wall -g -O0 -pthread
LDFLAGS ?= -pthread -lrt

condvar-test: main.cpp Makefile
	$(CXX) -o $@ $< $(CXXFLAGS) $(LDFLAGS)