Problem: A common use care for OMP is to accelerate the internal workings of an otherwise serial interface. For example, OpenBLAS in some settings will internally use OMP to accelerate the implementation of matrix-matrix multiply (DGEMM). When DGEMM is called, then an OMP section is started, the work is done, then the OMP section exits, the program returns to serial mode, and DGEMM returns. All this is entirely transparent to the user -- in fact, it's common for users to switch between different linear algebra cores (BLAS libraries) without recompiling, so it's impossible for code that uses linear algebra to know which underlying library is in use, or how it has been compiled.
However, in order to support some corners of the OMP spec, it is important that the threads that were started to implement an OMP parallel section be kept around, in case another OMP section has started. (AFAICT this is only true when "threadprivate" variables are in use. Unfortunately AFAICT there is currently no way to determine whether this is the case -- such variables are handled directly by GCC without calling into libgomp, so we can't tell at runtime whether they exist.) And, this causes a big problem and abstraction leak: it means that if you use OMP (e.g., by multiplying two matrices), and then fork, and then the child also uses OMP (e.g., by also multiplying two matrices), then the child immediately deadlocks (as OMP waits for threads that it thinks still exist, but that disappeared during the fork). The result is that it simply *is not possible to know* whether fork() will actually work as advertised, even when writing purely serial code, if that code happens to do seemingly innocent things like linear algebra. And this then ends up causing surprising wreakage in far-flung parts of the numerical ecosystem (e.g., here's someone trying to figure figure out why their web site's task manager crashes whenever they try to plot a graph: https://github.com/celery/celery/issues/1842). (Somewhat more impassioned rant and references to previous discussions here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035) In practice, GOMP seems to be the only OMP implementation that suffers from this problem; people who encounter this problem are often advised to switch to icc. There does not appear to be any fully POSIX-compliant way to solve this problem (not least because in a strict reading of the POSIX spec, you aren't guaranteed to be able to do practically *anything* after a fork() in any program which has ever called a pthreads_* function). In a less strict reading, we might expect to be okay if no threads are actually running at the time that fork() is called -- but, we can't shut down OMP threads before forking, because of the issue with threadprivate variables -- it might change the behaviour of compliant programs. But in practice, if the fork() occurs at a time when every thread is just sitting waiting on a barrier, then we can be pretty sure that libc etc. will be in a generally thread-consistent state. And in practice, the few truly dangerous operations we need to clean up after the fact -- e.g., destroying that barrier -- do seem to work, at least on Linux. The attached patch, therefore, takes this strategy. Crucially, it should have no impact on compliant programs, because it doesn't actually do anything except set/check a single global variable until the user actually enters an OMP section in the child, at which case they have already violated POSIX, so we might as well cross our fingers and hope for the best. (At the very least, the included test does fail on Linux x86-64 without the patch, and passes with the patch.) Other options that might be worth considering: -- Adding some way for libgomp to determine whether threadprivate variables are in use, and then using this information to shut down threads in a pre-fork handler iff doing so is safe. -- Instead of trying to clean up the various mutex/barrier/semaphore detritus left in the child by the evaporating threads, we could simply leak them. I don't know which is worse in practice: a small leak (once per child process), or the risk that the various *_destroy functions will blow up (as POSIX allows them to do). ChangeLog: 2014-02-12 Nathaniel J. Smith <n...@pobox.com> * team.c (gomp_free_pool_helper): Move per-thread cleanup to main thread. (gomp_free_thread): Delegate implementation to... (gomp_free_thread_pool): ...this new function. Like old gomp_free_thread, but does per-thread cleanup, and has option to skip everything that involves interacting with actual threads, which is useful when called after fork. (gomp_after_fork_callback): New function. (gomp_team_start): Register atfork handler, and check for fork on entry. -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Index: testsuite/libgomp.c/fork-1.c =================================================================== --- testsuite/libgomp.c/fork-1.c (revision 0) +++ testsuite/libgomp.c/fork-1.c (working copy) @@ -0,0 +1,77 @@ +/* { dg-do run } */ +/* { dg-timeout 10 } */ + +#include <omp.h> +#include <string.h> +#include <sys/wait.h> +#include <unistd.h> +#include <assert.h> + +static int saw[4]; + +static void +check_parallel (int exit_on_failure) +{ + memset (saw, 0, sizeof (saw)); + #pragma omp parallel num_threads (2) + { + int iam = omp_get_thread_num (); + saw[iam] = 1; + } + + // Encode failure in status code to report to parent process + if (exit_on_failure) + { + if (saw[0] != 1) + _exit(1); + else if (saw[1] != 1) + _exit(2); + else if (saw[2] != 0) + _exit(3); + else if (saw[3] != 0) + _exit(4); + else + _exit(0); + } + // Use regular assertions + else + { + assert (saw[0] == 1); + assert (saw[1] == 1); + assert (saw[2] == 0); + assert (saw[3] == 0); + } +} + +int +main () +{ + // Initialize the OMP thread pool in the parent process + check_parallel (0); + pid_t fork_pid = fork(); + if (fork_pid == -1) + return 1; + else if (fork_pid == 0) + { + // Call OMP again in the child process and encode failures in exit + // code. + check_parallel (1); + } + else + { + // Check that OMP runtime is still functional in parent process after + // the fork. + check_parallel (0); + + // Wait for the child to finish and check the exit code. + int child_status = 0; + pid_t wait_pid = wait(&child_status); + assert (wait_pid == fork_pid); + assert (WEXITSTATUS (child_status) == 0); + + // Check that the termination of the child process did not impact + // OMP in parent process. + check_parallel (0); + } + return 0; +} Index: team.c =================================================================== --- team.c (revision 207398) +++ team.c (working copy) @@ -43,6 +43,8 @@ __thread struct gomp_thread gomp_tls_data; pthread_key_t gomp_tls_key; #endif +/* This is to enable best-effort cleanup after fork. */ +static int gomp_we_are_forked = 0; /* This structure is used to communicate across pthread_create. */ @@ -204,42 +206,41 @@ static struct gomp_thread_pool *gomp_new_thread_po return pool; } +/* Free a thread pool and release its threads. */ + static void gomp_free_pool_helper (void *thread_pool) { - struct gomp_thread *thr = gomp_thread (); struct gomp_thread_pool *pool = (struct gomp_thread_pool *) thread_pool; gomp_barrier_wait_last (&pool->threads_dock); - gomp_sem_destroy (&thr->release); - thr->thread_pool = NULL; - thr->task = NULL; pthread_exit (NULL); } -/* Free a thread pool and release its threads. */ - -void -gomp_free_thread (void *arg __attribute__((unused))) +static void +gomp_free_thread_pool (int threads_running) { struct gomp_thread *thr = gomp_thread (); struct gomp_thread_pool *pool = thr->thread_pool; if (pool) { + int i; if (pool->threads_used > 0) { - int i; - for (i = 1; i < pool->threads_used; i++) + if (threads_running) { - struct gomp_thread *nthr = pool->threads[i]; - nthr->fn = gomp_free_pool_helper; - nthr->data = pool; + for (i = 1; i < pool->threads_used; i++) + { + struct gomp_thread *nthr = pool->threads[i]; + nthr->fn = gomp_free_pool_helper; + nthr->data = pool; + } + /* This barrier undocks threads docked on pool->threads_dock. */ + gomp_barrier_wait (&pool->threads_dock); + /* And this waits till all threads have called + gomp_barrier_wait_last in gomp_free_pool_helper. */ + gomp_barrier_wait (&pool->threads_dock); } - /* This barrier undocks threads docked on pool->threads_dock. */ - gomp_barrier_wait (&pool->threads_dock); - /* And this waits till all threads have called gomp_barrier_wait_last - in gomp_free_pool_helper. */ - gomp_barrier_wait (&pool->threads_dock); /* Now it is safe to destroy the barrier and free the pool. */ gomp_barrier_destroy (&pool->threads_dock); @@ -251,6 +252,14 @@ gomp_free_pool_helper (void *thread_pool) gomp_managed_threads -= pool->threads_used - 1L; gomp_mutex_unlock (&gomp_managed_threads_lock); #endif + /* Clean up thread objects */ + for (i = 1; i < pool->threads_used; i++) + { + struct gomp_thread *nthr = pool->threads[i]; + gomp_sem_destroy (&nthr->release); + nthr->thread_pool = NULL; + nthr->task = NULL; + } } free (pool->threads); if (pool->last_team) @@ -266,6 +275,58 @@ gomp_free_pool_helper (void *thread_pool) } } +/* This is called whenever a thread exits which has a non-NULL value for + gomp_thread_destructor. In practice, the only thread for which this occurs + is the one which created the thread pool. +*/ +void +gomp_free_thread (void *arg __attribute__((unused))) +{ + gomp_free_thread_pool (1); +} + +/* This is called in the child process after a fork. + + According to POSIX, if a process which uses threads calls fork(), then + there are very few things that the resulting child process can do safely -- + mostly just exec(). + + However, in practice, (almost?) all POSIX implementations seem to allow + arbitrary code to run inside the child, *if* the parent process's threads + are in a well-defined state when the fork occurs. And this circumstance can + easily arise in OMP-using programs, e.g. when a library function like DGEMM + uses OMP internally, and some other unrelated part of the program calls + fork() at some other time, when no OMP sections are running. + + Therefore, we make a best effort attempt to handle the case: + + OMP section (in parent) -> quiesce -> fork -> OMP section (in child) + + "Best-effort" here means that: + - Your system may or may not be able to handle this kind of code at all; + our goal is just to make sure that if it fails it's not gomp's fault. + - All threadprivate variables will be reset in the child. Fortunately this + is entirely compliant with the spec, according to the rule of nasal + demons. + - We must have minimal speed impact, and no correctness impact, on + compliant programs. + + We use this callback to notice when a fork has a occurred, and if the child + later attempts to enter an OMP section (via gomp_team_start), then we know + that it is non-compliant, and are free to apply our best-effort strategy of + cleaning up the old thread pool structures and spawning a new one. Because + compliant programs never call gomp_team_start after forking, they are + unaffected. +*/ +static void +gomp_after_fork_callback () +{ + /* Only "async-signal-safe operations" are allowed here, so let's keep it + simple. No mutex is needed, because we are currently single-threaded. + */ + gomp_we_are_forked = 1; +} + /* Launch a team. */ void @@ -288,11 +349,19 @@ gomp_team_start (void (*fn) (void *), void *data, thr = gomp_thread (); nested = thr->ts.team != NULL; + if (__builtin_expect (gomp_we_are_forked, 0)) + { + gomp_free_thread_pool (0); + gomp_we_are_forked = 0; + } if (__builtin_expect (thr->thread_pool == NULL, 0)) { thr->thread_pool = gomp_new_thread_pool (); thr->thread_pool->threads_busy = nthreads; + /* The pool should be cleaned up whenever this thread exits... */ pthread_setspecific (gomp_thread_destructor, thr); + /* ...and also in any fork()ed children. */ + pthread_atfork (NULL, NULL, &gomp_after_fork_callback); } pool = thr->thread_pool; task = thr->task;