[issue34781] infinite waiting in multiprocessing.Pool
New submission from Tomáš Bouda:

I have encountered a possible bug inside multiprocessing.Pool that behaves like a race condition, although I don't believe it is a typical one. Simply put, Pool freezes from time to time. It is occasional and hard to reproduce, but e.g. unit tests that run three times a day freeze several times a week. We use Pool heavily in our applications, usually with tens of workers and a heavy load on each of them. The production environment uses Python 2.7 (RHEL, custom build, etc.). However, I reproduced the same behavior with Python 3.6 (OSX) on my local machine. When I run the following script about 20 times, I get one or two frozen instances. You may notice in the output that ForkPoolWorker-42 never calls self.run(). The application then freezes as-is, since it is probably waiting for that process. The behavior is easier to reproduce with a debugger attached (PyCharm Pro in my case); in our production environment there is just a clean run, but the bug occurs more often there since multiprocessing is used quite a lot.

Thanks, Tomas

---

My script:

import logging
from multiprocessing.pool import Pool
from multiprocessing.util import log_to_stderr

def f(i):
    print(i)

log_to_stderr(logging.DEBUG)
pool = Pool(50)
pool.map(f, range(2))
pool.close()
pool.join()

---

Output:

[DEBUG/MainProcess] created semlock with handle 9
[DEBUG/MainProcess] created semlock with handle 10
[DEBUG/MainProcess] created semlock with handle 13
[DEBUG/MainProcess] created semlock with handle 14
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-1] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-2] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-3] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-4] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-5] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-6] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-7] child process calling self.run()
[INFO/ForkPoolWorker-9] child process calling self.run()
[INFO/ForkPoolWorker-10] child process calling self.run()
[INFO/ForkPoolWorker-8] child process calling self.run()
[INFO/ForkPoolWorker-12] child process calling self.run()
[INFO/ForkPoolWorker-13] child process calling self.run()
[INFO/ForkPoolWorker-11] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-14] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-15] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-16] child process calling self.run()
[INFO/ForkPoolWorker-17] child process calling self.run()
[INFO/ForkPoolWorker-18] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-19] child process calling self.run()
[INFO/ForkPoolWorker-20] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-21] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-22] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-23] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-24] child process calling self.run()
[INFO/ForkPoolWorker-25] child process calling self.run()
[INFO/ForkPoolWorker-26] child process calling self.run()
[INFO/ForkPoolWorker-27] child process calling self.run()
[INFO/ForkPoolWorker-28] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-29] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-30] child process calling self.run()
[INFO/ForkPoolWorker-31] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-32] child process calling self.run()
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-33] child process calling self.run()
[INFO/ForkPoolWorker-34] child process calling self.run()
[INFO/ForkPoolWorker-35] child process calling self.run()
[INFO/ForkPoolWorker-36] child process calling self.run()
[INFO/ForkPoolWorker-37] child process calling self.run()
[INFO/ForkPoolWorker-38] child process calling self.run()
[DEBUG/MainPr
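For illustration, here is a variant of the script above (not the reporter's original; the 30-second timeout is an arbitrary choice) that uses map_async() with a timeout, so a frozen pool surfaces as multiprocessing.TimeoutError instead of blocking the caller forever:

---

import logging
from multiprocessing import TimeoutError
from multiprocessing.pool import Pool
from multiprocessing.util import log_to_stderr

def f(i):
    print(i)

if __name__ == '__main__':
    log_to_stderr(logging.DEBUG)
    pool = Pool(50)
    try:
        # get() raises multiprocessing.TimeoutError if the workers never finish
        pool.map_async(f, range(2)).get(timeout=30)
    except TimeoutError:
        print('pool appears to be stuck; terminating workers')
        pool.terminate()
    else:
        pool.close()
    pool.join()

---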
Tomáš Bouda added the comment:

After more digging, I found that the following happens in popen_fork.py -> _launch(self, process_obj) -> self.pid = os.fork().

When I let the process (both child and parent) print the resulting pid, on a freezing run I see:

a) 50 times pid > 0
b) 49 times pid == 0

That means the parent is aware of 50 children, while only 49 of them get to the next line. I am not sure whether the one remaining process crashes with a segfault, but the parent apparently hangs later in os.waitpid() on this valid pid of the missing child.
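For reference, a minimal standalone sketch of the check described above (an approximation, not the reporter's instrumented popen_fork.py): os.fork() returns the child's pid in the parent and 0 in the child, so forking N workers on a healthy run should print N positive pids and N zeros.

---

import os

children = []
for _ in range(50):
    pid = os.fork()
    if pid == 0:
        # child: fork() returned 0; report and exit immediately
        print('child: pid == 0 (os.getpid() = %d)' % os.getpid())
        os._exit(0)
    # parent: fork() returned the child's pid
    print('parent: created child with pid %d' % pid)
    children.append(pid)

# Reap the children. The comment above describes the Pool's parent process
# hanging in an os.waitpid() call on the pid of the worker that never started.
for pid in children:
    os.waitpid(pid, 0)
print('reaped %d children' % len(children))

---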
Tomáš Bouda added the comment:

It's very difficult to reproduce. In this example, to get it stuck on 3.6/OSX, I need to attach a debugger. However, the freeze happens regardless of print/logging; even def f(): pass can get stuck. I've just tried os.write() as well and it made no difference, it still froze. It might even be possible that there's another problem in the debugger itself; however, on 2.7/RHEL the actual (production) code is embedded in unittest with "discover" mode, run from a shell with no debugger attached.

I couldn't reproduce it this morning; later in the afternoon it occurred twice in a row in another script. Over the past months I have seen the problem under various conditions. Originally I used ProcessPoolExecutor, where it happened rather often, so I rewrote the code to use Pool directly; the problem is rare now, but still occurs. The production code has many variants, including logging, custom prints, and no prints at all, and from time to time it happens regardless of anything else. However, there is also heavy load and high demand on OS resources (tens of workers, tens of GBs read/allocated, many explicit calls to the GC, etc.).
Tomáš Bouda added the comment:

Oh, I should add that by decreasing the number of workers to 4 or 8 the problem disappeared, at least to the extent that I was no longer able to reproduce it in any environment.
Tomáš Bouda added the comment:

By now I have spent several days trying to reproduce the behaviour in the production environment with a debugger attached, unfortunately with no success. On the other hand, the application froze again yesterday, and a colleague experienced the problem in his script today, too (talking about RHEL). Dealing with this kind of problem is always very frustrating.

At this point I agree with @pitrou that OSX/RHEL could be two different problems. I also tried the approach by @calimeroteknik, and it would actually make sense: if a child process receives a signal (SIGTERM or SIGSEGV), the parent waits forever. We do call 3rd-party libraries, and a segfault is indeed possible. I tried sending a signal to a child and the script really froze. For now, this seems to be the most probable explanation. The OSX debugger may also be buggy; yesterday I completely broke my system just by running my original script, leading to regular segfaults and a system restart (that never happened before).

Since I can't reproduce the problem under controlled conditions, I am OK with closing this bug. The script by @calimeroteknik seems to point in the same direction, and I think this may solve our problem, too.
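For completeness, a sketch in the spirit of the test described above (an approximation; @calimeroteknik's actual script is not reproduced here): kill one busy pool worker and check whether map() ever completes. Note that pool._pool is an internal attribute, used here only for the experiment.

---

import os
import signal
import time
from multiprocessing.pool import Pool

def f(i):
    time.sleep(5)      # keep the workers busy long enough to be killed
    return i

if __name__ == '__main__':
    pool = Pool(4)
    result = pool.map_async(f, range(8))
    time.sleep(1)                       # let the workers pick up their tasks
    victim = pool._pool[0].pid          # internal attribute: first worker process
    os.kill(victim, signal.SIGKILL)     # simulate a worker dying, e.g. from a segfault
    result.wait(timeout=30)
    if result.ready():
        print('map finished:', result.get())
    else:
        print('map still not finished after 30 s; the parent would wait forever')
        pool.terminate()
    pool.join()

---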