[issue47139] pthread_sigmask needs SIG_BLOCK behaviour explanation
New submission from Richard Purdie:

I've been struggling to get signal.pthread_sigmask to do what I expected it to do from the documentation. Having looked at the core Python code handling signals, I now (think?!) I understand what is happening. It might be possible for Python to improve the behaviour, or it might just be something to document; I'm not sure, but I thought I'd mention it.

I added pthread_sigmask(SIG_BLOCK, (SIGTERM,)) and pthread_sigmask(SIG_UNBLOCK, (SIGTERM,)) calls around a critical section I wanted to protect from the SIGTERM signal. I was still seeing SIGTERM inside that section. Using SIG_SETMASK to restore the mask instead of SIG_UNBLOCK behaves the same.

What I hadn't realised is, firstly, that Python defers signals to a convenient point and, secondly, that signals are processed in the main thread regardless of the thread they arrived in. This means I can see SIGTERM arrive in my critical section as one of my other threads, created in the background by the core Python libs, helpfully handles it. This makes SIG_BLOCK rather ineffective in any threaded code.

To work around it, I can add my own handlers and have them track whether a signal arrived, then handle any signals after my critical section by re-raising them.

It is possible Python itself could defer processing signals masked with SIG_BLOCK until they're unblocked. Alternatively, a note in the documentation warning of the pitfalls here might be helpful, to save someone else from wondering what is going on!

--
components: Interpreter Core
messages: 416154
nosy: rpurdie
priority: normal
severity: normal
status: open
title: pthread_sigmask needs SIG_BLOCK behaviour explanation
versions: Python 3.10

___
Python tracker <https://bugs.python.org/issue47139>
___
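For reference, a minimal sketch of the workaround described above (handler and variable names are mine, purely illustrative):

    import os
    import signal

    deferred = []

    def _defer(signum, frame):
        # Only record the signal; the real handling happens later.
        deferred.append(signum)

    old_handler = signal.signal(signal.SIGTERM, _defer)
    try:
        pass  # critical section goes here
    finally:
        signal.signal(signal.SIGTERM, old_handler)
        for signum in deferred:
            # Re-raise so the original handler runs after the critical section.
            os.kill(os.getpid(), signum)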
[issue47195] importlib lock race issue in deadlock handling code
New submission from Richard Purdie:

We've seen tracebacks in production like:

    File "", line 1004, in _find_and_load(name='oe.gpg_sign', import_=)
    File "", line 158, in _ModuleLockManager.__enter__()
    File "", line 110, in _ModuleLock.acquire()
    KeyError: 139622474778432

and

    File "", line 1004, in _find_and_load(name='oe.path', import_=)
    File "", line 158, in _ModuleLockManager.__enter__()
    File "", line 110, in _ModuleLock.acquire()
    KeyError: 140438942700992

I've attached a reproduction script which shows that if an "import XXX" is in progress and waiting at the wrong point when an interrupt arrives (in this case a signal) and triggers its own "import YYY", _blocking_on[tid] in importlib/_bootstrap.py gets overwritten and lost, triggering the traceback we see above upon exit from the second import.

I'm using a signal handler here as the interrupt. I don't know what our production source is as yet, but this reproducer proves it is possible.

--
components: Interpreter Core
files: testit2.py
messages: 416517
nosy: rpurdie
priority: normal
severity: normal
status: open
title: importlib lock race issue in deadlock handling code
versions: Python 3.10
Added file: https://bugs.python.org/file50714/testit2.py

___
Python tracker <https://bugs.python.org/issue47195>
___
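The attached testit2.py is the actual reproducer; the sketch below only shows the shape of the race (the module and signal choices are placeholders, not from the script):

    import signal
    import time

    # importlib's deadlock-avoidance code records the current thread in
    # _blocking_on while it waits in _ModuleLock.acquire(). A handler that
    # itself imports re-enters that code on the same thread and overwrites
    # (then deletes) the entry, producing the KeyError above.

    def handler(signum, frame):
        import json  # nested import fired from inside the interrupt

    signal.signal(signal.SIGALRM, handler)

    # In the real reproducer, another thread holds the lock for the module
    # being imported, so the main thread is blocked mid-acquire when the
    # signal fires.
    signal.alarm(1)
    time.sleep(2)  # stand-in for the import blocked in acquire()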
[issue47195] importlib lock race issue in deadlock handling code
Richard Purdie added the comment:

This is a production backtrace after I inserted code to traceback if tid was already in _blocking_on. It is being triggered by a warning about an unclosed asyncio event loop, and it confirms my theory about nested imports; in the production case I'd guess it is being triggered by gc, given the __del__.

    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/meta/classes/base.bbclass", line 26, in oe_import
      import oe.data
    File "", line 1024, in _find_and_load
    File "", line 171, in __enter__
    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 168, in acquire
      return orig_acquire(self)
    File "", line 110, in acquire
    File "/usr/lib64/python3.10/asyncio/base_events.py", line 685, in __del__
      _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
    File "/usr/lib64/python3.10/warnings.py", line 112, in _showwarnmsg
      _showwarnmsg_impl(msg)
    File "/usr/lib64/python3.10/warnings.py", line 28, in _showwarnmsg_impl
      text = _formatwarnmsg(msg)
    File "/usr/lib64/python3.10/warnings.py", line 128, in _formatwarnmsg
      return _formatwarnmsg_impl(msg)
    File "/usr/lib64/python3.10/warnings.py", line 56, in _formatwarnmsg_impl
      import tracemalloc
    File "", line 1024, in _find_and_load
    File "", line 171, in __enter__
    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 167, in acquire
      bb.warn("\n".join(traceback.format_stack()))

--

___
Python tracker <https://bugs.python.org/issue47195>
___
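A reconstruction of roughly what that instrumentation looks like (a sketch only; the real wrapper lives in bitbake's cooker.py and will differ in detail):

    import importlib._bootstrap as _bootstrap
    import threading
    import traceback

    orig_acquire = _bootstrap._ModuleLock.acquire

    def acquire(self):
        tid = threading.get_ident()
        if tid in _bootstrap._blocking_on:
            # A nested import is about to clobber _blocking_on[tid];
            # dump the stack so we can see where it came from.
            print("\n".join(traceback.format_stack()))
        return orig_acquire(self)

    _bootstrap._ModuleLock.acquire = acquire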
[issue47139] pthread_sigmask needs SIG_BLOCK behaviour explanation
Richard Purdie added the comment:

I think the Python code implementing pthread_sigmask already does trigger interrupts, if any have been queued, before the function returns from blocking or unblocking.

The key subtlety which I initially missed is that if you have another thread in your Python script, any interrupt it receives can be raised in the main thread whilst you're in the SIG_BLOCK section. This obviously isn't what you expect at all, as those interrupts are supposed to be blocked! It isn't really practical to try and SIG_BLOCK on all your individual threads.

What I'd wondered is what you mention: specifically, checking whether a signal is masked in the Python signal-raising code, with something like "pthread_sigmask(SIG_UNBLOCK, NULL /* set */, &oldset)", before it raises it, and if it is blocked, just leaving it queued. The current code would then trigger the interrupts when the signal was unmasked. This would effectively only apply on the main thread, where all the signals/interrupts are raised.

This would certainly give the behaviour that would be expected from the calls and save everyone implementing workarounds as I have. Due to the threads issue, I'm not sure SIG_BLOCK is actually useful in the real world with the current implementation, unfortunately.

Equally, if that isn't an acceptable fix, documenting it would definitely be good too.

--

___
Python tracker <https://bugs.python.org/issue47139>
___
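The same "query without modifying" trick is available from Python, which may help illustrate the proposed check (a minimal sketch; blocking the empty set is a no-op but still returns the old mask):

    import signal

    # Equivalent of pthread_sigmask(SIG_BLOCK, <empty set>, &oldset):
    # changes nothing, but returns the current mask.
    current_mask = signal.pthread_sigmask(signal.SIG_BLOCK, set())

    if signal.SIGTERM in current_mask:
        print("SIGTERM is currently blocked in this thread")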
[issue47258] Python 3.10 hang at exit in drop_gil() (due to resource warning at exit?)
New submission from Richard Purdie:

We had a Python hang at shutdown. The gdb Python backtrace and C backtraces are below. It is hung in the COND_WAIT(gil->switch_cond, gil->switch_mutex) call in drop_gil():

Py_FinalizeEx -> handle_system_exit() -> PyGC_Collect -> handle_weakrefs -> drop_gil

I think from the stack trace it may have been printing the warning:

sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/pokybuild/yocto-worker/oe-selftest-fedora/build/build-st-1560250/bitbake-cookerdaemon.log' mode='a+' encoding='UTF-8'>

however I'm not sure if it was that or trying to show a different exception. Even if we have a resource leak, it shouldn't really hang!

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib64/python3.10/weakref.py", line 106, in remove
    def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
Garbage-collecting

#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f0f7bd54b20 <_PyRuntime+512>) at futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f0f7bd54b20 <_PyRuntime+512>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2 0x7f0f7b88979f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0f7bd54b20 <_PyRuntime+512>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3 0x7f0f7b88beb0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0f7bd54b28 <_PyRuntime+520>, cond=0x7f0f7bd54af8 <_PyRuntime+472>) at pthread_cond_wait.c:504
#4 ___pthread_cond_wait (cond=cond@entry=0x7f0f7bd54af8 <_PyRuntime+472>, mutex=mutex@entry=0x7f0f7bd54b28 <_PyRuntime+520>) at pthread_cond_wait.c:619
#5 0x7f0f7bb388d8 in drop_gil (ceval=0x7f0f7bd54a78 <_PyRuntime+344>, ceval2=, tstate=0x558744ef7c10) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval_gil.h:182
#6 0x7f0f7bb223e8 in eval_frame_handle_pending (tstate=) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:1185
#7 _PyEval_EvalFrameDefault (tstate=, f=, throwflag=) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:1775
#8 0x7f0f7bb19600 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7f0f7a0c8a60, for file /usr/lib64/python3.10/weakref.py, line 106, in remove (wr=, selfref=, _atomic_removal=), tstate=0x558744ef7c10) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/internal/pycore_ceval.h:46
#9 _PyEval_Vector (tstate=, con=, locals=, args=, argcount=1, kwnames=) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:5065
#10 0x7f0f7bb989a8 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x7fff8b815bc8, callable=, tstate=0x558744ef7c10) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/cpython/abstract.h:114
#11 PyObject_CallOneArg (func=, arg=) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/cpython/abstract.h:184
#12 0x7f0f7bb0fce1 in handle_weakrefs (old=0x558744edbd30, unreachable=0x7fff8b815c70) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:887
#13 gc_collect_main (tstate=0x558744ef7c10, generation=2, n_collected=0x7fff8b815d50, n_uncollectable=0x7fff8b815d48, nofail=0) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:1281
#14 0x7f0f7bb9194e in gc_collect_with_callback (tstate=tstate@entry=0x558744ef7c10, generation=generation@entry=2) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:1413
#15 0x7f0f7bbc827e in PyGC_Collect () at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:2099
#16 0x7f0f7bbc7bc2 in Py_FinalizeEx () at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pylifecycle.c:1781
#17 0x7f0f7bbc7d7c in Py_Exit (sts=0) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pylifecycle.c:2858
#18 0x7f0f7bbc4fbb in handle_system_exit () at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:775
#19 0x7f0f7bbc4f3d in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x558744ef7c10) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:785
#20 PyErr_PrintEx (set_sys_last_vars=1) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:880
#21 0x7f0f7bbbcece in PyErr_Print () at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:886
#22 _PyRun_SimpleFileObject (fp=, filename=, closeit=1, flags=0x7fff8b815f18) at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:462
#23 0x7f0f7bbbcc57 in _PyRun_AnyFileObject (fp=0x558744ed9370, filename='/home/pokybuild/y
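As an aside on tracking down the ResourceWarning itself, the usual technique is to make the warning visible and run under tracemalloc so the report shows where the leaked file was opened (an illustrative sketch, not from the report; the path is made up):

    import warnings

    # ResourceWarning is ignored by default outside dev mode.
    warnings.simplefilter("always", ResourceWarning)

    def leak():
        open("/tmp/example.log", "a+")  # never closed

    # Run as `python -X tracemalloc=10 thisfile.py` and the warning will
    # include the allocation traceback of the leaked file object.
    leak()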
[issue41714] multiprocessing.Queue deadlock
New submission from Richard Purdie:

We're having some problems with multiprocessing.Queue where the parent process ends up hanging with zombie children. The code is part of bitbake, the task execution engine behind OpenEmbedded/Yocto Project. I've cut down our code to the pieces in question in the attached file. It doesn't give a runnable test case, unfortunately, but does at least show what we're doing.

Basically, we have a set of items to parse. We create a set of multiprocessing.Process() processes to handle the parsing in parallel. Jobs are queued in one queue and results are fed back to the parent via another. There is a quit queue that takes sentinels to cause the subprocesses to quit.

If something fails to parse, shutdown with clean=False is called and the sentinels are sent. The Parser() process calls results.cancel_join_thread() on the results queue. We do this since we don't care about the results any more; we just want to ensure everything exits cleanly.

This is where things go wrong. The Parser processes and their queues all turn into zombies. The parent process ends up stuck in self.result_queue.get(timeout=0.25) inside shutdown(). strace shows it has acquired the locks and is doing a read() on the os.pipe() it created. Unfortunately, since the parent still has a write channel open to the same pipe, it hangs indefinitely.

If I change the code to:

    self.result_queue._writer.close()
    while True:
        try:
            self.result_queue.get(timeout=0.25)
        except (queue.Empty, EOFError):
            break

i.e. close the writer side of the pipe by poking at the queue internals, we don't see the hang. The .close() method would close both sides.

We create our own process pool since this code dates from Python 2.x days and multiprocessing pools had issues back when we started using this. I'm sure it would be much better now, but we're reluctant to change what has basically been working. We drain the queues since in some cases we have clean shutdowns where cancel_join_thread() hasn't been used and we don't want those cases to block.

My question is whether this is a known issue, and whether there is some kind of API to close just the write side of the Queue to avoid problems like this.

--
components: Library (Lib)
files: simplified.py
messages: 376350
nosy: rpurdie
priority: normal
severity: normal
status: open
title: multiprocessing.Queue deadlock
type: crash
versions: Python 3.6
Added file: https://bugs.python.org/file49444/simplified.py

___
Python tracker <https://bugs.python.org/issue41714>
___
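A condensed sketch of the pattern described above (the attached simplified.py is the real cut-down code; everything here is illustrative):

    import multiprocessing
    import queue

    def parser(jobs, results, quit_q):
        while True:
            try:
                quit_q.get_nowait()
                # On unclean shutdown the real code calls
                # results.cancel_join_thread() here, which is where the
                # zombies and the parent-side hang appear.
                results.cancel_join_thread()
                return
            except queue.Empty:
                pass
            try:
                job = jobs.get(timeout=0.25)
            except queue.Empty:
                continue
            results.put(("parsed", job))

    if __name__ == "__main__":
        jobs = multiprocessing.Queue()
        results = multiprocessing.Queue()
        quit_q = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=parser,
                                         args=(jobs, results, quit_q))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            quit_q.put(None)  # one quit sentinel per parser
        for p in procs:
            p.join()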
[issue41714] multiprocessing.Queue deadlock
Richard Purdie added the comment:

I should also add that if we don't use cancel_join_thread() in the parser processes, things all work out OK. There is therefore seemingly something odd about the state that leaves things in. This issue doesn't occur every time; it's maybe 1 in 40 runs where we throw parsing errors, but I can brute-force reproduce it.

--

___
Python tracker <https://bugs.python.org/issue41714>
___
[issue41714] multiprocessing.Queue deadlock
Richard Purdie added the comment:

Even my hack to call _writer.close() doesn't seem to be enough; it makes the problem rarer, but there is still an issue.

Basically, if you call cancel_join_thread() in one process, the queue is potentially totally broken in all other processes that may be using it. If, for example, one process has called join_thread() as it was exiting and has queued data, at the same time as another process exits using cancel_join_thread() and exits holding the write lock, you'll deadlock on the processes now stuck in join_thread() waiting for a lock they'll never get.

I suspect the answer is "don't use cancel_join_thread()", but perhaps the docs need a note to point out that if anything is already potentially exiting, it can deadlock? I'm not sure you can actually use the API safely unless you stop all users from exiting and synchronise that by other means.

--

___
Python tracker <https://bugs.python.org/issue41714>
___
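For completeness, one shape the "synchronise by other means" approach can take: children never call cancel_join_thread(); each sends a final sentinel instead, and the parent drains the queue until it has seen one sentinel per child, so no child exits while its feeder thread still holds the pipe. A sketch under those assumptions (illustrative, not the bitbake code):

    import multiprocessing

    SENTINEL = ("done", None)

    def worker(results):
        results.put(("result", 42))
        results.put(SENTINEL)  # flushed by the feeder before exit

    if __name__ == "__main__":
        results = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=worker, args=(results,))
                 for _ in range(4)]
        for p in procs:
            p.start()
        remaining = len(procs)
        while remaining:
            # Draining keeps every child's feeder thread unblocked until
            # the child has confirmed it is done.
            if results.get() == SENTINEL:
                remaining -= 1
        for p in procs:
            p.join()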