[issue47139] pthread_sigmask needs SIG_BLOCK behaviour explanation

2022-03-28 Thread Richard Purdie


New submission from Richard Purdie:

I've been struggling to get signal.pthread_sigmask to do what I expected it to 
do from the documentation. Having looked at the core Python code handling 
signals, I now (think?!) I understand what is happening. It might be possible 
for Python to improve the behaviour, or it might just be something to document; 
I'm not sure, but I thought I'd mention it.

I'd added pthread_sigmask(SIG_BLOCK, (SIGTERM,)) and 
pthread_sigmask(SIG_UNBLOCK, (SIGTERM,)) calls around a critical section I 
wanted to protect from the SIGTERM signal. I was still seeing SIGTERM inside 
that section. Using SIG_SETMASK to restore the original mask instead of 
SIG_UNBLOCK behaves the same way.

What I hadn't realised is that, firstly, Python defers signal handling to a 
convenient point and, secondly, signals are handled in the main thread 
regardless of which thread they arrived in. This means I can still see SIGTERM 
inside my critical section: one of the other threads created in the background 
by the core Python libs helpfully receives it, and the handler then runs in my 
main thread. This makes SIG_BLOCK rather ineffective in any threaded code.

To work around it, I can install my own handlers and have them track whether a 
signal arrived, then handle any signals after my critical section by re-raising 
them (sketched below). It is possible Python itself could defer processing 
signals masked with SIG_BLOCK until they're unblocked. Alternatively, a note in 
the documentation warning of the pitfalls here might be helpful, to save 
someone else from wondering what is going on!
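
The workaround looks roughly like this (again a minimal sketch rather than our 
production code; only SIGTERM is handled and the critical section is a 
stand-in):

import signal
import time

pending = []

def deferring_handler(signum, frame):
    # Just record the signal; it gets re-raised after the critical section.
    pending.append(signum)

old_handler = signal.signal(signal.SIGTERM, deferring_handler)
try:
    time.sleep(1)   # stand-in for the critical section
finally:
    signal.signal(signal.SIGTERM, old_handler)
    for signum in pending:
        signal.raise_signal(signum)   # deliver anything we deferred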

--
components: Interpreter Core
messages: 416154
nosy: rpurdie
priority: normal
severity: normal
status: open
title: pthread_sigmask needs SIG_BLOCK behaviour explanation
versions: Python 3.10

Python tracker <https://bugs.python.org/issue47139>



[issue47195] importlib lock race issue in deadlock handling code

2022-04-01 Thread Richard Purdie


New submission from Richard Purdie:

We've seen tracebacks in production like:

  File "", line 1004, in 
_find_and_load(name='oe.gpg_sign', import_=)
  File "", line 158, in 
_ModuleLockManager.__enter__()
  File "", line 110, in _ModuleLock.acquire()
 KeyError: 139622474778432

and

  File "", line 1004, in 
_find_and_load(name='oe.path', import_=)
  File "", line 158, in 
_ModuleLockManager.__enter__()
  File "", line 110, in _ModuleLock.acquire()
 KeyError: 140438942700992

I've attached a reproduction script which shows that if an import XXX is in 
progress and waiting at the wrong point when an interrupt arrives (in this case 
a signal) and that interrupt triggers its own import YYY, the _blocking_on[tid] 
entry in importlib/_bootstrap.py gets overwritten and lost, triggering the 
traceback we see above upon exit from the second import.

I'm using a signal handler here as the interrupt; I don't know what our 
production source is as yet, but this reproducer proves it is possible.
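
For reference, the bookkeeping involved looks roughly like this (a simplified 
paraphrase of _ModuleLock.acquire() in importlib/_bootstrap.py, not the real 
code), which is why a nested import in the same thread clobbers the outer 
entry:

# _blocking_on is keyed only by the thread id, so a nested import in the
# same thread overwrites the outer import's entry and then deletes it.
_blocking_on = {}

def acquire(lock, tid):
    _blocking_on[tid] = lock        # the outer import records its lock
    try:
        # ... waits here for the module lock; if a signal handler runs at
        # this point and performs its own import, the nested acquire()
        # repeats the assignment above and its finally removes the key ...
        pass
    finally:
        del _blocking_on[tid]       # the outer call then raises KeyError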

--
components: Interpreter Core
files: testit2.py
messages: 416517
nosy: rpurdie
priority: normal
severity: normal
status: open
title: importlib lock race issue in deadlock handling code
versions: Python 3.10
Added file: https://bugs.python.org/file50714/testit2.py

Python tracker <https://bugs.python.org/issue47195>



[issue47195] importlib lock race issue in deadlock handling code

2022-04-01 Thread Richard Purdie


Richard Purdie added the comment:

This is a production backtrace after I inserted code to traceback if the tid 
was already in _blocking_on. It is being triggered by a warning about an 
unclosed asyncio event loop and confirms my theory about nested imports; in 
the production case I'd guess it is being triggered by gc, given the __del__.

  File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/meta/classes/base.bbclass", line 26, in oe_import
    import oe.data
  File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
  File "<frozen importlib._bootstrap>", line 171, in __enter__
  File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 168, in acquire
    return orig_acquire(self)
  File "<frozen importlib._bootstrap>", line 110, in acquire
  File "/usr/lib64/python3.10/asyncio/base_events.py", line 685, in __del__
    _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
  File "/usr/lib64/python3.10/warnings.py", line 112, in _showwarnmsg
    _showwarnmsg_impl(msg)
  File "/usr/lib64/python3.10/warnings.py", line 28, in _showwarnmsg_impl
    text = _formatwarnmsg(msg)
  File "/usr/lib64/python3.10/warnings.py", line 128, in _formatwarnmsg
    return _formatwarnmsg_impl(msg)
  File "/usr/lib64/python3.10/warnings.py", line 56, in _formatwarnmsg_impl
    import tracemalloc
  File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
  File "<frozen importlib._bootstrap>", line 171, in __enter__
  File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 167, in acquire
    bb.warn("\n".join(traceback.format_stack()))
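
The nesting itself is easy to see in isolation; a minimal illustration (not 
our production code) of a __del__ whose warning triggers another import:

import warnings

class Leaky:
    def __del__(self):
        # Mirrors asyncio's BaseEventLoop.__del__: a ResourceWarning with
        # source=self.
        warnings.warn(f"unclosed resource {self!r}", ResourceWarning, source=self)

warnings.simplefilter("always", ResourceWarning)

# Dropping the object runs __del__; formatting the warning reaches
# warnings._formatwarnmsg_impl(), which does "import tracemalloc", so when
# this fires from gc in the middle of another import we get a second,
# nested import in the same thread.
Leaky()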

--

Python tracker <https://bugs.python.org/issue47195>



[issue47139] pthread_sigmask needs SIG_BLOCK behaviour explanation

2022-04-05 Thread Richard Purdie


Richard Purdie added the comment:

I think the Python code implementing pthread_sigmask already does trigger 
interrupts, if any have been queued, before the function returns from a 
blocking or unblocking call.

The key subtlety which I initially missed is that if you have another thread in 
your Python script, any interrupt it receives can be raised in the main thread 
whilst you're in the SIG_BLOCK section. This obviously isn't what you expect at 
all, as those interrupts are supposed to be blocked! It isn't really practical 
to try to SIG_BLOCK on all of your individual threads.

What I'd wondered is what you mention: specifically, checking whether a signal 
is masked in the Python signal-raising code, with something like 
"pthread_sigmask(SIG_UNBLOCK, NULL /* set */, &oldset)", before raising it, 
and if it is blocked, just leaving it queued. The current code would then 
trigger the interrupts when the signal is unmasked. This would effectively 
only apply on the main thread, where all the signals/interrupts are raised.
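
For what it's worth, the equivalent query is already reachable from Python; a 
minimal sketch (blocking the empty set is a no-op that just returns the 
current mask):

import signal

# Query the calling thread's current signal mask without modifying it.
blocked = signal.pthread_sigmask(signal.SIG_BLOCK, set())
if signal.SIGTERM in blocked:
    print("SIGTERM is currently blocked in this thread")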

This would certainly give the behaviour that would be expected from the calls 
and save everyone from implementing workarounds like mine. Due to the threads 
issue, I'm not sure SIG_BLOCK is actually useful in the real world with the 
current implementation, unfortunately.

Equally, if that isn't an acceptable fix, documenting this would definitely be 
good too.

--

Python tracker <https://bugs.python.org/issue47139>



[issue47258] Python 3.10 hang at exit in drop_gil() (due to resource warning at exit?)

2022-04-08 Thread Richard Purdie


New submission from Richard Purdie:

We had a Python hang at shutdown. The gdb python backtrace and the C backtrace 
are below. It is hung in the COND_WAIT(gil->switch_cond, gil->switch_mutex) 
call in drop_gil().

handle_system_exit() -> Py_Exit() -> Py_FinalizeEx() -> PyGC_Collect() -> 
handle_weakrefs() -> drop_gil()

I think from the stack trace it may have been printing the warning:

sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/pokybuild/yocto-worker/oe-selftest-fedora/build/build-st-1560250/bitbake-cookerdaemon.log' mode='a+' encoding='UTF-8'>

however I'm not sure whether it was that or it was trying to show a different 
exception. Even if we do have a resource leak, it shouldn't really hang!

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib64/python3.10/weakref.py", line 106, in remove
def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
  Garbage-collecting

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, 
op=393, expected=0, futex_word=0x7f0f7bd54b20 <_PyRuntime+512>) at 
futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f0f7bd54b20 
<_PyRuntime+512>, expected=expected@entry=0, clockid=clockid@entry=0, 
abstime=abstime@entry=0x0, 
private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x7f0f7b88979f in __GI___futex_abstimed_wait_cancelable64 
(futex_word=futex_word@entry=0x7f0f7bd54b20 <_PyRuntime+512>, 
expected=expected@entry=0, clockid=clockid@entry=0, 
abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x7f0f7b88beb0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, 
mutex=0x7f0f7bd54b28 <_PyRuntime+520>, cond=0x7f0f7bd54af8 <_PyRuntime+472>) at 
pthread_cond_wait.c:504
#4  ___pthread_cond_wait (cond=cond@entry=0x7f0f7bd54af8 <_PyRuntime+472>, 
mutex=mutex@entry=0x7f0f7bd54b28 <_PyRuntime+520>) at pthread_cond_wait.c:619
#5  0x7f0f7bb388d8 in drop_gil (ceval=0x7f0f7bd54a78 <_PyRuntime+344>, 
ceval2=, tstate=0x558744ef7c10)
at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval_gil.h:182
#6  0x7f0f7bb223e8 in eval_frame_handle_pending (tstate=) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:1185
#7  _PyEval_EvalFrameDefault (tstate=, f=, 
throwflag=) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:1775
#8  0x7f0f7bb19600 in _PyEval_EvalFrame (throwflag=0, 
f=Frame 0x7f0f7a0c8a60, for file /usr/lib64/python3.10/weakref.py, line 
106, in remove (wr=, selfref=, _atomic_removal=), tstate=0x558744ef7c10)
at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/internal/pycore_ceval.h:46
#9  _PyEval_Vector (tstate=, con=, 
locals=, args=, argcount=1, kwnames=)
at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/ceval.c:5065
#10 0x7f0f7bb989a8 in _PyObject_VectorcallTstate (kwnames=0x0, 
nargsf=9223372036854775809, args=0x7fff8b815bc8, callable=, 
tstate=0x558744ef7c10) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/cpython/abstract.h:114
#11 PyObject_CallOneArg (func=, 
arg=) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Include/cpython/abstract.h:184
#12 0x7f0f7bb0fce1 in handle_weakrefs (old=0x558744edbd30, 
unreachable=0x7fff8b815c70) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:887
#13 gc_collect_main (tstate=0x558744ef7c10, generation=2, 
n_collected=0x7fff8b815d50, n_uncollectable=0x7fff8b815d48, nofail=0)
at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:1281
#14 0x7f0f7bb9194e in gc_collect_with_callback 
(tstate=tstate@entry=0x558744ef7c10, generation=generation@entry=2)
at /usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:1413
#15 0x7f0f7bbc827e in PyGC_Collect () at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Modules/gcmodule.c:2099
#16 0x7f0f7bbc7bc2 in Py_FinalizeEx () at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pylifecycle.c:1781
#17 0x7f0f7bbc7d7c in Py_Exit (sts=0) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pylifecycle.c:2858
#18 0x7f0f7bbc4fbb in handle_system_exit () at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:775
#19 0x7f0f7bbc4f3d in _PyErr_PrintEx (set_sys_last_vars=1, 
tstate=0x558744ef7c10) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:785
#20 PyErr_PrintEx (set_sys_last_vars=1) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:880
#21 0x7f0f7bbbcece in PyErr_Print () at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:886
#22 _PyRun_SimpleFileObject (fp=, filename=, 
closeit=1, flags=0x7fff8b815f18) at 
/usr/src/debug/python3.10-3.10.4-1.fc35.x86_64/Python/pythonrun.c:462
#23 0x7f0f7bbbcc57 in _PyRun_AnyFileObject (fp=0x558744ed9370, 
filename='/home/pokybuild/y

[issue41714] multiprocessing.Queue deadlock

2020-09-04 Thread Richard Purdie


New submission from Richard Purdie:

We're having some problems with multiprocessing.Queue where the parent process 
ends up hanging with zombie children. The code is part of bitbake, the task 
execution engine behind OpenEmbedded/Yocto Project.

I've cut down our code to the pieces in question in the attached file. It 
doesn't give a runnable test case unfortunately, but it does at least show 
what we're doing. Basically, we have a set of items to parse, and we create a 
set of multiprocessing.Process() processes to handle the parsing in parallel. 
Jobs are queued in one queue and results are fed back to the parent via 
another. There is a quit queue that takes sentinels to cause the subprocesses 
to quit. A rough sketch of the layout follows below.
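
To give a feel for the layout without the attachment, here is a heavily 
simplified sketch (not the attached simplified.py; it uses an Event in place 
of our sentinel quit queue and a dummy parse step):

import multiprocessing
import queue

def parser(jobs, results, quit_event):
    # Stand-in for our Parser() process: pull jobs, push results, exit on quit.
    while not quit_event.is_set():
        try:
            job = jobs.get(timeout=0.25)
        except queue.Empty:
            continue
        results.put(job * 2)   # dummy "parse" result
    # Unclean shutdown: we no longer care about queued results, so don't
    # block waiting for the queue's feeder thread to flush them.
    results.cancel_join_thread()

if __name__ == "__main__":
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    quit_event = multiprocessing.Event()

    procs = [multiprocessing.Process(target=parser, args=(jobs, results, quit_event))
             for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(20):
        jobs.put(i)

    quit_event.set()   # equivalent of shutdown(clean=False) sending sentinels

    # Drain the results so no child is left blocked writing to a full pipe.
    while True:
        try:
            results.get(timeout=0.25)
        except queue.Empty:
            break
    for p in procs:
        p.join()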

If something fails to parse, shutdown() is called with clean=False and the 
sentinels are sent; the Parser() processes call results.cancel_join_thread() 
on the results queue. We do this since we don't care about the results any 
more, we just want to ensure everything exits cleanly. This is where things go 
wrong: the Parser processes and their queues all turn into zombies, and the 
parent process ends up stuck in self.result_queue.get(timeout=0.25) inside 
shutdown().

strace shows it has acquired the locks and is doing a read() on the os.pipe() 
it created. Unfortunately, since the parent still has a write channel open to 
the same pipe, it hangs indefinitely.

If I change the code to do:

self.result_queue._writer.close()
while True:
    try:
        self.result_queue.get(timeout=0.25)
    except (queue.Empty, EOFError):
        break

i.e. close the writer side of the pipe by poking at the queue internals, then 
we don't see the hang. The public .close() method would close both sides.

We create our own process pool since this code dates from the Python 2.x days 
and multiprocessing pools had issues back when we started using this. I'm sure 
it would be much better now, but we're reluctant to change what has basically 
been working. We drain the queues since in some cases we have clean shutdowns 
where cancel_join_thread() hasn't been used and we don't want those cases to 
block.

My question is whether this is a known issue and whether there is some kind of 
API to close just the write side of the Queue to avoid problems like this?

--
components: Library (Lib)
files: simplified.py
messages: 376350
nosy: rpurdie
priority: normal
severity: normal
status: open
title: multiprocessing.Queue deadlock
type: crash
versions: Python 3.6
Added file: https://bugs.python.org/file49444/simplified.py

Python tracker <https://bugs.python.org/issue41714>



[issue41714] multiprocessing.Queue deadlock

2020-09-04 Thread Richard Purdie


Richard Purdie added the comment:

I should also add that if we don't use cancel_join_thread() in the parser 
processes, things all work out OK. There is therefore seemingly something odd 
about the state that call leaves things in.

This issue doesn't occur every time; it's maybe 1 in 40 of the runs where we 
throw parsing errors, but I can brute-force reproduce it.

--

Python tracker <https://bugs.python.org/issue41714>



[issue41714] multiprocessing.Queue deadlock

2020-09-04 Thread Richard Purdie


Richard Purdie added the comment:

Even my hack to call _writer.close() doesn't seem to be enough; it makes the 
problem rarer but there is still an issue.

Basically, if you call cancel_join_thread() in one process, the queue is 
potentially totally broken in all the other processes that may be using it. 
If, for example, another process has called join_thread() as it was exiting 
and has queued data, and at the same time a different process exits using 
cancel_join_thread() while holding the write lock, you'll deadlock: the 
processes now stuck in join_thread() are waiting for a lock they'll never get.

I suspect the answer is "don't use cancel_join_thread()", but perhaps the docs 
need a note to point out that if anything else is already potentially exiting, 
it can deadlock? I'm not sure you can actually use the API safely unless you 
stop all users from exiting and synchronise that by other means.

--

Python tracker <https://bugs.python.org/issue41714>