Nathaniel, thanks for the feedback. Experiments where I acquire the GIL in spots where I already hold it lead me to suspect that the issue is caused by numpy's use of Py_BEGIN_ALLOW_THREADS (or the equivalent NPY_BEGIN_ALLOW_THREADS macro).

To put this point to rest: I have a high degree of confidence in our code. I have run it under valgrind, and it runs cleanly when I use Python's valgrind suppression file. The error is certainly harder to reproduce under valgrind, and everything runs so slowly there that running in a bash loop (the usual way I reproduce the crash) is not practical. I have also examined the crashes with gdb and verified that they occur while I hold the GIL.

However, based on your suggestion, I experimented with acquiring the GIL in certain spots in my code where it should not be necessary, either because those pieces of code are invoked by the main thread while only one thread is running, or because the code in question is invoked while I'm already holding the GIL. I also added trace output to stderr in those spots.

Acquiring the GIL caused deadlocks as expected, except when I used it in the new_object template below (the only spot where we pass data to numpy). Acquiring the GIL here fixes the issues!

    // ***************************************************************************
    template <typename NT>
    PyArrayObject *new_object(teca_variant_array_impl<NT> *varrt)
    {
        // experimental: I already hold the GIL higher up in the stack
        PyGILState_STATE gstate;
        gstate = PyGILState_Ensure();

        TECA_STATUS("teca_py_array::new_object");

        // allocate a buffer
        npy_intp n_elem = varrt->size();
        size_t n_bytes = n_elem*sizeof(NT);
        NT *mem = static_cast<NT*>(malloc(n_bytes));
        if (!mem)
        {
            PyErr_Format(PyExc_RuntimeError,
                "failed to allocate %lu bytes", n_bytes);
            return nullptr;
        }

        // copy the data
        memcpy(mem, varrt->get(), n_bytes);

        // put the buffer into a new numpy object
        PyArrayObject *arr = reinterpret_cast<PyArrayObject*>(
            PyArray_SimpleNewFromData(1, &n_elem, numpy_tt<NT>::code, mem));
        PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);

        PyGILState_Release(gstate);    // experimental
        return arr;
    }

Now, this function is only used from within the user-provided Python callback, which is invoked only while I'm holding the GIL, a fact I've verified via gdb stack traces. Because I hold the GIL, this function should be running serially. However, it is evidently running concurrently: the random crashes when it's used and the garbled stderr output both go away with the addition above. I don't think I should have to acquire the GIL here, but the evidence is against me!
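For concreteness, the thread-side invocation looks roughly like the following. This is a minimal sketch, not the actual TECA sources; the function and argument names are illustrative:

    // sketch: each C++11 worker thread holds the GIL for the duration of the
    // user's Python callback, so everything the callback does (including the
    // numpy calls made via new_object) should be serialized
    void invoke_user_callback(PyObject *callback, PyObject *args)
    {
        PyGILState_STATE gstate = PyGILState_Ensure();   // acquire the GIL

        PyObject *ret = PyObject_CallObject(callback, args);
        if (!ret)
            PyErr_Print();
        Py_XDECREF(ret);

        PyGILState_Release(gstate);                       // release the GIL
    }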

In the Python docs on the GIL, I noticed that there's a macro, Py_BEGIN_ALLOW_THREADS, that Python uses internally around blocking I/O and heavy computations. Much to my surprise, grep reveals that Py_BEGIN_ALLOW_THREADS is also used in numpy. I think this can explain the issues I'm experiencing: wherever numpy uses Py_BEGIN_ALLOW_THREADS, it would let my threads, which use the numpy C-API via the new_object template above, run concurrently despite the fact that I hold the GIL. It seems to me that for numpy to be thread safe it should not use this macro at all. At least that is my theory; I haven't yet had a chance to modify the numpy build to verify it. It's also possible that Py_BEGIN_ALLOW_THREADS is used elsewhere.
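For reference, the CPython C-API docs show what the macro pair expands to; this expansion is what releases and later re-acquires the GIL around the enclosed block:

    /* this pair ...                                                       */
    Py_BEGIN_ALLOW_THREADS
        /* blocking I/O or heavy computation that must not touch Python objects */
    Py_END_ALLOW_THREADS

    /* ... expands to roughly the following */
    {
        PyThreadState *_save;
        _save = PyEval_SaveThread();    /* release the GIL, save the thread state */
        /* blocking I/O or heavy computation */
        PyEval_RestoreThread(_save);    /* re-acquire the GIL, restore the state */
    }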

Here's my question: given that Py_BEGIN_ALLOW_THREADS is used by numpy, how can numpy be thread safe? And how can someone using the C-API know where it's necessary to acquire the GIL? Maybe someone can explain this?

    $ grep BEGIN_ALLOW_THREADS ./ -rIn
    ./doc/source/f2py/signature-file.rst:250:      Use ``Py_BEGIN_ALLOW_THREADS .. Py_END_ALLOW_THREADS`` block
    ./doc/source/reference/c-api.array.rst:3092:    .. c:macro:: NPY_BEGIN_ALLOW_THREADS
    ./doc/source/reference/c-api.array.rst:3094:        Equivalent to :c:macro:`Py_BEGIN_ALLOW_THREADS` except it uses
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2109: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2125: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2143: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2159: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2177: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2193: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2211: Py_BEGIN_ALLOW_THREADS
    ./build/src.linux-x86_64-2.7/numpy/core/src/multiarray/scalartypes.c:2227: Py_BEGIN_ALLOW_THREADS
    ./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:419: {isthreadsafe: '\t\t\tPy_BEGIN_ALLOW_THREADS'},
    ./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:457: {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:495: {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:541: {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./build/lib.linux-x86_64-2.7/numpy/f2py/rules.py:583: {isthreadsafe: '\t\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/f2py/rules.py:419:            {isthreadsafe: '\t\t\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/f2py/rules.py:457:            {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/f2py/rules.py:495:            {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/f2py/rules.py:541:            {isthreadsafe: '\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/f2py/rules.py:583: {isthreadsafe: '\t\tPy_BEGIN_ALLOW_THREADS'},
    ./numpy/fft/fftpack_litemodule.c:45: Py_BEGIN_ALLOW_THREADS;
    ./numpy/fft/fftpack_litemodule.c:98: Py_BEGIN_ALLOW_THREADS;
    ./numpy/fft/fftpack_litemodule.c:135: Py_BEGIN_ALLOW_THREADS;
    ./numpy/fft/fftpack_litemodule.c:191: Py_BEGIN_ALLOW_THREADS;
    ./numpy/fft/fftpack_litemodule.c:254: Py_BEGIN_ALLOW_THREADS;
    ./numpy/fft/fftpack_litemodule.c:295: Py_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/ctors.c:3237: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/ctors.c:3277: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/cblasfuncs.c:378: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/cblasfuncs.c:525: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/cblasfuncs.c:549: NPY_BEGIN_ALLOW_THREADS
    ./numpy/core/src/multiarray/cblasfuncs.c:576: NPY_BEGIN_ALLOW_THREADS
    ./numpy/core/src/multiarray/cblasfuncs.c:628: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/cblasfuncs.c:761: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:157: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:181: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:473: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:810: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:1014: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/compiled_base.c:1049: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/scalartypes.c.src:939: Py_BEGIN_ALLOW_THREADS
    ./numpy/core/src/multiarray/scalartypes.c.src:955: Py_BEGIN_ALLOW_THREADS
    ./numpy/core/src/multiarray/convert.c:98: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/convert.c:210: NPY_BEGIN_ALLOW_THREADS;
    ./numpy/core/src/multiarray/multiarraymodule.c:3981: Py_BEGIN_ALLOW_THREADS;
    ./numpy/core/include/numpy/ndarraytypes.h:932:#define NPY_BEGIN_ALLOW_THREADS Py_BEGIN_ALLOW_THREADS
    ./numpy/core/include/numpy/ndarraytypes.h:952:#define NPY_BEGIN_ALLOW_THREADS

On 06/13/2016 07:07 PM, Nathaniel Smith wrote:
Hi Burlen,

On Jun 13, 2016 5:24 PM, "Burlen Loring" <blor...@lbl.gov> wrote:
Hi All,

I'm working on a threaded pipeline where we want the end user to be able to
code up Python functions to do numerical work. Threading is all done in C++11,
and in each thread we acquire the GIL before we invoke the user-provided
Python callback and release it only when the callback returns. We've used SWIG
to expose bindings to C++ objects in Python.

When run with more than 1 thread I get inconsistent segv's in various numpy 
routines, and occasionally see a *** Reference count error detected: an attempt 
was made to deallocate 11 (f) ***.

To pass data from C++ to the Python callback as a numpy array we have
206 // 
****************************************************************************
207 template <typename NT>
208 PyArrayObject *new_object(teca_variant_array_impl<NT> *varrt)
209 {
210     // allocate a buffer
211     npy_intp n_elem = varrt->size();
212     size_t n_bytes = n_elem*sizeof(NT);
213     NT *mem = static_cast<NT*>(malloc(n_bytes));
214     if (!mem)
215     {
216         PyErr_Format(PyExc_RuntimeError,
217             "failed to allocate %lu bytes", n_bytes);
218         return nullptr;
219     }
220
221     // copy the data
222     memcpy(mem, varrt->get(), n_bytes);
223
224     // put the buffer in to a new numpy object
225     PyArrayObject *arr = reinterpret_cast<PyArrayObject*>(
226         PyArray_SimpleNewFromData(1, &n_elem, numpy_tt<NT>::code, mem));
227     PyArray_ENABLEFLAGS(arr, NPY_ARRAY_OWNDATA);
228
229     return arr;
230 }
This code would probably be much simpler if you let numpy allocate the
buffer with PyArray_SimpleNew and then did the memcpy. I doubt that's
your problem, though. Numpy should be assuming that "owned" data was
allocated using malloc(), and if it were using a different allocator
then I think you'd be seeing crashes much sooner.
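Roughly, an untested sketch of that approach might look like the following (it reuses the numpy_tt<NT>::code trait from the original code):

    // sketch: let numpy allocate and own the buffer, then copy into it
    template <typename NT>
    PyArrayObject *new_object(teca_variant_array_impl<NT> *varrt)
    {
        npy_intp n_elem = varrt->size();

        PyArrayObject *arr = reinterpret_cast<PyArrayObject*>(
            PyArray_SimpleNew(1, &n_elem, numpy_tt<NT>::code));
        if (!arr)
            return nullptr;

        memcpy(PyArray_DATA(arr), varrt->get(), n_elem*sizeof(NT));
        return arr;
    }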

This is the only place we create numpy objects on the C++ side.

In my demo the Python callback is as follows:
  33 def get_execute(rank, var_names):
  34     def execute(port, data_in, req):
  35         sys.stderr.write('descriptive_stats::execute MPI %d\n'%(rank))
  36
  37         mesh = as_teca_cartesian_mesh(data_in[0])
  38
  39         table = teca_table.New()
  40         table.copy_metadata(mesh)
  41
  42         table.declare_columns(['step','time'], ['ul','d'])
  43         table << mesh.get_time_step() << mesh.get_time()
  44
  45         for var_name in var_names:
  46
  47             table.declare_columns(['min '+var_name, 'avg '+var_name, \
  48                 'max '+var_name, 'std '+var_name, 'low_q '+var_name, \
  49                 'med '+var_name, 'up_q '+var_name], ['d']*7)
  50
  51             var = mesh.get_point_arrays().get(var_name).as_array()
  52
  53             table << float(np.min(var)) << float(np.average(var)) \
  54                 << float(np.max(var)) << float(np.std(var)) \
  55                 << map(float, np.percentile(var, [25.,50.,75.]))
  56
  57         return table
  58     return execute
This callback is the only spot where numpy is used; the as_array call is
implemented by the new_object template above.
Further, if I remove our use of PyArray_SimpleNewFromData by replacing line 51
in the Python code above with var = np.array(range(1, 1100), 'f'), the problem
disappears. It must have something to do with the use of PyArray_SimpleNewFromData.

I'm at a loss to see why things are going south. I'm using the GIL and I
thought that would serialize the Python code. I suspect that numpy is using
global or static variables somewhere internally and that it's inherently
thread unsafe. Can anyone confirm/deny? Maybe point me in the right direction?
Numpy does use global/static variables, and it is unsafe to call into
numpy simultaneously from different threads. But that's ok, because
you're not allowed to call numpy functions simultaneously from
different threads -- you have to hold the GIL first, and that
serializes access to all of numpy's internal state. Numpy is very
commonly used in threaded code and most people aren't seeing random
segfaults, so the problem is most likely in your code. Sorry I can't
help much more than that... I guess I'd start by triple-checking that
the code really truly does hold the GIL every time that it calls into
numpy/python APIs. I'd also try running it under valgrind in case it's
some other random memory corruption that's just showing up in a weird
way.

-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
