[Python-Dev] Re: Proliferation of tstate arguments.
On 16/03/2020 3:04 pm, Victor Stinner wrote: Hi, Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review. Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGVYFTVDLBYURLCXA3T7IPEHHO/#Q4IPXMQIM5YRLZLHADUGSUT4ZLXQ6MYY In short, the answer is yes. I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack. 1. Using a thread-local variable for the thread state requires much smaller changes to the code base. 2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks? 3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall. There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation. I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts. -- In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c. I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511 That's great, but I don't see the relevance. The long-term goal is to be able to run multiple isolated interpreters in parallel. An admirable goal, IMO. But how is it to be done? That is the question. Le lun. 16 mars 2020 à 15:16, Mark Shannon a écrit : There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions. So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate. These changes are listed under https://bugs.python.org/issue38644. There was also https://bugs.python.org/issue36710 I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways. Would you mind to elaborate? Using thread-local variables changes the API less, is more reliable and maybe faster. The changes add bulk to the C-API and may hurt performance. Did you notice that in benchmarks? I would be curious to see the overhead. These changes are also causing a lot of churn and merge conflicts (for me at least). Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters. I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance. Cheers, Mark. Victor ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WOKAUQYDJDVRA7SJRJDEAHXTRXSVPNMO/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
On 16/03/2020 6:21 pm, Victor Stinner wrote: There were quick discussions about using thread local storage (TLS) to get and set the current Python thread state ("tstate"), instead of reading/setting an atomic variable (_PyRuntime.gilstate.tstate_current). In fact, TLS already exists as "PyGILState" and PyGILState_GetThisThreadState() can already return the current thread state. But this API doesn't work currently with subinterpreters: Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q * https://bugs.python.org/issue10915 * https://bugs.python.org/issue15751 It's unclear to me if fixing this issue would require to add a lock, nor if it would make PyGILState_GetThisThreadState() or _PyThreadState_GET() slower. It doesn't require a lock, and it is only two instructions (it's 5 instructions on Windows, but that's still cheap). Moreover, currently, it's possible to have two Python thread state for the same native thread. That's needed for the transition from the main interpreter to new subinterpreter. Example: --- mainstate = PyThreadState_Get(); PyThreadState_Swap(NULL); substate = Py_NewInterpreter(); r = PyRun_SimpleString(code); Py_EndInterpreter(substate); PyThreadState_Swap(mainstate); --- The visibility of a thread local variable is a strict superset of that of a function local variable. Anything you can do with a function local variable, you can do with a thread local variable. where Py_NewInterpreter() creates a new Python thread state using PyThreadState_New() and sets it as the current Python thread state. Maybe the subinterpreter API can evolve to never "attach" two different interpreters to the same thread: one Python thread state would only belong to a single native thread. I would be interested to explore the option of using TLS everywhere, but first we need to solve all these tricky issues. Again, so far, passing tstate explicitly is only done internally, so if we switch to another solution which doesn't require to pass tstate explicitly, we can do that without breaking the public C API. -- I hope that you now have a better overview of the current state of the Python implementation and how it evolves in the last two years. What worries me is the idea that passing the thread state around is the only way to isolate sub-interpreters. It isn't. Using thread-local variables is a better way. There may be even better approaches, but I'm not aware of them. Victor Le lun. 16 mars 2020 à 16:04, Victor Stinner a écrit : Hi, Changes on this scale merit a PEP and proper discussion, rather than being added piecemeal without proper review. Last November, I asked explicitly on python-dev if we should "Pass the Python thread state to internal C functions": https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGVYFTVDLBYURLCXA3T7IPEHHO/#Q4IPXMQIM5YRLZLHADUGSUT4ZLXQ6MYY In short, the answer is yes. There is no PEP but scatted documents. I wrote a short article to elaborate the context of this work: https://vstinner.github.io/cpython-pass-tstate.html One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation. -- In the meanwhile, I modified "small integer singletons" to make them "per-interpreter". So tstate->interp is now used to get small integers in longobject.c. I also opened a discussion on other singletons (None, True, False, ...): https://bugs.python.org/issue39511 The long-term goal is to be able to run multiple isolated interpreters in parallel. Le lun. 16 mars 2020 à 15:16, Mark Shannon a écrit : There seems to be a proliferation of `PyThreadState *tstate` arguments being added to API and internal functions. So far, tstate should only be passed to *internal* C functions. I don't think that the public C API has been modified to pass tstate. These changes are listed under https://bugs.python.org/issue38644. There was also https://bugs.python.org/issue36710 I think that these changes are misguided. The desired results can be achieved more reliably and more simply in other ways. Would you mind to elaborate? The changes add bulk to the C-API and may hurt performance. Did you notice that in benchmarks? I would be curious to see the overhead. These changes are also causing a lot of churn and merge conflicts (for me at least). Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters. Victor -- Night gathers, and now my watch begins. It shall not end until my death. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7T3EP7TB6KI2SMLWXRQHXMXWQMFVIJHG/ Code of Conduc
[Python-Dev] Re: Proliferation of tstate arguments.
On Tue, 17 Mar 2020 14:47:19 + Mark Shannon wrote: > On 16/03/2020 3:04 pm, Victor Stinner wrote: > > Hi, > > > >> Changes on this scale merit a PEP and proper discussion, rather than > >> being added piecemeal without proper review. > > > > Last November, I asked explicitly on python-dev if we should "Pass the > > Python thread state to internal C functions": > > https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGVYFTVDLBYURLCXA3T7IPEHHO/#Q4IPXMQIM5YRLZLHADUGSUT4ZLXQ6MYY > > > > In short, the answer is yes. > > I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. > > Let me reiterate why using a thread-local variable is better than > passing the thread state down the C stack. > > 1. Using a thread-local variable for the thread state requires much > smaller changes to the code base. > > 2. Using a thread-local variable is less error prone. When passing > tstate as a parameter, what happens if the tstate argument is from a > different thread or is NULL? Are you adding checks for those cases? > What are the performance implications of adding those checks? > > 3. Using a thread-local variable is likely to be a little bit faster. > Passing an argument down the stack increases register pressure and spills. > Accessing a thread-local is slower at the point of access, but the cost > is incurred only when it is needed, so is cheaper overall. The problem here is if different subinterpreters create a thread state for the same OS thread. Sounds unlikely, perhaps, but not entirely, especially if that thread state is created from a foreign C library's callback. Regards Antoine. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7IZMNVXEJCY5RCIDIEK4CADPW7UPGSGR/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
On 17Mar2020 1447, Mark Shannon wrote: On 16/03/2020 3:04 pm, Victor Stinner wrote: In short, the answer is yes. I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. I said "yes" then and was also not faulted. Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack. 1. Using a thread-local variable for the thread state requires much smaller changes to the code base. Using thread-local variables enforces a threading model on the host application, rather than working with the existing threading model. So anyone embedding is forced into *significantly* more code as a result. We can (and should) maintain a public-facing API that uses TLS to pass the current thread state around - we have compatibility constraints. But we can also add private versions that take the thread state (once you've started trying/struggling to really embed CPython, you'll happily take a dependency on "private" APIs). If the only available API requires TLS, then you're likely to see the caller wrap it all up in a function that updates TLS before calling. Or alternatively, introduce dedicated threads for running Python snippets on, and all the (dead)locking that results (yes, I've done both). Our goal as core CPython developers should be to sacrifice our own effort to reduce the effort needed by our users, not to do things that make our own lives easier but harm them. 2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks? Undefined behaviour is totally acceptable here. We can assert in debug builds - developers who make use of this can test with debug builds. 3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall. Compilers can optimise parameters/locals in ways that are far more efficient than they can do for anything stored outside the call stack. Especially for internal calls. Going through public/exported functions is a little more restricted in terms of optimisations, but if we identify an issue here then we can work on that then. [OTHER POST] Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q Showing what one particular compiler generates for one particular situation is terrible information (I won't bother calling it "evidence"). One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation. I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts. That's a very selfish point-of-view :) It eases it because many more operations need to know the current Python "thread" in order to access things that used to be globals, such as PyTypeObject instances. Having the thread state easily and efficiently accessible does make a difference here. The long-term goal is to be able to run multiple isolated interpreters in parallel. An admirable goal, IMO. But how is it to be done? That is the question. By isolating thread states properly from global state. Sorry about that :-/ A lot of Python internals should be modified to implement subinterpreters. I don't think they *should*. When they *must*, then do that. Changes should only be made if necessary for correctness or for a significant improvement of performance. They must - I think Victor just chose the wrong English word there. Correctness is the first thing to fall when you access globals in multithreaded code, and the CPython code base accesses a lot of globals. Cheers, Steve ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5FB5PNH2H72ZAAN7TEDR3NS45H34ELEY/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
Le mar. 17 mars 2020 à 15:47, Mark Shannon a écrit : > > There is no PEP but scatted documents. I wrote a short article to > > elaborate the context of this work: > > https://vstinner.github.io/cpython-pass-tstate.html > > > > One motivation is to ease the implementation of subinterpreters (PEP > > 554). But PEP 554 describes more than public API than the > > implementation. > > I don't see how this eases the implementation of subinterpreters. The overall work helps to identify what is currently shared by subinterpreters and prepares the code base to "unshare" data. For me, it's more explicit that _PyErr_Occurred(tstate) gets the current exception from the tstate argument, than PyErr_Occurred() which implicitly gets the current Python thread state. Before the work on subinterpreters started, it was really hard to guess where all the states come from and if anything was shared. I mean, tons of states were shared. Just because it's easier to implement like that! Eric Snow started by moving many core states into _PyRuntimeState or PyInterpreterState between Python 3.7 and Python 3.8. Previously, many functions had an implicit state which was shared by all interpreters. Now, it's still shared, but it becomes easier to reorganize the code to have one state per interpreter. For example, that's how I managed to move the state of the gc module per-interpreter: https://bugs.python.org/issue36854 To be able to run two interpreters in parallel, it's very important to not share anything for best performances but also for correctness. Concurrent accesses of shared states with no locking mecasim can introduce inconsistencies. One solution for that is to not share anything ;-) > > I also opened a discussion on other singletons (None, True, False, > > ...): https://bugs.python.org/issue39511 > > That's great, but I don't see the relevance. I proposed a fix which moves None and True singletons into PyInterpreterState. If we choose this solution, it would mean that more and more functions will have to pass tstate (ex: none = tstate->interp->none). But let's discuss the details in https://bugs.python.org/issue39511 Other solutions have been proposed ;-) > > The long-term goal is to be able to run multiple isolated interpreters > > in parallel. > > An admirable goal, IMO. > But how is it to be done? That is the question. Passing states explicitly in one task. Another task is to convert modules to the multiphase initialization PEP 489. It allows to have multiple isolated instances of the same C extension module, one per subinterpreter. No more shared module state! Eric Snow tracks the different tasks under the https://github.com/ericsnowcurrently/multi-core-python/ project. > Changes should only be made if necessary for correctness or for a > significant improvement of performance. It's for correctness ;-) Victor -- Night gathers, and now my watch begins. It shall not end until my death. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TDTN7SRXND3Z4B2L5XYAP4COK6JP4TBD/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
Le mar. 17 mars 2020 à 15:49, Mark Shannon a écrit : > > * https://bugs.python.org/issue10915 > > * https://bugs.python.org/issue15751 > > > > It's unclear to me if fixing this issue would require to add a lock, > > nor if it would make PyGILState_GetThisThreadState() or > > _PyThreadState_GET() slower. > > It doesn't require a lock, and it is only two instructions (it's 5 > instructions on Windows, but that's still cheap). I'm not sure that I understand what you mean. Do you have a fix for these issues, to be able to use this API in subinterpreters? I asked if fixing these issues may need to add another lock or atomic variable. Victor -- Night gathers, and now my watch begins. It shall not end until my death. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YCF4JK6GIM6M7ZVBLTA4E4NFFWDVDIBJ/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
On 17/03/2020 3:38 pm, Steve Dower wrote: On 17Mar2020 1447, Mark Shannon wrote: On 16/03/2020 3:04 pm, Victor Stinner wrote: In short, the answer is yes. I said "no" then and gave reasons. AFAICT no one has faulted my reasoning. I said "yes" then and was also not faulted. I'll do that now then ;) The accessibility of a thread-local variable is a strict superset of that of a function-local variable. Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter. Let me reiterate why using a thread-local variable is better than passing the thread state down the C stack. 1. Using a thread-local variable for the thread state requires much smaller changes to the code base. Using thread-local variables enforces a threading model on the host application, rather than working with the existing threading model. So anyone embedding is forced into *significantly* more code as a result. Putting a value in a function-local variable enforces stronger restrictions than putting it in a thread-local variable. I am proposing that we *don't* change the API. How does that make more work for anyone using the API? We can (and should) maintain a public-facing API that uses TLS to pass the current thread state around - we have compatibility constraints. But we can also add private versions that take the thread state (once you've started trying/struggling to really embed CPython, you'll happily take a dependency on "private" APIs). Again, I am requesting that we *don't* change the API. Not changing the API maintains backwards compatibility better than changing it, surely. If the only available API requires TLS, then you're likely to see the caller wrap it all up in a function that updates TLS before calling. Or alternatively, introduce dedicated threads for running Python snippets on, and all the (dead)locking that results (yes, I've done both). All the platforms that we support have thread-local storage. If a platform doesn't have threads at all, then thread-local just degenerates to a global. Our goal as core CPython developers should be to sacrifice our own effort to reduce the effort needed by our users, not to do things that make our own lives easier but harm them. Indeed. We might want to speed Python up a bit as well :) 2. Using a thread-local variable is less error prone. When passing tstate as a parameter, what happens if the tstate argument is from a different thread or is NULL? Are you adding checks for those cases? What are the performance implications of adding those checks? Undefined behaviour is totally acceptable here. We can assert in debug builds - developers who make use of this can test with debug builds. I'm not sure what your point is about undefined behaviour. 3. Using a thread-local variable is likely to be a little bit faster. Passing an argument down the stack increases register pressure and spills. Accessing a thread-local is slower at the point of access, but the cost is incurred only when it is needed, so is cheaper overall. Compilers can optimise parameters/locals in ways that are far more efficient than they can do for anything stored outside the call stack. Especially for internal calls. Going through public/exported functions is a little more restricted in terms of optimisations, but if we identify an issue here then we can work on that then. Please skip the patronizing "how compilers work" stuff. I know how register allocators work. [OTHER POST] Just to be clear, this is what I mean by a thread local variable: https://godbolt.org/z/dpSo-Q Showing what one particular compiler generates for one particular situation is terrible information (I won't bother calling it "evidence"). The particular situation is the use of a thread-local variable, which is the point under discussion. Here's the links for clang and MSVC: https://godbolt.org/z/YnbbqD https://www.godbolt.ms/z/9nQEqf One motivation is to ease the implementation of subinterpreters (PEP 554). But PEP 554 describes more than public API than the implementation. I don't see how this eases the implementation of subinterpreters. Surely it makes it harder by causing merge conflicts. That's a very selfish point-of-view :) Why? Merge conflicts are a problem for everyone. It eases it because many more operations need to know the current Python "thread" in order to access things that used to be globals, such as PyTypeObject instances. Having the thread state easily and efficiently accessible does make a difference here. Indeed. I want it to be easily and efficiently accessible. Putting it a thread-local variable does both. For additional efficiency `_PyThreadState_GET()` can be defined as `inline`. The long-term goal is to be able to run multiple isolated interpreters in parallel. An admirable goal, IMO. But how is it to be done? That is the question. By isolating thr
[Python-Dev] Changing layout of f_localsplus in frame objects
(Apologies if you're seeing this twice. I first posted to the discourse instance.) I first worked on a register-based virtual machine in the 1.5.2 timeframe. That was before PEP 227 (closures) landed. Prior to that, local variables and the stack were contiguous in the f_localsplus array. Based on a comment at the time from Tim Peters (which I recall but can no longer find), the number of required registers won’t be any greater than the maximum extent of the stack. (It makes sense, and worked for me at the time.) That allowed me to repurpose the stack space as my registers and treat the full f_localsplus array as a single large “register file,” eliminating LOAD_FAST_REG and STORE_FAST_REG instructions completely, which I believe was a major win. I want to rearrange the current f_localsplus to make locals and stack neighbors again. My first look at the code suggested there is no good reason it can’t be done, but figured Jeremy Hylton must have had a good reason to prefer the current layout which places cells and frees in the middle. While the necessary changes didn’t look extensive, they did look a bit tedious to get right, which I have confirmed. My first attempt to reorganize f_localsplus from locals/cells/frees/stack to locals/stack/cells/frees has been an abysmal failure. I’ve found a couple mistakes and corrected them. Implementing those corrections caused the types of failures to change (convincing me they were necessary), but did not eliminate them entirely (so, necessary, but not sufficient). I’ve clearly missed something. I’m also fairly ignorant about recent changes to the language (big understatement), so thought that before going any further, I would see if anyone with better current knowledge of frame objects knew of a reason why my desired layout change wouldn’t work. Skip ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NWQKIDC2IZXJZFQPJDYAWIWNKYUHCSH3/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon wrote: > The accessibility of a thread-local variable is a strict superset of > that of a function-local variable. > > Therefore storing the thread state in a thread-local variable is at > least as capable as passing thread-state as a parameter. > And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here. ChrisA ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NLDQYOZYFDDXR73MR4XSLKN5NWZPSLWE/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Proliferation of tstate arguments.
On 17Mar2020 1803, Chris Angelico wrote: On Wed, Mar 18, 2020 at 3:50 AM Mark Shannon wrote: The accessibility of a thread-local variable is a strict superset of that of a function-local variable. Therefore storing the thread state in a thread-local variable is at least as capable as passing thread-state as a parameter. And by that logic, globals are even more capable. I don't understand your point. Isn't the purpose of the tstate parameters to avoid the problem of being unable to have multiple tstates within the same OS thread? I think I've missed something here. You haven't. Separating the Python thread from the "physical" thread is indeed the point. Cheers, Steve ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GJDHY2AEF2BSBMNWZRP5IXVEGSWB6IF4/ Code of Conduct: http://python.org/psf/codeofconduct/