Hi Michal, > On 14 May 2023, at 17:31, Michal Jankovič via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote:
> I do not currently have metrics for this, but I can look into generating > them, To be clear, this is not in any way a precondition for patch acceptance but I am curious as to some idea of the improvements seen. > however > I currently do not know of any large open-source projects > using coroutines that I could use for this; I was thinking about using > cppcoro unit tests, but they mostly contain very simple coroutines. Even cppcoro would be some idea. Larger OSS projects I know of are folly (which I have out of tree patches to enable the coroutines tests) and seastar (which I have no idea how to build at present). If you are interested to try building folly (note: it has a lot of prereqs) then I can push my WIP branch somewhere. > I > have source for ~20k LoC proprietary project relying heavily on > coroutines (using boost::asio::awaitable), but here I cannot show the > source along with the numbers - would this be enough or should I look > for more open source projects? The internal project figures are also useful (IMO) even tho we cannot obviously repeat them. thanks Iain > > thanks, > Michal > > On May 14 2023, at 6:07 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: > >> Hi Michal, >> >>> On 14 May 2023, at 16:36, Michal Jankovič >>> <michal.jankovi...@gmail.com> wrote: >>> >>> Rebased the patch to GCC 14 trunk. Bootstrapped and regression tested >>> again on x86_64-pc-linux-gnu, only difference is the new test failing >>> without the patch. >> >> (as previously noted, I am much in favour of this optimisation) >> >> Do you have any metrics on the reductions in frame size for realistic >> coroutines? >> >> thanks >> Iain >> >>> >>> On Jul 13 2022, at 2:54 pm, Michal Jankovic >>> <michal.jankovi...@gmail.com> wrote: >>> >>>> Hi Iain, >>>> >>>> thanks for the info. I have some follow-up questions. >>>> >>>> On Jul 12 2022, at 7:11 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >>>> >>>>> Hi Michal, >>>>> >>>>>> On 12 Jul 2022, at 16:14, Michal Jankovič >>>>>> <michal.jankovi...@gmail.com> wrote: >>>>> >>>>>> One other related thing I would like to investigate is reducing the >>>>>> number of compiler generated variables in the frame, particularly >>>>>> _Coro_destroy_fn and _Coro_self_handle. >>>>>> >>>>>> As I understand it, _Coro_destroy_fn just sets a flag in >>>>>> _Coro_resume_index and calls _Coro_resume_fn; it should be >>>>>> possible to >>>>>> move this logic to __builtin_coro_destroy, so that only _Coro_resume_fn >>>>>> is stored in the frame; >>>>> >>>>> That is a particular point about GCC’s implementation … (it is not >>>>> neccesarily, or even >>>>> likely to be the same for other implementations) - see below. >>>>> >>>>> I was intending to do experiment with making the ramp/resume/destroy >>>>> value a parameter >>>>> to the actor function so that we would have something like - >>>>> >>>>> ramp calls actor(frame, 0) >>>>> resume calls actor(frame, 1) >>>>> destroy calls actor(frame, 2) >>>>> - the token values are illustrative, not intended to be a final version. >>>>> >>>>> I think that should allow for more inlining opportunites and possibly >>>>> a way forward to >>>>> frame elision (a.k.a halo). >>>>> >>>>>> this would however change the coroutine ABI - I don't know if that's >>>>>> a problem. >>>>> >>>>> The external ABI for the coroutine is the >>>>> resume, >>>>> destroy pointers >>>>> and the promise >>>>> and that one can find each of these from the frame pointer. >>>>> >>>>> This was agreed between the interested “vendors” so that one compiler >>>>> could invoke >>>>> coroutines built by another. So I do not think this is so much a >>>>> useful area to explore. >>>>> >>>> >>>> I understand. I still want to try to implement a more light-weight frame >>>> layout with just one function pointer; would it be possible to merge >>>> such a change if it was made opt-in via a compiler flag, eg >>>> `-fsmall-coroutine-frame`? My use-case for this is embedded environments >>>> with very limited memory, and I do not care about interoperability with >>>> other compilers there. >>>> >>>>> Also the intent is that an indirect call through the frame pointer is >>>>> the most frequent >>>>> operation so should be the most efficient. >>>>> resume() might be called many times, >>>>> destroy() just once thus it is a cold code path >>>>> - space can be important too - but interoperability was the goal here. >>>>> >>>>>> The _Coro_self_handle should be constructible on-demand from the >>>>>> frame address. >>>>> >>>>> Yes, and in the header the relevant items are all constexpr - so that >>>>> should happen in the >>>>> user’s code. I elected to have that value in the frame to avoid >>>>> recreating it each time - I >>>>> suppose that is a trade-off of one oiptimisation c.f. another … >>>> >>>> If the handle construction cannot be optimized out, and its thus >>>> a tradeoff between frame size and number of instructions, then this >>>> could also be enabled by a hypothetical `-fsmall-coroutine-frame`. >>>> >>>> Coming back to this: >>>> >>>>>>> (the other related optimisation is to eliminate frame entries for >>>>>>> scopes without any suspend >>>>>>> points - which has the potential to save even more space for code with >>>>>>> sparse use of co_xxxx) >>>> >>>> This would be nice; although it could encompassed by a more general >>>> >>>> optimization - eliminate frame entries for all variables which are not >>>> >>>> accessed (directly or via pointer / reference) beyond a suspend point. >>>> To be fair, I do not know how to get started on such an optimization, >>>> or if it is even possible to do on the frontend. This would however be >>>> immensely useful for reducing the frame size taken-up by complicated >>>> co_await expressions (among other things), for example, if I have a >>>> composed operation: >>>> >>>> co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2()); >>>> >>>> Right now, this creates space in the frame for the temporary 'leaf' >>>> >>>> awaitables, which were already moved into the composed awaitable. >>>> If the awaitable has an operator co_await that returns the real awaiter, >>>> the original awaitable is also stored in the frame, even if it >>>> is not referenced by the awaiter; another unused object gets stored if >>>> >>>> the .await_transform() customization point was used. >>>> >>>> What are your thoughts on the feasibility / difficulty of implementing >>>> such an optimization? >>>> >>>> Michal >>>> >>>>>> >>>>>> Do you have any advice / opinions on this before I try to >>>>>> implement it? >>>>> >>>>> Hopefully, the notes above help. >>>>> >>>>> I will rebase my latest code changes as soon as I have a chance and >>>>> put them somewhere >>>>> for you to look at - basically, these are to try and address the >>>>> correctness issues we face, >>>>> >>>>> Iain >>>>> >>>>> >>>>>> >>>>>> Michal >>>>>> >>>>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >>>>>> >>>>>>> Hi Michal, >>>>>>> >>>>>>>> On 12 Jul 2022, at 14:35, Michal Jankovič via Gcc-patches >>>>>>>> <gcc-patches@gcc.gnu.org> wrote: >>>>>>>> >>>>>>>> Currently, coroutine frames store all variables of a coroutine >>>>>>>> separately, >>>>>>>> even if their lifetime does not overlap (they are in distinct >>>>>>>> scopes). This >>>>>>>> patch implements overlapping distinct variable scopes in the >>>>>>>> coroutine frame, >>>>>>>> by storing the frame fields in nested unions of structs. This lowers >>>>>>>> the size >>>>>>>> of the frame for larger coroutines significantly, and makes them >>>>>>>> more usable >>>>>>>> on systems with limited memory. >>>>>>> >>>>>>> not a review (I will try to take a look at the weekend). >>>>>>> >>>>>>> but … this is one of the two main optimisations on my TODO - so cool >>>>>>> for doing it. >>>>>>> >>>>>>> (the other related optimisation is to eliminate frame entries for >>>>>>> scopes without any suspend >>>>>>> points - which has the potential to save even more space for code with >>>>>>> sparse use of co_xxxx) >>>>>>> >>>>>>> Iain >>>>>>> >>>>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new >>>>>>>> test fails >>>>>>>> before the patch and succeeds after with no regressions. >>>>>>>> >>>>>>>> PR c++/105989 >>>>>>>> >>>>>>>> gcc/cp/ChangeLog: >>>>>>>> >>>>>>>> * coroutines.cc (struct local_var_info): Add field_access_path. >>>>>>>> (build_local_var_frame_access_expr): New. >>>>>>>> (transform_local_var_uses): Use >>>>>>>> build_local_var_frame_access_expr. >>>>>>>> (coro_make_frame_entry_id): New. >>>>>>>> (coro_make_frame_entry): Delegate to coro_make_frame_entry_id. >>>>>>>> (struct local_vars_frame_data): Add orig, field_access_path. >>>>>>>> (register_local_var_uses): Generate new frame layout. Create >>>>>>>> access >>>>>>>> paths to vars. >>>>>>>> (morph_fn_to_coro): Set new fields in local_vars_frame_data. >>>>>>>> >>>>>>>> >>>>>>>> gcc/testsuite/ChangeLog: >>>>>>>> >>>>>>>> * g++.dg/coroutines/pr105989.C: New test. >>>>>>>> >>>>>>>> <pr105989.patch> >>>>>>> >>>>>>> >>>>> >>>>> >>> <pr105989.patch> >> >>