Hi Michal, > On 14 May 2023, at 16:36, Michal Jankovič <michal.jankovi...@gmail.com> wrote: > > Rebased the patch to GCC 14 trunk. Bootstrapped and regression tested > again on x86_64-pc-linux-gnu, only difference is the new test failing > without the patch.
(as previously noted, I am much in favour of this optimisation) Do you have any metrics on the reductions in frame size for realistic coroutines? thanks Iain > > On Jul 13 2022, at 2:54 pm, Michal Jankovic > <michal.jankovi...@gmail.com> wrote: > >> Hi Iain, >> >> thanks for the info. I have some follow-up questions. >> >> On Jul 12 2022, at 7:11 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >> >>> Hi Michal, >>> >>>> On 12 Jul 2022, at 16:14, Michal Jankovič >>>> <michal.jankovi...@gmail.com> wrote: >>> >>>> One other related thing I would like to investigate is reducing the >>>> number of compiler generated variables in the frame, particularly >>>> _Coro_destroy_fn and _Coro_self_handle. >>>> >>>> As I understand it, _Coro_destroy_fn just sets a flag in >>>> _Coro_resume_index and calls _Coro_resume_fn; it should be possible to >>>> move this logic to __builtin_coro_destroy, so that only _Coro_resume_fn >>>> is stored in the frame; >>> >>> That is a particular point about GCC’s implementation … (it is not >>> neccesarily, or even >>> likely to be the same for other implementations) - see below. >>> >>> I was intending to do experiment with making the ramp/resume/destroy >>> value a parameter >>> to the actor function so that we would have something like - >>> >>> ramp calls actor(frame, 0) >>> resume calls actor(frame, 1) >>> destroy calls actor(frame, 2) >>> - the token values are illustrative, not intended to be a final version. >>> >>> I think that should allow for more inlining opportunites and possibly >>> a way forward to >>> frame elision (a.k.a halo). >>> >>>> this would however change the coroutine ABI - I don't know if that's >>>> a problem. >>> >>> The external ABI for the coroutine is the >>> resume, >>> destroy pointers >>> and the promise >>> and that one can find each of these from the frame pointer. >>> >>> This was agreed between the interested “vendors” so that one compiler >>> could invoke >>> coroutines built by another. So I do not think this is so much a >>> useful area to explore. >>> >> >> I understand. I still want to try to implement a more light-weight frame >> layout with just one function pointer; would it be possible to merge >> such a change if it was made opt-in via a compiler flag, eg >> `-fsmall-coroutine-frame`? My use-case for this is embedded environments >> with very limited memory, and I do not care about interoperability with >> other compilers there. >> >>> Also the intent is that an indirect call through the frame pointer is >>> the most frequent >>> operation so should be the most efficient. >>> resume() might be called many times, >>> destroy() just once thus it is a cold code path >>> - space can be important too - but interoperability was the goal here. >>> >>>> The _Coro_self_handle should be constructible on-demand from the >>>> frame address. >>> >>> Yes, and in the header the relevant items are all constexpr - so that >>> should happen in the >>> user’s code. I elected to have that value in the frame to avoid >>> recreating it each time - I >>> suppose that is a trade-off of one oiptimisation c.f. another … >> >> If the handle construction cannot be optimized out, and its thus >> a tradeoff between frame size and number of instructions, then this >> could also be enabled by a hypothetical `-fsmall-coroutine-frame`. >> >> Coming back to this: >> >>>>> (the other related optimisation is to eliminate frame entries for >>>>> scopes without any suspend >>>>> points - which has the potential to save even more space for code with >>>>> sparse use of co_xxxx) >> >> This would be nice; although it could encompassed by a more general >> optimization - eliminate frame entries for all variables which are not >> >> accessed (directly or via pointer / reference) beyond a suspend point. >> To be fair, I do not know how to get started on such an optimization, >> or if it is even possible to do on the frontend. This would however be >> immensely useful for reducing the frame size taken-up by complicated >> co_await expressions (among other things), for example, if I have a >> composed operation: >> >> co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2()); >> >> Right now, this creates space in the frame for the temporary 'leaf' >> awaitables, which were already moved into the composed awaitable. >> If the awaitable has an operator co_await that returns the real awaiter, >> the original awaitable is also stored in the frame, even if it >> is not referenced by the awaiter; another unused object gets stored if >> >> the .await_transform() customization point was used. >> >> What are your thoughts on the feasibility / difficulty of implementing >> such an optimization? >> >> Michal >> >>>> >>>> Do you have any advice / opinions on this before I try to implement it? >>> >>> Hopefully, the notes above help. >>> >>> I will rebase my latest code changes as soon as I have a chance and >>> put them somewhere >>> for you to look at - basically, these are to try and address the >>> correctness issues we face, >>> >>> Iain >>> >>> >>>> >>>> Michal >>>> >>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe <i...@sandoe.co.uk> wrote: >>>> >>>>> Hi Michal, >>>>> >>>>>> On 12 Jul 2022, at 14:35, Michal Jankovič via Gcc-patches >>>>>> <gcc-patches@gcc.gnu.org> wrote: >>>>>> >>>>>> Currently, coroutine frames store all variables of a coroutine >>>>>> separately, >>>>>> even if their lifetime does not overlap (they are in distinct >>>>>> scopes). This >>>>>> patch implements overlapping distinct variable scopes in the >>>>>> coroutine frame, >>>>>> by storing the frame fields in nested unions of structs. This lowers >>>>>> the size >>>>>> of the frame for larger coroutines significantly, and makes them >>>>>> more usable >>>>>> on systems with limited memory. >>>>> >>>>> not a review (I will try to take a look at the weekend). >>>>> >>>>> but … this is one of the two main optimisations on my TODO - so cool >>>>> for doing it. >>>>> >>>>> (the other related optimisation is to eliminate frame entries for >>>>> scopes without any suspend >>>>> points - which has the potential to save even more space for code with >>>>> sparse use of co_xxxx) >>>>> >>>>> Iain >>>>> >>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new >>>>>> test fails >>>>>> before the patch and succeeds after with no regressions. >>>>>> >>>>>> PR c++/105989 >>>>>> >>>>>> gcc/cp/ChangeLog: >>>>>> >>>>>> * coroutines.cc (struct local_var_info): Add field_access_path. >>>>>> (build_local_var_frame_access_expr): New. >>>>>> (transform_local_var_uses): Use build_local_var_frame_access_expr. >>>>>> (coro_make_frame_entry_id): New. >>>>>> (coro_make_frame_entry): Delegate to coro_make_frame_entry_id. >>>>>> (struct local_vars_frame_data): Add orig, field_access_path. >>>>>> (register_local_var_uses): Generate new frame layout. Create access >>>>>> paths to vars. >>>>>> (morph_fn_to_coro): Set new fields in local_vars_frame_data. >>>>>> >>>>>> gcc/testsuite/ChangeLog: >>>>>> >>>>>> * g++.dg/coroutines/pr105989.C: New test. >>>>>> >>>>>> <pr105989.patch> >>>>> >>>>> >>> >>> > <pr105989.patch>