Hi Michal,

> On 14 May 2023, at 16:36, Michal Jankovič <michal.jankovi...@gmail.com> wrote:
> 
> Rebased the patch to GCC 14 trunk. Bootstrapped and regression tested
> again on x86_64-pc-linux-gnu, only difference is the new test failing
> without the patch.

(as previously noted, I am much in favour of this optimisation)

Do you have any metrics on the reductions in frame size for realistic 
coroutines?

thanks
Iain

> 
> On Jul 13 2022, at 2:54 pm, Michal Jankovic
> <michal.jankovi...@gmail.com> wrote:
> 
>> Hi Iain,
>> 
>> thanks for the info. I have some follow-up questions.
>> 
>> On Jul 12 2022, at 7:11 pm, Iain Sandoe <i...@sandoe.co.uk> wrote:
>> 
>>> Hi Michal,
>>> 
>>>> On 12 Jul 2022, at 16:14, Michal Jankovič
>>>> <michal.jankovi...@gmail.com> wrote:
>>> 
>>>> One other related thing I would like to investigate is reducing the
>>>> number of compiler generated variables in the frame, particularly
>>>> _Coro_destroy_fn and _Coro_self_handle.    
>>>> 
>>>> As I understand it, _Coro_destroy_fn just sets a flag in
>>>> _Coro_resume_index and calls _Coro_resume_fn; it should be possible to
>>>> move this logic to __builtin_coro_destroy, so that only _Coro_resume_fn
>>>> is stored in the frame;
>>> 
>>> That is a particular point about GCC’s implementation … (it is not
>>> neccesarily, or even
>>> likely to be the same for other implementations) - see below.
>>> 
>>> I was intending to do experiment with making the ramp/resume/destroy
>>> value a parameter
>>> to the actor function so that we would have something like -
>>> 
>>> ramp calls      actor(frame, 0)
>>> resume calls  actor(frame, 1)
>>> destroy calls  actor(frame, 2)   
>>> - the token values are illustrative, not intended to be a final version.
>>> 
>>> I think that should allow for more inlining opportunites and possibly
>>> a way forward to
>>> frame elision (a.k.a halo).
>>> 
>>>> this would however change the coroutine ABI - I don't know if that's
>>>> a problem.
>>> 
>>> The external ABI for the coroutine is the   
>>> resume,
>>> destroy pointers   
>>> and the promise   
>>> and that one can find each of these from the frame pointer.
>>> 
>>> This was agreed between the interested “vendors” so that one compiler
>>> could invoke
>>> coroutines built by another.  So I do not think this is so much a
>>> useful area to explore.
>>> 
>> 
>> I understand. I still want to try to implement a more light-weight frame
>> layout with just one function pointer; would it be possible to merge
>> such a change if it was made opt-in via a compiler flag, eg
>> `-fsmall-coroutine-frame`? My use-case for this is embedded environments
>> with very limited memory, and I do not care about interoperability with
>> other compilers there.   
>> 
>>> Also the intent is that an indirect call through the frame pointer is
>>> the most frequent
>>> operation so should be the most efficient.    
>>> resume() might be called many times,   
>>> destroy() just once thus it is a cold code path   
>>> - space can be important too - but interoperability was the goal here.
>>> 
>>>> The _Coro_self_handle should be constructible on-demand from the
>>>> frame address.
>>> 
>>> Yes, and in the header the relevant items are all constexpr - so that
>>> should happen in the
>>> user’s code.  I elected to have that value in the frame to avoid
>>> recreating it each time - I
>>> suppose that is a trade-off of one oiptimisation c.f. another …   
>> 
>> If the handle construction cannot be optimized out, and its thus   
>> a tradeoff between frame size and number of instructions, then this
>> could also be enabled by a hypothetical `-fsmall-coroutine-frame`.
>> 
>> Coming back to this:
>> 
>>>>> (the other related optimisation is to eliminate frame entries for
>>>>> scopes without any suspend
>>>>> points - which has the potential to save even more space for code with
>>>>> sparse use of co_xxxx)
>> 
>> This would be nice; although it could encompassed by a more general   
>> optimization - eliminate frame entries for all variables which are not
>> 
>> accessed (directly or via pointer / reference) beyond a suspend point.
>> To be fair, I do not know how to get started on such an optimization,
>> or if it is even possible to do on the frontend. This would however be
>> immensely useful for reducing the frame size taken-up by complicated
>> co_await expressions (among other things), for example, if I have a
>> composed operation:
>> 
>> co_await when_either(get_leaf_awaitable_1(), get_leaf_awaitable_2());
>> 
>> Right now, this creates space in the frame for the temporary 'leaf'   
>> awaitables, which were already moved into the composed awaitable.
>> If the awaitable has an operator co_await that returns the real awaiter,
>> the original awaitable is also stored in the frame, even if it   
>> is not referenced by the awaiter; another unused object gets stored if
>> 
>> the .await_transform() customization point was used.
>> 
>> What are your thoughts on the feasibility / difficulty of implementing
>> such an optimization?
>> 
>> Michal
>> 
>>>> 
>>>> Do you have any advice / opinions on this before I try to implement it?
>>> 
>>> Hopefully, the notes above help.
>>> 
>>> I will rebase my latest code changes as soon as I have a chance and
>>> put them somewhere
>>> for you to look at - basically, these are to try and address the
>>> correctness issues we face,
>>> 
>>> Iain
>>> 
>>> 
>>>> 
>>>> Michal
>>>> 
>>>> On Jul 12 2022, at 4:08 pm, Iain Sandoe <i...@sandoe.co.uk> wrote:
>>>> 
>>>>> Hi Michal,
>>>>> 
>>>>>> On 12 Jul 2022, at 14:35, Michal Jankovič via Gcc-patches
>>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>> 
>>>>>> Currently, coroutine frames store all variables of a coroutine 
>>>>>> separately,
>>>>>> even if their lifetime does not overlap (they are in distinct
>>>>>> scopes). This
>>>>>> patch implements overlapping distinct variable scopes in the
>>>>>> coroutine frame,
>>>>>> by storing the frame fields in nested unions of structs. This lowers
>>>>>> the size
>>>>>> of the frame for larger coroutines significantly, and makes them
>>>>>> more usable
>>>>>> on systems with limited memory.
>>>>> 
>>>>> not a review (I will try to take a look at the weekend).
>>>>> 
>>>>> but … this is one of the two main optimisations on my TODO - so cool
>>>>> for doing it.
>>>>> 
>>>>> (the other related optimisation is to eliminate frame entries for
>>>>> scopes without any suspend
>>>>> points - which has the potential to save even more space for code with
>>>>> sparse use of co_xxxx)
>>>>> 
>>>>> Iain
>>>>> 
>>>>>> Bootstrapped and regression tested on x86_64-pc-linux-gnu; new
>>>>>> test fails
>>>>>> before the patch and succeeds after with no regressions.
>>>>>> 
>>>>>>  PR c++/105989
>>>>>> 
>>>>>> gcc/cp/ChangeLog:
>>>>>> 
>>>>>>  * coroutines.cc (struct local_var_info): Add field_access_path.
>>>>>>  (build_local_var_frame_access_expr): New.
>>>>>>  (transform_local_var_uses): Use build_local_var_frame_access_expr.
>>>>>>  (coro_make_frame_entry_id): New.
>>>>>>  (coro_make_frame_entry): Delegate to coro_make_frame_entry_id.
>>>>>>  (struct local_vars_frame_data): Add orig, field_access_path.
>>>>>>  (register_local_var_uses): Generate new frame layout. Create access
>>>>>>  paths to vars.
>>>>>>  (morph_fn_to_coro): Set new fields in local_vars_frame_data.    
>>>>>> 
>>>>>> gcc/testsuite/ChangeLog:
>>>>>> 
>>>>>>  * g++.dg/coroutines/pr105989.C: New test.
>>>>>> 
>>>>>> <pr105989.patch>
>>>>> 
>>>>> 
>>> 
>>> 
> <pr105989.patch>

Reply via email to