Re: [GCC16,RFC,V2 03/14] aarch64: add new insn definition for st2g

Richard Sandiford Tue, 06 May 2025 03:53:23 -0700

Indu Bhagat <indu.bha...@oracle.com> writes:
> On 4/15/25 9:21 AM, Richard Sandiford wrote:
>> Indu Bhagat <indu.bha...@oracle.com> writes:
>>> Store Allocation Tags (st2g) is an Armv8.5-A memory tagging (MTE)
>>> instruction. It stores an allocation tag to two tag granules of memory.
>>>
>>> TBD:
>>>    - Not too sure what is the best way to generate the st2g yet; A
>>>      subsequent patch will emit them in one of the target hooks.
>> 
>> Regarding the previous thread about this:
>> 
>>      https://gcc.gnu.org/pipermail/gcc-patches/2024-November/668671.html
>> 
>> and your question about whether all types of store tag instructions
>> should be volatile: if we went for that approach, then yeah, I think so.
>> 
>> As I mentioned there, I don't think we should use (unspec ...) memory
>> addresses.
>> 
>> But thinking more about it: can we guarantee that GCC will only use
>> these instruction patterns with base registers that are aligned to
>> 16 bytes?  If so, then perhaps an alternative would be to model
>> them as read-modify-write operations to the whole granule (even though
>> the actual instructions leave normal memory untouched and only change
>> the tags).  That is, rather than:
>> 
>>>
>>> gcc/ChangeLog:
>>>
>>>     * config/aarch64/aarch64.md (st2g): New definition.
>>> ---
>>>   gcc/config/aarch64/aarch64.md | 20 ++++++++++++++++++++
>>>   1 file changed, 20 insertions(+)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 0c7aebb838cd..d3223e275c51 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -8475,6 +8475,26 @@
>>>     [(set_attr "type" "memtag")]
>>>   )
>>>   
>>> +;; ST2G updates allocation tags for two memory granules (i.e. 32 bytes) at
>>> +;; once, without zero initialization.
>>> +(define_insn "st2g"
>>> +  [(set (mem:QI (unspec:DI
>>> +    [(plus:DI (match_operand:DI 1 "register_operand" "rk")
>>> +              (match_operand:DI 2 "aarch64_granule16_simm9" "i"))]
>>> +    UNSPEC_TAG_SPACE))
>>> +   (and:QI (lshiftrt:DI (match_operand:DI 0 "register_operand" "rk")
>>> +                        (const_int 56)) (const_int 15)))
>>> +   (set (mem:QI (unspec:DI
>>> +    [(plus:DI (match_dup 1)
>>> +              (match_operand:DI 3 "aarch64_granule16_simm9" "i"))]
>>> +    UNSPEC_TAG_SPACE))
>>> +   (and:QI (lshiftrt:DI (match_dup 0)
>>> +                        (const_int 56)) (const_int 15)))]
>>> +  "TARGET_MEMTAG && (INTVAL (operands[2]) - 16 == INTVAL (operands[3]))"
>>> +  "st2g\\t%0, [%1, #%2]"
>>> +  [(set_attr "type" "memtag")]
>>> +)
>>> +
>> 
>> ...this, we could do:
>> 
>> (set (match_operand:OI 0 "aarch64_granule_memory_operand" "+<new 
>> constraint>")
>>       (unspec_volatile:OI
>>         [(match_dup 0)
>>          (match_operand:DI 1 "register_operand" "rk")]
>>         UNSPECV...))
>> 
>> Using OImode (256 bytes) indicates that two full granules are affected
>> by the store, but that no other memory is affected.  The (match_dup 0)
>> read indicates that this store does not kill any previous store to the
>> same 256 bytes (since the contents of normal memory don't change).
>> The unspec_volatile should ensure that nothing tries to remove the
>> store as dead (which would especially be a problem when clearing tags).
>> 
>
> I dont understand the statement: "The (match_dup 0) read indicates that 
> this store does not kill any previous store to the same 256 bytes".


The problem is that if we had, say:

  (set (mem:TI x) (const_int 0))
  (set (mem:TI x) (unspec_volatile [(reg:DI base)] UNSPECV...))

the (mem:TI x) in the second instruction would seem to overwrite the
result of the first instruction, making the first instruction dead.
I would expect DSE to get rid of the zeroing in this case.

> I am currently seeing an issue (mentioned below).
>
>> Using a single memory operand for the whole instruction has the advantage
>> of only requiring one offset to be represented, rather than having both
>> operands 2 and 3 in the original pattern.  It also copes more easily
>> with cases where the offset is zero for the first or second address,
>> since no (plus ...) should be present in that case.
>> 
>
> Currently I am using:
>
> (define_insn "stg"
>    [(set (match_operand:TI 0 "aarch64_granule16_memory_operand" "+Umg")
>        (unspec_volatile:TI
>          [(match_dup 0)
>           (match_operand:DI 1 "register_operand" "rk")]
>          UNSPECV_TAG_SPACE))]
>    "TARGET_MEMTAG"
>    "stg\\t%1, %0"
>    [(set_attr "type" "memtag")]
> )
>
> ...
>
> (define_predicate "aarch64_granule16_memory_operand"
>    (and (match_test "TARGET_MEMTAG")
>         (and (match_code "mem")
>              (match_test "aarch64_granule16_memory_address_p (op)"))))
>
> where aarch64_granule16_memory_address_p () simply checks for 
> aarch64_granule16_simm9 immediate for now.
>
> Basically, I was expecting the generation of a POST_MODIFY for :
>          stg     x0, [x2]
>          add     x2, x2, 16
>
> But in the rtl dump (XX.c.300r.auto_inc_dec):
>
> (insn 31 44 32 3 (set (mem:TI (plus:DI (reg/f:DI 122 [ _10 ])
>                  (const_int 0 [0])) [0  S16 A8])

Not related to the issue you're hitting, but this indicates that
something has gone wrong somewhere.  We shouldn't see an unfolded
(plus ... (const_int 0)) in the RTL insns.

>          (unspec_volatile:TI [
>                  (mem:TI (plus:DI (reg/f:DI 122 [ _10 ])
>                          (const_int 0 [0])) [0  S16 A8])
>                  (reg:DI 120)
>              ] UNSPECV_TAG_SPACE)) "alloca-1.c":8:12 1231 {stg}
>       (nil))
> (insn 32 31 33 3 (set (reg/f:DI 122 [ _10 ])
>          (plus:DI (reg/f:DI 122 [ _10 ])
>              (const_int 16 [0x10]))) "alloca-1.c":8:12 121 {*adddi3_aarch64}
>       (nil))
>
> starting bb 3
>     33: {cc:CC=cmp(r121:DI,0x10);r121:DI=r121:DI-0x10;}
>     32: r122:DI=r122:DI+0x10
>     31: [r122:DI+0]=unspec/v[[r122:DI+0],r120:DI] 17
> mem count failure
> mem count failure

Yeah, we'd need to update auto-inc-dec for this case.

But I'd forgotten that match_dup applies strict rtx_equal_p equality,
whereas here we'd need the looser operands_match_p equality, with:

  /* If two operands must match, because they are really a single
     operand of an assembler insn, then two postincrements are invalid
     because the assembler insn would increment only once.
     On the other hand, a postincrement matches ordinary indexing
     if the postincrement is the output operand.  */
  if (code == POST_DEC || code == POST_INC || code == POST_MODIFY)
    return operands_match_p (XEXP (x, 0), y);
  /* Two preincrements are invalid
     because the assembler insn would increment only once.
     On the other hand, a preincrement matches ordinary indexing
     if the preincrement is the input operand.
     In this case, return 2, since some callers need to do special
     things when this happens.  */
  if (GET_CODE (y) == PRE_DEC || GET_CODE (y) == PRE_INC
      || GET_CODE (y) == PRE_MODIFY)
    return operands_match_p (x, XEXP (y, 0)) ? 2 : 0;

So we probably do want two operands with matching constraints after all,
rather than a simple match_dup:

  (match_operand:TI 1 "..._memory_operand" "0")

It's a long time since I worked on a target that wanted to use this
matching feature though.

An alternative would be to define separate instructions that do the
register increment in parallel with the memory operation, like the
ldp/stp patterns.

Thanks,
Richard

Re: [GCC16,RFC,V2 03/14] aarch64: add new insn definition for st2g

Reply via email to