Dhruv Chawla <dhr...@nvidia.com> writes:
> On 08/05/25 18:43, Richard Sandiford wrote:
>> Otherwise it looks good.  But I think we should think about how we
>> plan to integrate the related optimisation for register inputs.  E.g.:
>> 
>> int32x4_t foo(int32_t x) {
>>      return vsetq_lane_s32(x, vdupq_n_s32(0), 0);
>> }
>> 
>> generates:
>> 
>> foo:
>>          movi    v0.4s, 0
>>          ins     v0.s[0], w0
>>          ret
>> 
>> rather than a single FMOV.  (UMOV goes the other way, from a vector
>> lane to a GPR.)  Same idea when the input is in an FPR rather than
>> a GPR, using the register-to-register form of FMOV.
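>> 
>> That is, the ideal output would be something like this, since a
>> write to a scalar FP register zeroes the remaining bits of the
>> full vector register:
>> 
>> foo:
>>         fmov    s0, w0
>>         ret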
>> 
>> Conventionally, the register and memory forms should be listed as
>> alternatives in a single pattern, but that's somewhat complex because of
>> the different instruction availability for 64-bit+32-bit, 16-bit, and
>> 8-bit register operations.
>> 
>> My worry is that if we handle the register case as an entirely separate
>> patch, it would have to rewrite this one.
>
> I have been experimenting with this, and yeah, it gets quite messy when
> trying to handle the memory and register cases together. Would it be okay
> to enable the register case only for 64-/32-bit sizes? It would complicate
> the code only a little and could still be done with a single pattern. I've
> attached a patch that does this.

Unfortunately, I don't think it's that easy -- more below.

> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 6e30dc48934..5368b7f21fe 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1164,6 +1164,24 @@
>     [(set_attr "type" "neon_logic<q>")]
>   )
>   
> +(define_insn "*aarch64_simd_vec_set_low<mode>"
> +  [(set (match_operand:VALL_F16 0 "register_operand")
> +     (vec_merge:VALL_F16
> +       (vec_duplicate:VALL_F16
> +         (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand"))
> +       (match_operand:VALL_F16 3 "aarch64_simd_imm_zero")
> +       (match_operand:SI 2 "const_int_operand")))]
> +  "TARGET_SIMD
> +   && ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 0
> +   && (aarch64_simd_mem_operand_p (operands[1]) ||
> +       GET_MODE_UNIT_BITSIZE (<MODE>mode) >= 32)"

Adding a condition like this isn't safe, because the constraints:

> +  {@ [ cons: =0 , 1   ; attrs: type  ]
> +     [ w        , w   ; neon_move<q> ] fmov\t%<Vetype>0, %<Vetype>1
> +     [ w        , r   ; neon_from_gp ] fmov\t%<Vetype>0, %<vwcore>1
> +     [ w        , Utv ; f_loads      ] ldr\t%<Vetype>0, %1
> +  }
> +)

promise that the w and r sources are supported for all modes.

When reloading an existing instruction, LRA only needs to look at
the instruction's constraints.  It doesn't need to look at the C++
condition or the predicates.

This means that if we provide a w,w alternative for V8HF (say), LRA
might use it in certain cases.  We could then either get wrong code or
an ICE, depending on whether something notices that the instruction no
longer matches its C++ condition.
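For example, with something like:

float16x8_t bar(float16_t *p) {
    return vsetq_lane_f16(*p, vdupq_n_f16(0), 0);
}

the V8HF instruction could be created to match the Utv (memory)
alternative, and a later reload that forces operand 1 into a register
would be free to pick the w,w alternative based on the constraints
alone -- giving "fmov h0, h1", which only exists with +fp16.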

So even if we just added the 32-bit and 64-bit register cases,
I think we'd need to separate them from the 8-bit and 16-bit cases.
And like I say, the 8/16-bit pattern would then need to be split
and rewritten if the 16-bit register case was added later.
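
As a rough, untested sketch of the split (VALL_SD here stands for a
hypothetical iterator covering just the modes with 32-bit and 64-bit
elements):

(define_insn "*aarch64_simd_vec_set_low<mode>"
  [(set (match_operand:VALL_SD 0 "register_operand")
        (vec_merge:VALL_SD
          (vec_duplicate:VALL_SD
            (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand"))
          (match_operand:VALL_SD 3 "aarch64_simd_imm_zero")
          (match_operand:SI 2 "const_int_operand")))]
  "TARGET_SIMD
   && ENDIAN_LANE_N (<nunits>, exact_log2 (INTVAL (operands[2]))) == 0"
  {@ [ cons: =0 , 1   ; attrs: type  ]
     [ w        , w   ; neon_move<q> ] fmov\t%<Vetype>0, %<Vetype>1
     [ w        , r   ; neon_from_gp ] fmov\t%<Vetype>0, %<vwcore>1
     [ w        , Utv ; f_loads      ] ldr\t%<Vetype>0, %1
  }
)

with a second pattern for the 8-bit and 16-bit modes that keeps only
the Utv alternative.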

Thanks,
Richard
