================
@@ -1959,9 +2064,12 @@ multiclass VCMLA_ROTS<string type, string lanety, string laneqty> {
 
     let isLaneQ = 1 in  {
       // vcmla{ROT}_laneq
+      // ACLE specifies that the fp16 vcmla_#ROT_laneq variant has an immediate range of 0 <= lane <= 1.
+      // fp16 is the only variant for which the vcmla_laneq and vcmlaq_laneq ranges differ.
+      // https://developer.arm.com/documentation/ihi0073/latest/ 
+      defvar getlanety = !if(!eq(type, "h"), lanety, laneqty);
       def : SOpInst<"vcmla" # ROT # "_laneq", "...QI", type,  Op<(call "vcmla" # ROT, $p0, $p1,
-              (bitcast $p0, (dup_typed lanety, (call "vget_lane", (bitcast laneqty, $p2), $p3))))>>;
-
+                (bitcast $p0, (dup_typed lanety, (call "vget_lane", (bitcast getlanety, $p2), $p3))))>>;
----------------
SpencerAbson wrote:

The `vcmla` intrinsics are instantiated for each base type ([f16, f32, f64]) by this `VCMLA_ROTS` multiclass. For example:

`defm VCMLA_F32        : VCMLA_ROTS<"f", "uint64x1_t", "uint64x2_t">;`

The first vector-type argument (`lanety`) has as many lanes as the `lane` immediate can address in the [vcmlaq_lane](https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vcmlaq_lane) and [vcmla_lane](https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vcmla_lane) intrinsics. The second (`laneqty`) plays the same role for the [vcmlaq_laneq](https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vcmlaq_laneq) and [vcmla_laneq](https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vcmla_laneq) intrinsics. These types give the `vget_lane` call in the instruction definition a vector with the correct number of elements, so the range of the immediate is bounded correctly. (The immediate can only index part of the lower half of the vector, unlike traditional `_lane` intrinsics, where any lane can be selected.)
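
To make the bounding concrete, here is a minimal Python sketch (not part of the patch) that reads the lane count from the type names passed to `VCMLA_ROTS` and derives the immediate range `vget_lane` would accept. The helper names `parse_vector_type` and `lane_range` are hypothetical, and the idea that each integer lane covers one (real, imaginary) pair is an assumption drawn from the doubled element width (64-bit lanes for f32), not something the patch states.

```python
import re
from __future__ import annotations  # noqa: placeholder, see note below
```
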

There is a problem with this approach for the `f16`/`h` base type: the range of the immediate in `vcmlaq_laneq_f16` differs from that in `vcmla_laneq_f16` (`0 <= lane <= 3` and `0 <= lane <= 1` respectively). The intrinsics are instantiated as follows:

`defm VCMLA_FP16  : VCMLA_ROTS<"h", "uint32x2_t", "uint32x4_t">;`

The simplest fix I found for this case was to add the `getlanety` conditional, which selects `uint32x2_t` for `vcmla_laneq_f16` and `uint32x4_t` for `vcmlaq_laneq_f16`, giving the correct range for each immediate.
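
The effect of the `getlanety` conditional can be modelled the same way. This Python sketch is hypothetical throughout: `laneq_type_before` and `laneq_type_after` only mimic the old code and the TableGen `!if(!eq(type, "h"), lanety, laneqty)` choice for the non-q `vcmla{ROT}_laneq` definition, and the sketch assumes the q-form `vcmlaq{ROT}_laneq` definition (not shown in this hunk) keeps bitcasting to `laneqty`.

```python
from __future__ import annotations

import re


def lane_count(vec_type: str) -> int:
    """Lane count encoded in an ACLE integer vector type name, e.g. "uint32x4_t" -> 4."""
    m = re.fullmatch(r"u?int(\d+)x(\d+)_t", vec_type)
    if m is None:
        raise ValueError(f"unrecognised vector type: {vec_type}")
    return int(m.group(2))


def lane_range(vec_type: str) -> tuple[int, int]:
    """Inclusive immediate range vget_lane accepts for a vector of this type."""
    return (0, lane_count(vec_type) - 1)


def laneq_type_before(type_: str, lanety: str, laneqty: str) -> str:
    # Hypothetical model of the old code: vcmla{ROT}_laneq always bitcast $p2 to laneqty.
    return laneqty


def laneq_type_after(type_: str, lanety: str, laneqty: str) -> str:
    # Hypothetical model of: defvar getlanety = !if(!eq(type, "h"), lanety, laneqty);
    return lanety if type_ == "h" else laneqty


# Type arguments taken from the defm lines quoted in this review.
INSTANTIATIONS = {
    "f16": ("h", "uint32x2_t", "uint32x4_t"),
    "f32": ("f", "uint64x1_t", "uint64x2_t"),
}

for name, args in INSTANTIATIONS.items():
    before = lane_range(laneq_type_before(*args))
    after = lane_range(laneq_type_after(*args))
    print(f"vcmla_laneq_{name}: before {before}, after {after}")

# Assumption: the q-form keeps using laneqty, so vcmlaq_laneq_f16 stays at 0..3.
print("vcmlaq_laneq_f16:", lane_range(INSTANTIATIONS["f16"][2]))
```

Run as-is, it prints `vcmla_laneq_f16: before (0, 3), after (0, 1)`, `vcmla_laneq_f32: before (0, 1), after (0, 1)` and `vcmlaq_laneq_f16: (0, 3)`: only the non-q f16 range changes, matching the ranges quoted above.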



https://github.com/llvm/llvm-project/pull/100278
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
