[Bug target/114801] [14/15 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics

avieira at gcc dot gnu.org via Gcc-bugs Fri, 01 Nov 2024 06:19:47 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801


--- Comment #38 from avieira at gcc dot gnu.org ---
> At least if the behavior is either perform the operation on all elements and
> then based on the 16 bits in the predicate choose result between the newly
> computed result and something else on byte by byte basis.  
Yeah, this is what happens for predicated arithmetic operations.

If you want to read up on it yourself, see page 1032 of
https://developer.arm.com/documentation/ddi0553/by?lang=en

That is the section 'Operation for all encodings' in Chapter C2.4, which
describes how a vector add (VADD) works. For MVE that section describes the
operation per 'beat', which is always 32 bits, so for your mental model imagine
that section happening 4 times per vector operation.

Basically, if you look at the pseudo-code, the predicate mask is only used for
'writing back the result'. In other words, it does the addition as if every
element were going to be used, i.e. it does a regular unpredicated vector add.

Then, when it comes to writing back to the result register, it ignores the
size of the elements altogether and for each 'beat' treats the result as a
collection of 4 bytes, i.e. over the course of all 4 beats it treats the
result Q register as 16 bytes. For each of those bytes it looks up the
corresponding bit in the 16-bit predicate mask: if that bit is 1 it overwrites
the byte in the result Q register with the corresponding byte of the result of
the addition, and if it's 0 it leaves the existing byte as is.
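
As a rough mental model only (not the architectural pseudo-code), the
write-back step could be sketched in C as below. The type qreg_t and the
function predicated_writeback are made-up names for illustration, and the
sketch assumes byte 0 is the least significant byte of element 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a Q register: 16 bytes, byte 0 being the least
   significant byte of element 0.  */
typedef struct { uint8_t b[16]; } qreg_t;

/* Merge an already computed, unpredicated `result` into `dest`.
   Bit N of `mask` decides whether byte N of `dest` is overwritten.
   The element size of the operation plays no role here.  */
static void predicated_writeback(qreg_t *dest, const qreg_t *result,
                                 uint16_t mask)
{
    assert(dest != NULL && result != NULL);

    /* 4 beats of 32 bits each, 4 bytes per beat.  */
    for (int beat = 0; beat < 4; beat++) {
        for (int i = 0; i < 4; i++) {
            int byte = beat * 4 + i;
            if ((mask >> byte) & 1u)
                dest->b[byte] = result->b[byte];
            /* else: leave the existing byte as is.  */
        }
    }
}
```

The same function would be used whatever the element size of the arithmetic
was, which is the point: only the mask bits matter at this stage.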

To give an example:
VMSR P0, r0
VPST
VADDT.i32 Q0, Q1, Q2

With
r0 = 0x8181
Q0 = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff}
Q1 = {      0x11,       0x11,       0x11,       0x11}
Q2 = {0xaa000000, 0xaa000000, 0xaa000000, 0xaa000000}

Will lead to Q0 having the following values:
Q0 = {0xffffff11, 0xaaffffff, 0xffffff11, 0xaaffffff}

This is because the mask 0x8181 will only overwrite the least significant byte
of the 32 bits for even elements and the most significant byte for the odd
elements in the result register with the result of Q1 + Q2 = {0xaa000011,
0xaa000011, 0xaa000011, 0xaa000011}.
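
If you want to check the numbers, here is a small C model of the example,
working directly on 32-bit elements. The names lane_byte_mask and
vaddt_i32_model are hypothetical, and this is only a sketch of the behaviour
described here, not how the hardware or GCC implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Expand the 4 predicate bits belonging to 32-bit element `lane` into a
   32-bit mask with 0xff in every byte that is to be overwritten.  */
static uint32_t lane_byte_mask(uint16_t p0, int lane)
{
    assert(lane >= 0 && lane < 4);

    uint32_t m = 0;
    for (int byte = 0; byte < 4; byte++)
        if ((p0 >> (lane * 4 + byte)) & 1u)
            m |= (uint32_t)0xff << (byte * 8);
    return m;
}

/* Model of a predicated 32-bit vector add: do the full add first, then
   merge the sum into q0 byte by byte according to p0.  */
static void vaddt_i32_model(uint32_t q0[4], const uint32_t q1[4],
                            const uint32_t q2[4], uint16_t p0)
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t sum = q1[lane] + q2[lane];       /* unpredicated add */
        uint32_t sel = lane_byte_mask(p0, lane);  /* bytes to overwrite */
        q0[lane] = (sum & sel) | (q0[lane] & ~sel);
    }
}
```

Calling vaddt_i32_model with the Q0, Q1 and Q2 values from the example and
p0 = 0x8181 gives a per-element byte mask of 0x000000ff for elements 0 and 2
and 0xff000000 for elements 1 and 3.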

[Bug target/114801] [14/15 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics
