https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801
--- Comment #38 from avieira at gcc dot gnu.org ---
> At least if the behavior is either perform the operation on all elements and
> then based on the 16 bits in the predicate choose result between the newly
> computed result and something else on byte by byte basis.

Yeah, this is what happens for predicated arithmetic operations. If you want to read up on it yourself, look up page 1032 in
https://developer.arm.com/documentation/ddi0553/by?lang=en

That is the section 'Operation for all encodings' in Chapter C2.4, which describes how a vector add (VADD) works. For MVE that section describes the operation per 'beat', which is always 32 bits, so for your mental model imagine that section happens 4 times per vector operation.

Basically, if you look at the pseudo-code, the predicate mask is only used for writing back the result. In other words, it does the addition as if every element were going to be used, i.e. as a regular unpredicated vector add. Then, when it writes back to the result register, it ignores the element size altogether and for each beat treats the result as a collection of 4 bytes, i.e. over the course of all 4 beats it treats the result Q register as 16 bytes. For each of those bytes it looks up the corresponding bit in the 16-bit predicate mask: if that bit is 1 it overwrites the byte in the result Q register with the same byte of the addition result, and if it's 0 it leaves the existing byte as is.
To give an example:

    VMSR P0, r0
    VPST
    VADDT.i32 Q0, Q1, Q2

With:

    r0 = 0x8181
    Q0 = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff}
    Q1 = {      0x11,       0x11,       0x11,       0x11}
    Q2 = {0xaa000000, 0xaa000000, 0xaa000000, 0xaa000000}

this will lead to Q0 having the following values:

    Q0 = {0xffffff11, 0xaaffffff, 0xffffff11, 0xaaffffff}

This is because the mask 0x8181 will only overwrite the least significant byte of the 32 bits for even elements and the most significant byte for the odd elements in the result register with the result of Q1 + Q2 = {0xaa000011, 0xaa000011, 0xaa000011, 0xaa000011}.
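
For anyone who prefers to see the write-back rule as code, here is a rough C model of the behaviour described above. It is only a sketch and not the architectural pseudo-code: the function name mve_vaddt_i32 is made up for illustration, the Q registers are modelled as four 32-bit words with byte 0 being the least significant byte of element 0, and VPT block state and beat scheduling are ignored entirely.

```c
#include <stdint.h>

/* Hypothetical model of VPST + VADDT.i32 Qd, Qn, Qm.
   q_dst is the destination Q register (4 x 32 bits), updated in place.
   Bit i of p0 controls byte i of the 16-byte destination register. */
void mve_vaddt_i32(uint32_t q_dst[4], const uint32_t q_n[4],
                   const uint32_t q_m[4], uint16_t p0)
{
    for (int beat = 0; beat < 4; beat++) {
        /* The addition itself is unpredicated. */
        uint32_t sum = q_n[beat] + q_m[beat];
        uint32_t merged = q_dst[beat];

        /* Write-back ignores the element size and works byte by byte. */
        for (int byte = 0; byte < 4; byte++) {
            if (p0 & (1u << (beat * 4 + byte))) {
                uint32_t byte_mask = 0xffu << (byte * 8);
                merged = (merged & ~byte_mask) | (sum & byte_mask);
            }
        }
        q_dst[beat] = merged;
    }
}
```

Calling it with the Q0, Q1, Q2 values from the example and p0 = 0x8181 leaves the destination array holding the Q0 result listed above.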