Hi Christophe,

On 28/11/2024 10:22, Christophe Lyon wrote:
The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based
on the input value, but also constrained by a VPT block (if present),
or if used within a DLSTP/LETP loop.

Therefore we need to inform the compiler that this intrinsic reads the
FPCXT register, otherwise it could make incorrect assumptions: for
instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it
would hoist p1 = vctp32q (g) outside of the loop.

We chatted about this offlist but it's good to share here for others too. I do not believe the transformation gcc is doing here is wrong. The transformation we do for test 7, along with some others in the testsuite, relies on analysis we do to check whether masks, that are not the loop predicate mask, used within the loop have a side effect. In other words, any instruction that is not predicated by the loop predicate, be that unpredicated or predicated by another mask, triggers an analysis to check whether the results are used in a safe way. Check the comments above 'arm_mve_impl_predicated_p' in arm.cc

For test7 the non-loop predicate 'p1' is used to predicate a load inside the loop, when dlstp'ed that load will be masked by 'p & p1' instead, which means it could be loading less than initially intended, however, the results of that load are used in a vadd predicated by 'p' which means any values that it would have loaded if not masked by 'p' would have been discarded in the add, so it has no relevant effect.

Furthermore, I also believe the compiler is already aware that VCTP writes P0, given it has an input operand with the predicate 'vpr_register_operand' and the register constraint '=Up'. During DLSTP transformation we rely on reads and writes to such operands to do our transformation and it should also provide other backend passes with enough information.

So I don't think this patch is needed.


Reply via email to