Hi Christophe,
On 28/11/2024 10:22, Christophe Lyon wrote:
The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based
on the input value, but also constrained by a VPT block (if present),
or if used within a DLSTP/LETP loop.
Therefore we need to inform the compiler that this intrinsic reads the
FPCXT register, otherwise it could make incorrect assumptions: for
instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it
would hoist p1 = vctp32q (g) outside of the loop.
We chatted about this offlist but it's good to share here for others
too. I do not believe the transformation gcc is doing here is wrong. The
transformation we do for test 7, along with some others in the
testsuite, relies on analysis we do to check whether masks, that are not
the loop predicate mask, used within the loop have a side effect. In
other words, any instruction that is not predicated by the loop
predicate, be that unpredicated or predicated by another mask, triggers
an analysis to check whether the results are used in a safe way. Check
the comments above 'arm_mve_impl_predicated_p' in arm.cc
For test7 the non-loop predicate 'p1' is used to predicate a load inside
the loop, when dlstp'ed that load will be masked by 'p & p1' instead,
which means it could be loading less than initially intended, however,
the results of that load are used in a vadd predicated by 'p' which
means any values that it would have loaded if not masked by 'p' would
have been discarded in the add, so it has no relevant effect.
Furthermore, I also believe the compiler is already aware that VCTP
writes P0, given it has an input operand with the predicate
'vpr_register_operand' and the register constraint '=Up'. During DLSTP
transformation we rely on reads and writes to such operands to do our
transformation and it should also provide other backend passes with
enough information.
So I don't think this patch is needed.