On 23/06/2023 11:23, Andre Vieira (lists) wrote:
+ if (insn != arm_mve_get_loop_vctp (body))
+ {
probably a good idea to invert the condition here and return false,
helps reducing the indenting in this function.
Done, thanks
+ /* Starting from the current insn, scan backwards through the insn
+ chain until BB_HEAD: "for each insn in the BB prior to the
current".
+ */
There's a trailing whitespace after insn, but also I'd rewrite this
bit. The "for each insn in the BB prior to the current" is superfluous
and even confusing to me. How about:
"Scan backwards from the current INSN through the instruction chain
until the start of the basic block. "
Yes, agreed, it wasn't very clear. Done.
I find 'that previous insn' to be confusing as you don't mention any
previous insn before. So how about something along the lines of:
'If a previous insn defines a register that INSN uses then return true
if...'
Done
Do we need to check: 'insn != prev_insn' ? Any reason why you can't
start the loop with:
'for (rtx_insn *prev_insn = PREV_INSN (insn);'
True! Done.
Now I also found a case where things might go wrong in:
+ /* Look at all the DEFs of that previous insn: if one of them
is on
+ the same REG as our current insn, then recurse in order to
check
+ that insn's USEs. If any of these insns return true as
+ MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is
affected
+ by the change in behaviour from being placed in dlstp/letp
loop.
+ */
+ df_ref prev_insn_defs = NULL;
+ FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+ {
+ if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+ && insn != prev_insn
+ && body == BLOCK_FOR_INSN (prev_insn)
+ && !arm_mve_vec_insn_is_predicated_with_this_predicate
+ (insn, vctp_vpr_generated)
+ && arm_mve_check_df_chain_back_for_implic_predic
+ (prev_insn, vctp_vpr_generated))
+ return true;
+ }
The body == BLOCK_FOR_INSN (prev_insn) hinted me at it, if a def comes
from outside of the BB (so outside of the loop's body) then its by
definition unpredicated by vctp. I think you want to check that if
prev_insn defines a register used by insn then return true if
prev_insn isn't in the same BB or has a chain that is not predicated,
i.e.: '!arm_mve_vec_insn_is_predicated_with_this_predicate (insn,
vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic
prev_insn, vctp_vpr_generated))' you check body != BLOCK_FOR_INSN
(prev_insn)'
Yes, you're right, this is vulnerable here. A neater fix to this (I
think?) is to make the above REGNO_REG_SET_P more generic, so that it
covers all scalar values and scalar ops, as well.
Then it's a "if this insn in the loop has any input that originates
outside the bb, then it's unsafe" check and the recursive loop backwards
is only for the recursive "are any previous insns unsafe"
I also found some other issues, this currently loloops:
uint16_t test (uint16_t *a, int n)
{
uint16_t res =0;
while (n > 0)
{
mve_pred16_t p = vctp16q (n);
uint16x8_t va = vldrhq_u16 (a);
res = vaddvaq_u16 (res, va);
res = vaddvaq_p_u16 (res, va, p);
a += 8;
n -= 8;
}
return res;
}
But it shouldn't, this is because there's a lack of handling of across
vector instructions. Luckily in MVE all across vector instructions
have the side-effect that they write to a scalar register, even the
vshlcq instruction (it writes to a scalar carry output).
Added support for them (you were right, there was some special handling
needed!)
Did this lead me to find an ICE with:
uint16x8_t test (uint16_t *a, int n)
{
uint16x8_t res = vdupq_n_u16 (0);
while (n > 0)
{
uint16_t carry = 0;
mve_pred16_t p = vctp16q (n);
uint16x8_t va = vldrhq_u16 (a);
res = vshlcq_u16 (va, &carry, 1);
res = vshlcq_m_u16 (res, &carry, 1 , p);
a += 8;
n -= 8;
}
return res;
}
This is because:
+ /* If the USE is outside the loop body bb, or it is inside,
but
+ is an unpredicated store to memory. */
+ if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+ || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+ (next_use_insn, vctp_vpr_generated)
+ && mve_memory_operand
+ (SET_DEST (single_set (next_use_insn)),
+ GET_MODE (SET_DEST (single_set (next_use_insn))))))
+ return true;
Assumes single_set doesn't return 0.
Thanks! That is indeed correct.
Corrected this by having a utility function to scan insn operands and
check against mve_memory_operand that supports any number of
operands/SETs in the insn
Let's deal with these issues and I'll continue to review.
On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
Hi all,
This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops. See my previous email
for a general introduction of MVE LOLs.
This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.
Mid-end changes are:
1) Relax the loop-doloop mechanism in the mid-end to allow for
decrement numbers other that -1 and for `count` to be an
rtx containing a simple REG (which in this case will contain
the number of elements to be processed), rather
than an expression for calculating the number of iterations.
2) Added a new df utility function: `df_bb_regno_only_def_find`
that
will return the DEF of a REG only if it is DEF-ed once within
the
basic block.
And many things in the backend to implement the above optimisation:
3) Implement the `arm_predict_doloop_p` target hook to instruct
the
mid-end about Low Overhead Loops (MVE or not), as well as
`arm_loop_unroll_adjust` which will prevent unrolling of any
loops
that are valid for becoming MVE Tail_Predicated Low Overhead
Loops
(unrolling can transform a loop in ways that invalidate the
dlstp/
letp tranformation logic and the benefit of the dlstp/letp loop
would be considerably higher than that of unrolling)
4) Appropriate changes to the define_expand of doloop_end, new
patterns for dlstp and letp, new iterators, unspecs, etc.
5) `arm_mve_loop_valid_for_dlstp` and a number of checking
functions:
* `arm_mve_dlstp_check_dec_counter`
* `arm_mve_dlstp_check_inc_counter`
* `arm_mve_check_reg_origin_is_num_elems`
* `arm_mve_check_df_chain_back_for_implic_predic`
* `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
This all, in smoe way or another, are running checks on the loop
structure in order to determine if the loop is valid for
dlstp/letp
transformation.
6) `arm_attempt_dlstp_transform`: (called from the define_expand of
doloop_end) this function re-checks for the loop's
suitability for
dlstp/letp transformation and then implements it, if possible.
7) Various utility functions:
*`arm_mve_get_vctp_lanes` to map
from vctp unspecs to number of lanes, and
`arm_get_required_vpr_reg`
to check an insn to see if it requires the VPR or not.
* `arm_mve_get_loop_vctp`
* `arm_mve_get_vctp_lanes`
* `arm_emit_mve_unpredicated_insn_to_seq`
* `arm_get_required_vpr_reg`
* `arm_get_required_vpr_reg_param`
* `arm_get_required_vpr_reg_ret_val`
* `arm_mve_vec_insn_is_predicated_with_this_predicate`
* `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?
Thank you,
Stam Markianos-Wright
gcc/ChangeLog:
* config/arm/arm-protos.h (arm_target_insn_ok_for_lob):
Rename to...
(arm_target_bb_ok_for_lob): ...this
(arm_attempt_dlstp_transform): New.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
(TARGET_PREDICT_DOLOOP_P): New.
(arm_block_set_vect):
(arm_target_insn_ok_for_lob): Rename from
arm_target_insn_ok_for_lob.
(arm_target_bb_ok_for_lob): New.
(arm_mve_get_vctp_lanes): New.
(arm_get_required_vpr_reg): New.
(arm_get_required_vpr_reg_param): New.
(arm_get_required_vpr_reg_ret_val): New.
(arm_mve_get_loop_vctp): New.
(arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
(arm_mve_vec_insn_is_predicated_with_this_predicate): New.
(arm_mve_check_df_chain_back_for_implic_predic): New.
(arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
(arm_mve_check_reg_origin_is_num_elems): New.
(arm_mve_dlstp_check_inc_counter): New.
(arm_mve_dlstp_check_dec_counter): New.
(arm_mve_loop_valid_for_dlstp): New.
(arm_predict_doloop_p): New.
(arm_loop_unroll_adjust): New.
(arm_emit_mve_unpredicated_insn_to_seq): New.
(arm_attempt_dlstp_transform): New.
* config/arm/iterators.md (DLSTP): New.
(mode1): Add DLSTP mappings.
* config/arm/mve.md (*predicated_doloop_end_internal): New.
(dlstp<mode1>_insn): New.
* config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
* config/arm/unspecs.md: New unspecs.
* df-core.cc (df_bb_regno_only_def_find): New.
* df.h (df_bb_regno_only_def_find): New.
* loop-doloop.cc (doloop_condition_get): Relax conditions.
(doloop_optimize): Add support for elementwise LoLs.
gcc/testsuite/ChangeLog:
* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.