Hi,

The current vectorizer doesn't support masked loads for SLP. We should add
that support, to allow loops like:

void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      x[i] = y[i] ? z[i] : 1;
      x[i + 1] = y[i + 1] ? z[i + 1] : 2;
    }
}

to be vectorized using contiguous loads rather than LD2 and ST2.
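For comparison, the single-statement form of the same computation already
vectorizes with a contiguous masked load today, since no SLP is involved; the
patch brings the two-statement version above up to the same level. A sketch
(the function g below is mine for illustration, not part of the patch):

```c
/* Single-statement version: the vectorizer already emits a contiguous
   masked load for z[i] here, because the loop body is a single
   conditional statement and no SLP grouping is needed.  */
void
g (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = y[i] ? z[i] : 1;
}
```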

This patch was motivated by SVE, but it is completely generic and should apply
to any architecture with masked loads.

With the patch applied, the compiler generates the following for the example
above (-march=armv8.2-a+sve -O2 -ftree-vectorize):

0000000000000000 <f>:
   0:   7100007f        cmp     w3, #0x0
   4:   540002cd        b.le    5c <f+0x5c>
   8:   51000464        sub     w4, w3, #0x1
   c:   d2800003        mov     x3, #0x0                        // #0
  10:   90000005        adrp    x5, 0 <f>
  14:   25d8e3e0        ptrue   p0.d
  18:   53017c84        lsr     w4, w4, #1
  1c:   910000a5        add     x5, x5, #0x0
  20:   11000484        add     w4, w4, #0x1
  24:   85c0e0a1        ld1rd   {z1.d}, p0/z, [x5]
  28:   2598e3e3        ptrue   p3.s
  2c:   d37ff884        lsl     x4, x4, #1
  30:   25a41fe2        whilelo p2.s, xzr, x4
  34:   d503201f        nop
  38:   a5434820        ld1w    {z0.s}, p2/z, [x1, x3, lsl #2]
  3c:   25808c11        cmpne   p1.s, p3/z, z0.s, #0
  40:   25808810        cmpne   p0.s, p2/z, z0.s, #0
  44:   a5434040        ld1w    {z0.s}, p0/z, [x2, x3, lsl #2]
  48:   05a1c400        sel     z0.s, p1, z0.s, z1.s
  4c:   e5434800        st1w    {z0.s}, p2, [x0, x3, lsl #2]
  50:   04b0e3e3        incw    x3
  54:   25a41c62        whilelo p2.s, x3, x4
  58:   54ffff01        b.ne    38 <f+0x38>  // b.any
  5c:   d65f03c0        ret
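To make the predicated loop at 38..58 easier to read, here is a portable
scalar model of what it computes. This is my own illustrative sketch, not
generated code: f_model and VL are hypothetical names, VL stands in for the
hardware-defined SVE vector length, and n is assumed to be the even element
count the compiled code derives from the trip count. My reading is that the
ld1rd splats the 64-bit {1, 2} constant pair across the vector.

```c
#include <stdint.h>

#define VL 4  /* illustrative lane count; the real SVE VL is hardware-defined */

/* Scalar model of the SVE loop above: whilelo builds the loop
   predicate, ld1w + cmpne builds the condition mask from y[], the
   predicated ld1w is the masked load of z[], and sel merges it with
   the splatted {1, 2} pair loaded by ld1rd.  */
void
f_model (int *restrict x, const int *restrict y,
         const int *restrict z, uint64_t n)
{
  for (uint64_t base = 0; base < n; base += VL)
    {
      int pred[VL], cond[VL], v[VL];
      for (int l = 0; l < VL; l++)
        pred[l] = base + l < n;                 /* whilelo p2.s */
      for (int l = 0; l < VL; l++)              /* ld1w of y[] + cmpne */
        cond[l] = pred[l] && y[base + l] != 0;
      for (int l = 0; l < VL; l++)              /* masked ld1w of z[] + sel */
        v[l] = cond[l] ? z[base + l] : ((base + l) & 1 ? 2 : 1);
      for (int l = 0; l < VL; l++)
        if (pred[l])
          x[base + l] = v[l];                   /* st1w */
    }
}
```

Note that z[base + l] is only read when the lane's condition is true, matching
the semantics of the masked load.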


I tested this patch on an aarch64 machine by bootstrapping the compiler and
running the regression testsuite.

Alejandro

gcc/ChangeLog:

2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

        * config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
        (xorsign<mode>3): Likewise.
        * internal-fn.c: Mark mask_load_direct and mask_store_direct as
        vectorizable.
        * tree-data-ref.c (data_ref_compare_tree): Fix comment typo.
        * tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
        combined even if their masks differ.
        (slp_vect_only_p): New function to detect masked loads that are only
        vectorizable using SLP.
        (vect_analyze_data_ref_accesses): Mark SLP-only vectorizable groups.
        * tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
        dissolve SLP-only vectorizable groups when SLP has been discarded.
        (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when needed.
        * tree-vect-slp.c (vect_get_and_check_slp_defs): Check the masks of
        masked loads.
        (vect_build_slp_tree_1): Fix comment typo.
        (vect_build_slp_tree_2): Include masks from masked loads in the SLP
        tree.
        * tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
        get vec_defs for an operand with optional SLP and vectype.
        (vectorizable_load): Allow vectorization of masked loads for SLP only.
        * tree-vectorizer.h (_stmt_vec_info): Add a flag for SLP-only
        vectorizable statements.
        * tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.

gcc/testsuite/ChangeLog:
2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

        * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for
        SLP-vectorized masked loads.

Attachment: mask_load_slp_1.patch