Hi,

The current vectorizer doesn't support masked loads for SLP. We should add that, to allow things like:

void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      x[i] = y[i] ? z[i] : 1;
      x[i + 1] = y[i + 1] ? z[i + 1] : 2;
    }
}

to be vectorized using contiguous loads rather than LD2 and ST2.

This patch was motivated by SVE, but it is completely generic and should apply to any architecture with masked loads.

After the patch is applied, the above code generates this output (-march=armv8.2-a+sve -O2 -ftree-vectorize):

0000000000000000 <f>:
   0:	7100007f 	cmp	w3, #0x0
   4:	540002cd 	b.le	5c <f+0x5c>
   8:	51000464 	sub	w4, w3, #0x1
   c:	d2800003 	mov	x3, #0x0                   	// #0
  10:	90000005 	adrp	x5, 0 <f>
  14:	25d8e3e0 	ptrue	p0.d
  18:	53017c84 	lsr	w4, w4, #1
  1c:	910000a5 	add	x5, x5, #0x0
  20:	11000484 	add	w4, w4, #0x1
  24:	85c0e0a1 	ld1rd	{z1.d}, p0/z, [x5]
  28:	2598e3e3 	ptrue	p3.s
  2c:	d37ff884 	lsl	x4, x4, #1
  30:	25a41fe2 	whilelo	p2.s, xzr, x4
  34:	d503201f 	nop
  38:	a5434820 	ld1w	{z0.s}, p2/z, [x1, x3, lsl #2]
  3c:	25808c11 	cmpne	p1.s, p3/z, z0.s, #0
  40:	25808810 	cmpne	p0.s, p2/z, z0.s, #0
  44:	a5434040 	ld1w	{z0.s}, p0/z, [x2, x3, lsl #2]
  48:	05a1c400 	sel	z0.s, p1, z0.s, z1.s
  4c:	e5434800 	st1w	{z0.s}, p2, [x0, x3, lsl #2]
  50:	04b0e3e3 	incw	x3
  54:	25a41c62 	whilelo	p2.s, x3, x4
  58:	54ffff01 	b.ne	38 <f+0x38>  // b.any
  5c:	d65f03c0 	ret

I tested this patch on an aarch64 machine by bootstrapping the compiler and running the checks.

Alejandro

gcc/Changelog:

2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

	* config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
	(xorsign<mode>3): Likewise.
	* internal-fn.c: Marked mask_load_direct and mask_store_direct as
	vectorizable.
	* tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
	* tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
	combined even if the masks are different.
	(slp_vect_only_p): New function to detect masked loads that are only
	vectorizable using SLP.
	(vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
	* tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
	dissolve SLP-only vectorizable groups when SLP has been discarded.
	(vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when needed.
	* tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
	masks.
	(vect_build_slp_tree_1): Fixed comment typo.
	(vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
	* tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to
	get vec_defs for operand with optional SLP and vectype.
	(vectorizable_load): Allow vectorization of masked loads for SLP only.
	* tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
	vectorizable.
	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.

gcc/testsuite/Changelog:

2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

	* gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
	vectorized masked loads.

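
For anyone who wants to sanity-check the kernel at run time, here is a minimal sketch of a harness. It is not the mask_load_slp_1.c test from the patch (that one lives only in the attachment); count_mismatches and MAX_N are hypothetical names introduced purely for illustration. The sketch runs f and compares every output element against a plain scalar evaluation of the same conditional, with the fallback constant alternating between 1 and 2 as in the loop body:

```c
#include <assert.h>

/* Kernel from the message above, unchanged apart from layout.  */
void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      x[i] = y[i] ? z[i] : 1;
      x[i + 1] = y[i + 1] ? z[i + 1] : 2;
    }
}

/* Hypothetical upper bound on the test size (not from the patch).  */
#define MAX_N 64

/* Hypothetical helper (not from the patch): run f on N elements and
   return how many outputs differ from a scalar model of the loop.
   N must be even and no larger than MAX_N.  */
int
count_mismatches (int n)
{
  int x[MAX_N], y[MAX_N], z[MAX_N];

  assert (n >= 0 && n <= MAX_N && n % 2 == 0);

  for (int i = 0; i < n; i++)
    {
      x[i] = -1;                    /* sentinel, must be overwritten */
      y[i] = (i % 3 == 0) ? 0 : i;  /* condition is false for every third element */
      z[i] = 100 + i;
    }

  f (x, y, z, n);

  int bad = 0;
  for (int i = 0; i < n; i++)
    {
      /* Even lanes fall back to 1, odd lanes to 2.  */
      int fallback = (i % 2 == 0) ? 1 : 2;
      int expected = y[i] ? z[i] : fallback;
      if (x[i] != expected)
	bad++;
    }
  return bad;
}
```

Calling count_mismatches (64) from main should return 0 both with and without -ftree-vectorize. This only checks the results, so inspecting the disassembly built with the options quoted above is still needed to confirm that contiguous LD1W/ST1W are used rather than LD2 and ST2.
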
mask_load_slp_1.patch