https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116575
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
FAIL: gcc.target/aarch64/sve/mask_struct_load_2.c, for example, fails because
of this.  SLP discovery now correctly produces

mask_struct_load_2.c:39:1: note:   node 0x4ed9940 (max_nunits=16, refcnt=2) vector([16,16]) signed char
mask_struct_load_2.c:39:1: note:   op: VEC_PERM_EXPR
mask_struct_load_2.c:39:1: note:        stmt 0 _7 = .MASK_LOAD (_6, 8B, _21);
mask_struct_load_2.c:39:1: note:        lane permutation { 0[0] }
mask_struct_load_2.c:39:1: note:        children 0x4ed99d8
mask_struct_load_2.c:39:1: note:   node 0x4ed9e00 (max_nunits=16, refcnt=2) vector([16,16]) signed char
mask_struct_load_2.c:39:1: note:   op: VEC_PERM_EXPR
mask_struct_load_2.c:39:1: note:        stmt 0 _11 = .MASK_LOAD (_10, 8B, _21);
mask_struct_load_2.c:39:1: note:        lane permutation { 0[1] }
mask_struct_load_2.c:39:1: note:        children 0x4ed99d8
mask_struct_load_2.c:39:1: note:   node 0x4ed9f30 (max_nunits=16, refcnt=2) vector([16,16]) signed char
mask_struct_load_2.c:39:1: note:   op: VEC_PERM_EXPR
mask_struct_load_2.c:39:1: note:        stmt 0 _16 = .MASK_LOAD (_15, 8B, _21);
mask_struct_load_2.c:39:1: note:        lane permutation { 0[2] }
mask_struct_load_2.c:39:1: note:        children 0x4ed99d8
mask_struct_load_2.c:39:1: note:   node 0x4ed99d8 (max_nunits=16, refcnt=4) vector([16,16]) signed char
mask_struct_load_2.c:39:1: note:   op template: _7 = .MASK_LOAD (_6, 8B, _21);
mask_struct_load_2.c:39:1: note:        stmt 0 _7 = .MASK_LOAD (_6, 8B, _21);
mask_struct_load_2.c:39:1: note:        stmt 1 _11 = .MASK_LOAD (_10, 8B, _21);
mask_struct_load_2.c:39:1: note:        stmt 2 _16 = .MASK_LOAD (_15, 8B, _21);
mask_struct_load_2.c:39:1: note:        children 0x4ed9a70

but this representation is not marked ->ldst_p: it does not require further
lowering (there is no permute on the actual load), and that lowering is what
currently sets the want-to-use-load-lanes flag.  For masked load-lanes we
need some other place to set it; that could be as late as permute
optimization, where we conveniently have backward edges for the SLP graph.
I do not want to set the flag during SLP discovery (which now splits nodes
as seen above).  A sketch of the shape of loop involved and of a possible
check follows below.

FAIL: gcc.target/aarch64/sve/mask_struct_load_1.c fails the same way, though
IMO it is questionable whether ld2 is really profitable there.
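
For reference, the kernel in question has roughly this shape (paraphrased
from the dump above, not the exact testcase source): a masked sum over a
group of three interleaved lanes, i.e. an ld3 candidate.

void
f (signed char *restrict dest, signed char *restrict src,
   signed char *restrict cond, int n)
{
  for (int i = 0; i < n; ++i)
    if (cond[i])
      /* Group of three interleaved accesses under the mask cond[i]:
         the three .MASK_LOADs with lane permutations 0[0], 0[1] and
         0[2] in the dump above.  */
      dest[i] = src[i * 3] + src[i * 3 + 1] + src[i * 3 + 2];
}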
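
And a self-contained model of the kind of check permute optimization could
do, using the backward edges.  All names here (struct node,
want_masked_load_lanes, ...) are made up for illustration and are not the
real vectorizer API: a grouped masked load becomes a load-lanes candidate
when its uses are exactly the single-lane extracts covering each lane of
the group once.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an SLP node; not GCC's slp_tree.  */
enum node_code { NODE_MASK_LOAD, NODE_VEC_PERM };

struct node
{
  enum node_code code;
  int perm_lane;           /* Lane extracted by a single-lane VEC_PERM.  */
  struct node *child;      /* The node the permute reads from.  */
};

/* Return true if LOAD, a grouped masked load with GROUP_SIZE lanes,
   should use load-lanes: every use (found via backward edges) must be
   a single-lane VEC_PERM of LOAD, and together the uses must extract
   each lane exactly once.  */
static bool
want_masked_load_lanes (struct node *const *uses, size_t n_uses,
                        const struct node *load, int group_size)
{
  if (group_size < 2 || group_size > 4   /* ld2, ld3, ld4 only.  */
      || (int) n_uses != group_size)
    return false;
  bool covered[4] = { false, false, false, false };
  for (size_t i = 0; i < n_uses; ++i)
    {
      const struct node *u = uses[i];
      if (u->code != NODE_VEC_PERM
          || u->child != load
          || u->perm_lane < 0
          || u->perm_lane >= group_size
          || covered[u->perm_lane])
        return false;
      covered[u->perm_lane] = true;
    }
  return true;
}

int
main (void)
{
  /* Model the dump above: one grouped MASK_LOAD with three single-lane
     VEC_PERM users extracting lanes 0, 1 and 2.  */
  struct node load = { NODE_MASK_LOAD, -1, NULL };
  struct node p0 = { NODE_VEC_PERM, 0, &load };
  struct node p1 = { NODE_VEC_PERM, 1, &load };
  struct node p2 = { NODE_VEC_PERM, 2, &load };
  struct node *uses[] = { &p0, &p1, &p2 };
  printf ("use ld3: %s\n",
          want_masked_load_lanes (uses, 3, &load, 3) ? "yes" : "no");
  return 0;
}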