https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111005
Bug ID: 111005
Summary: SVE produced code for different type sizes (smaller than int) with comparison in a loop can be improved
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
void __attribute__ ((noipa))
f0 (int *__restrict r,
    int *__restrict a,
    short *__restrict pred)
{
  for (int i = 0; i < 1024; ++i)
    {
      int p = pred[i]?-1:0;
      r[i] = p ;
    }
}

void __attribute__ ((noipa))
f1 (int *__restrict r,
    int *__restrict a,
    short *__restrict pred)
{
  for (int i = 0; i < 1024; ++i)
    {
      int p = pred[i];
      r[i] = p ;
    }
}
```

f1 produces:
```
.L6:
        ld1sh   z31.s, p7/z, [x2, x1, lsl 1]
        st1w    z31.s, p7, [x0, x1, lsl 2]
        incw    x1
        whilelo p7.s, w1, w3
        b.any   .L6
```

While f0 produces:
```
.L2:
        ld1h    z0.h, p0/z, [x2, x1, lsl 1]
        punpklo p2.h, p0.b
        cmpne   p3.h, p1/z, z0.h, #0
        punpkhi p0.h, p0.b
        mov     z0.h, p3/z, #1
        neg     z0.h, p1/m, z0.h
        sunpklo z1.s, z0.h
        sunpkhi z0.s, z0.h
        st1w    z1.s, p2, [x0, x1, lsl 2]
        st1w    z0.s, p0, [x4, x1, lsl 2]
        inch    x1
        whilelo p0.h, w1, w3
        b.any   .L2
```

While it should produce:
```
.L6:
        ld1sh   z31.s, p7/z, [x2, x1, lsl 1]
        cmpne   p1.s, p7/z, z31.s, #0
        mov     z31.s, p1/z, #-1 // =0xffffffffffffffff
        st1w    z31.s, p7, [x0, x1, lsl 2]
        incw    x1
        whilelo p7.s, w1, w3
        b.any   .L6
```

That is:
sign extend load
compare-not-equal to 0; setting p1
set z31 to -1 or 0 based on p1
store z31

But instead we push to do unpacking from VN2HI to VHI ...
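
To make the desired order of operations concrete, here is a minimal scalar sketch of the sequence described above: widen each element first, then compare and select at int width. The names `f0_ref`, `f0_widen_first` and `forms_agree` are hypothetical and not part of the report; the unused `a` parameter is dropped and an explicit length `n` is added for illustration. This only shows that the two forms compute the same values. It makes no claim about what code GCC emits for either form.

```c
#include <stddef.h>

/* Same computation as f0 in the report: the test is on the short value.
   (Hypothetical name; 'a' dropped, length passed explicitly.) */
void f0_ref(int *restrict r, const short *restrict pred, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        r[i] = pred[i] ? -1 : 0;
}

/* Hypothetical rewrite following the steps listed in the report:
   widen first, then compare and select at int width. */
void f0_widen_first(int *restrict r, const short *restrict pred, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int wide = pred[i];        /* sign-extending load (cf. ld1sh)      */
        int mask = -(wide != 0);   /* compare-not-equal to 0, then -1 or 0 */
        r[i] = mask;               /* store (cf. st1w)                     */
    }
}

/* Returns 1 if both forms agree on a small set of inputs, else 0. */
int forms_agree(void)
{
    enum { N = 8 };
    const short pred[N] = { 0, 1, -1, 2, 0, 32767, -32768, 0 };
    int a[N], b[N];

    f0_ref(a, pred, N);
    f0_widen_first(b, pred, N);

    for (size_t i = 0; i < N; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Because sign extension never turns a nonzero short into zero (or the reverse), comparing after the widening load gives the same predicate as comparing at halfword width, which is why the single-vector `.s` sequence is a valid replacement for the unpack-based one.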