https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 12 Sep 2017, peter at cordes dot ca wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
>
> --- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
> (In reply to Richard Biener from comment #1)
> > Interesting idea.  It's probably a bit hard to make the vectorizer do this
> > though given its current structure and the fact that it would have to
> > cost the extra ops against the saved shuffling (the extra cost of the ops
> > would depend on availability of spare execution resources for example).
>
> Shuffles take execution resources just like anything else (except for
> load+broadcast (vbroadcastss or movddup, or even movshdup) which is done as
> part of a load uop in Ryzen and Intel CPUs).

I was concerned about cases where we don't have just one operation but a
computation sequence, where you go from SHUF + n * op + SHUF to
n * 2 * op + BLEND.

GCC already knows how to handle a mix of two operators plus a blend for
addsubpd support -- it just does the blending in a possibly unwanted
position, and nothing later optimizes shuffles through transparent
operations.  Extending this support to handle plus and mult yields

pairs_double:
.LFB0:
        .cfi_startproc
        leaq    81920(%rdi), %rax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2
        vzeroupper

The vectorizer generates

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.17_24 = PHI <ivtmp.17_2(2), ivtmp.17_1(3)>
  _3 = (void *) ivtmp.17_24;
  vect_x_14.2_23 = MEM[(double *)_3];
  vect_x_14.3_22 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 0, 0, 2, 2 }>;
  vect_y_15.7_10 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 1, 1, 3, 3 }>;
  vect__7.8_9 = vect_y_15.7_10 * vect_x_14.3_22;
  vect__7.9_40 = vect_y_15.7_10 + vect_x_14.3_22;
  _35 = VEC_PERM_EXPR <vect__7.8_9, vect__7.9_40, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _35;
  ivtmp.17_1 = ivtmp.17_24 + 32;
  if (ivtmp.17_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

in this case.  I am not sure whether this is optimal.  The question now is
whether it is profitable to enable this for more than the plus/minus
combination in general.
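For reference, the scalar loop behind the asm and GIMPLE above is presumably
the simpler variant of the testcase quoted further below, with just one
multiply/add pair per iteration.  The following is a reconstruction for
illustration, not a copy from the report:

static const int aligned = 1;

void pairs_double(double blocks[])
{
  if (aligned)
    blocks = __builtin_assume_aligned(blocks, 64);
  for (int i = 0; i < 10240; i += 2) {
      double x = blocks[i];
      double y = blocks[i+1];
      blocks[i]   = x * y;   /* even element: product */
      blocks[i+1] = x + y;   /* odd element: sum */
  }
}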
As I noted, GCC does a poor job optimizing, for example,

static const int aligned = 1;

void pairs_double(double blocks[])
{
  if (aligned)
    blocks = __builtin_assume_aligned(blocks, 64);
  for (int i = 0; i < 10240; i += 2) {
      double x = blocks[i];
      double y = blocks[i+1];
      double tem = x * y;
      double tem2 = x + y;
      tem = tem + 3;
      tem2 = tem2 * 2;
      blocks[i] = tem;
      blocks[i+1] = tem2;
  }
}

which ends up as

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.19_26 = PHI <ivtmp.19_2(2), ivtmp.19_1(3)>
  _3 = (void *) ivtmp.19_26;
  vect_x_12.2_31 = MEM[(double *)_3];
  vect_x_12.3_30 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 0, 0, 2, 2 }>;
  vect_y_13.7_24 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 1, 1, 3, 3 }>;
  vect_tem_14.8_23 = vect_y_13.7_24 * vect_x_12.3_30;
  vect_tem_14.9_22 = vect_y_13.7_24 + vect_x_12.3_30;
  _21 = VEC_PERM_EXPR <vect_tem_14.8_23, vect_tem_14.9_22, { 0, 5, 2, 7 }>;
  vect_tem_16.10_7 = _21 + { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  vect_tem_16.11_41 = _21 * { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  _42 = VEC_PERM_EXPR <vect_tem_16.10_7, vect_tem_16.11_41, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _42;
  ivtmp.19_1 = ivtmp.19_26 + 32;
  if (ivtmp.19_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

and thus

.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vaddpd  %ymm3, %ymm0, %ymm1
        vmulpd  %ymm3, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2

We are missing a pass that would move and combine VEC_PERM_EXPRs across
operations (a sketch of the combined form follows below).

Thanks for the detailed write-up on how AVX512 works compared to AVX2; it
does indeed look like an improvement!  (until AMD cripples it by splitting
it into two 256-bit halves... ;))
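For concreteness, here is a hand-written AVX2 intrinsics sketch (an editorial
addition, not part of the report; the function name pairs_double_combined is
made up) of what combining the two VEC_PERM_EXPRs across the +3/*2 operations
would amount to for the second testcase: the constants are applied to the
un-blended product/sum vectors, so only a single blend per iteration remains.

#include <immintrin.h>

void pairs_double_combined(double blocks[])
{
  blocks = __builtin_assume_aligned(blocks, 64);
  const __m256d c3 = _mm256_set1_pd(3.0);
  const __m256d c2 = _mm256_set1_pd(2.0);
  for (int i = 0; i < 10240; i += 4) {            /* two x/y pairs per vector */
      __m256d v  = _mm256_load_pd(&blocks[i]);    /* { x0, y0, x1, y1 } */
      __m256d xx = _mm256_permute4x64_pd(v, 0xa0);  /* { x0, x0, x1, x1 } */
      __m256d yy = _mm256_permute4x64_pd(v, 0xf5);  /* { y0, y0, y1, y1 } */
      /* fold the +3 and *2 into the lane-duplicated results ...            */
      __m256d prod = _mm256_add_pd(_mm256_mul_pd(xx, yy), c3);  /* x*y + 3   */
      __m256d sum  = _mm256_mul_pd(_mm256_add_pd(xx, yy), c2);  /* (x+y) * 2 */
      /* ... so one blend picks even lanes from prod and odd lanes from sum  */
      __m256d out  = _mm256_shuffle_pd(prod, sum, 0xa);
      _mm256_store_pd(&blocks[i], out);
  }
}

Compared to the generated code above this drops one of the two vshufpd blends
and the dependency between them; whether that is actually a win on a given
core is exactly the costing question raised in comment #1.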