https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
--- Comment #13 from Maxim Egorushkin <maxim.yegorushkin at gmail dot com> --- (In reply to Andrew Pinski from comment #11) > Let me try again: > > So we have: > __v4di v4 = ymm0 > __v2di tmp = _mm256_extracti128_si256(v4, 1); // vextracti128 > __v2di tmp1 = _mm256_castsi256_si128(v4); // subreg > __v2di v2 = tmp + tmp1; > __v2di v3 = _mm_shuffle_epi32(v2, 0b1110); > __v2di res = v2 + v3; > > > So the register allocator allocates res to xmm0. > then v3 to xmm1, and v2 to xmm0. > > And then this is where the problem comes in. > tmp gets allocated to xmm0 and tmp1 gets allocated to xmm1. > Then you will need a move from ymm0 to ymm1 (or xmm0 to xmm1) added because > tmp needs to be in xmm0 but v4 was in ymm0. > > This is why the extra register copy (move) comes from. The zeroing effect of > the instruction is just a side effect of the instruction rather than > anything else. > > > Now if the register allocator allocates tmp to xmm1 and tmp1 to xmm0, then > there is no extra move needed and there is no conflict between xmm0 and v4. > > Does that make sense now? I can follow the cause and effect chain you describe, thank you for taking your time to explain it to me, Andrew. But I still cannot understand how allocating 16 available xmm registers to 6 intermediate results of the instructions involved here is an NP hard problem. If another available xmm register were allocated for each intermediate value in the most naive fashion here, that wouldn't require unnecessary register moves, would it? If the register allocation cannot solve a seemingly NP easy problem, is it reasonable to expect that it does a better job on a truly hard NP problem? Do such sub-optimal register allocation decisions, in seemingly trivial functions like these, cancel each other in functions with more variables to allocate to fewer available registers? Or do sub-optimal register allocation decisions add up and escalate into larger problems instead of cancelling each other? What, ultimately, gives confidence that despite sub-optimal register allocations for small trivial functions, in more complex functions with more state to maintain, the register allocation does a better job?