https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116
--- Comment #17 from amker at gcc dot gnu.org ---
(In reply to Andrew Senkevich from comment #16)
> (In reply to amker from comment #13)
> > We should create another PR for additional copy instructions after my patch
> > and close this one. IMHO they are two different issues.
>
> I agree, currently there are no fills from stack on both testcases for which
> this PR was created.
> But I have no bugzilla permissions to close it, could somebody from CC close
> it please?
>
> (In reply to Pat Haugen from comment #14)
> . . .
> > Additional info, it's really just one copy introduced, but becomes 4 after
> > unrolling. This is the loop from the first testcase without -funroll-loops.
> > Looks like we could get rid of the vmovaps by making zmm2 the dest on the
> > vpermps (assuming I'm understanding the asm correctly).
> >
> > .L26:
> >         vpermps (%rcx), %zmm10, %zmm1
> >         leal    1(%rsi), %esi
> >         vmovaps %zmm1, %zmm2
> >         vmaxps  (%r15,%rdx), %zmm3, %zmm1
> >         vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
> >         cmpl    %esi, %r8d
> >         leaq    -64(%rcx), %rcx
> >         vmaxps  %zmm1, %zmm2, %zmm1
> >         vmovups %zmm1, (%rdi,%rdx)
> >         leaq    64(%rdx), %rdx
> >         ja      .L26
>
> Looks like so. For which optimization/analysis we should file ticket for it?

Filing it under rtl-optimization would be fine. I can close this ticket once
the new one is created.

Thanks.