https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
on aarch64/arm. Ah, and we are even looking at the same function...

I think this is a general issue caused by an inconsistency between the
tree-level ivopt pass and the RTL-level loop unroller. To be specific, it is
about how we handle induction variable registers after unrolling.

The core loop on aarch64 with options "-O3 -funroll-all-loops
-mcpu=cortex-a57" gave the output below:

.L3:
        add     x2, x0, 16
        ldr     q16, [x17, x0]
        add     x10, x0, 32
        add     x9, x0, 48
        add     x8, x0, 64
        ldr     q17, [x17, x2]
        add     x3, x0, 80
        add     x6, x0, 96
        add     x5, x0, 112
        add     w1, w1, 8
        ldr     q19, [x17, x10]
        cmp     w1, w14
        ldr     q18, [x17, x9]
        ldr     q20, [x17, x8]
        ldr     q21, [x17, x3]
        ldr     q22, [x17, x6]
        ldr     q23, [x17, x5]
        str     q16, [x18, x0]
        add     x0, x0, 128
        str     q17, [x18, x2]
        str     q19, [x18, x10]
        str     q18, [x18, x9]
        str     q20, [x18, x8]
        str     q21, [x18, x3]
        str     q22, [x18, x6]
        str     q23, [x18, x5]
        bcc     .L3

The tree ivopt dump is quite neat:

  <bb 6>:
  # ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
  # ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
  vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
  MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
  ivtmp.16_25 = ivtmp.16_28 + 1;
  ivtmp.19_41 = ivtmp.19_42 + 16;
  if (ivtmp.16_25 < bnd.7_36)
    goto <bb 9>;
  else
    goto <bb 7>;
...
  <bb 9>:
  goto <bb 6>;

But after the RTL unroller, options like "-fsplit-ivs-in-unroller" and "-fweb"
come into play. These two options try to split the long live range of each
induction variable into separate ones. Eventually, after the following fwprop
and IRA passes, we have multiple ivs for each original iv: in the loop above,
the single index x0 has turned into seven extra index registers (x2, x10, x9,
x8, x3, x6, x5), one per unrolled copy.

I see two possible fixes here. One is to implement a tree-level unroller
before IVOPT and remove the RTL one.
The RTL one is rather aggressive, which is why we don't enable it by default
at "-O3".

The other is to change how we handle unrolled ivs in the RTL unroller. It
splits each unrolled iv to avoid a pseudo register with a long live range,
since that may hamper the RTL optimizers. This assumption may have held
before, but it does not seem true to me nowadays, especially for induction
variables. In tree-level ivopts we already assume that each iv occupies a
register; ivs are also used intensively, so each should live in one single
hard register.

For this specific case, we can factor the [base+index] computation out of the
memory references and use [new_base], [new_base+4], [new_base+8], ... etc. in
the unrolled body. If tree ivopts chooses the [reg+offset] addressing mode, we
only need to generate an instruction sequence like:

  [reg+offset], [reg+(offset+4)], [reg+(offset+8)], ...
  reg = reg + unrolled_times*step

Thanks,
bin
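
To make the difference between the two addressing schemes concrete, here is a
rough source-level analogy in C. It is only a sketch, not what the unroller
actually emits. The names (vec16, copy_split_ivs, copy_single_iv,
variants_agree) and the unroll factor of 4 are made up for illustration, and
an optimizing compiler is free to rewrite either form. The first function
mimics the split-iv shape, with one index per unrolled copy; the second keeps
one pointer per array, addresses the copies with constant offsets, and bumps
the pointer once per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one 16-byte vector (one q register). */
typedef struct { unsigned char b[16]; } vec16;

enum { UNROLL = 4 };  /* illustrative unroll factor, not taken from the PR */

/* Split-iv shape: each unrolled copy has its own index value
   (i1, i2, i3), so each copy wants its own register. */
void copy_split_ivs(vec16 *dst, const vec16 *src, size_t n)
{
    assert(n % UNROLL == 0);
    for (size_t i0 = 0; i0 < n; i0 += UNROLL) {
        size_t i1 = i0 + 1;
        size_t i2 = i0 + 2;
        size_t i3 = i0 + 3;
        dst[i0] = src[i0];
        dst[i1] = src[i1];
        dst[i2] = src[i2];
        dst[i3] = src[i3];
    }
}

/* Single-iv shape: one moving base per array, constant offsets for the
   unrolled copies, and a single update of step * UNROLL per iteration. */
void copy_single_iv(vec16 *dst, const vec16 *src, size_t n)
{
    assert(n % UNROLL == 0);
    const vec16 *s = src;
    const vec16 *end = src + n;
    vec16 *d = dst;
    while (s != end) {
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
        s += UNROLL;   /* reg = reg + unrolled_times * step */
        d += UNROLL;
    }
}

/* Returns 1 if both shapes copy the input exactly, 0 otherwise. */
int variants_agree(void)
{
    enum { N = 16 };
    static vec16 src[N], a[N], b[N];
    for (size_t i = 0; i < N; i++)
        memset(src[i].b, (int)(i + 1), sizeof src[i].b);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    copy_split_ivs(a, src, N);
    copy_single_iv(b, src, N);
    return memcmp(a, src, sizeof src) == 0
        && memcmp(b, src, sizeof src) == 0;
}
```

Both functions do the same work. The second form costs one base register per
array instead of one shared index, but the number of live address registers
no longer grows with the unroll factor.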