https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79826
Bug ID: 79826
Summary: Unnecessary spills in vectorised loop version
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
CC: amker at gcc dot gnu.org, rguenth at gcc dot gnu.org
Target Milestone: ---

Created attachment 40877
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40877&action=edit
Testcase

The attached testcase is a sort of unrolled memcpy.
The vectorised version of the loop has suboptimal register allocation (-O3 on aarch64):

.L4:
        ldr     q31, [x0]
        add     x2, x2, 1
        cmp     x2, x8
        ldr     q30, [x0, 16]
        ldr     q29, [x0, 32]
        ldr     q28, [x0, 48]
        ldr     q27, [x0, 64]
        ldr     q26, [x0, 80]
        ldr     q25, [x0, 96]
        ldr     q24, [x0, 112]
        ldr     q23, [x0, 128]
        ldr     q22, [x0, 144]
        ldr     q21, [x0, 160]
        ldr     q20, [x0, 176]
        ldr     q19, [x0, 192]
        ldr     q18, [x0, 208]
        ldr     q17, [x0, 224]
        ldr     q16, [x0, 240]
        ldr     q15, [x0, 256]
        add     x0, x0, 528
        ldr     q14, [x0, -256]
        ldr     q13, [x0, -240]
        ldr     q12, [x0, -224]
        ldr     q11, [x0, -208]
        ldr     q10, [x0, -192]
        ldr     q9, [x0, -176]
        ldr     q8, [x0, -160]
        ldr     q7, [x0, -144]
        ldr     q6, [x0, -128]
        ldr     q5, [x0, -112]
        ldr     q4, [x0, -96]
        ldr     q3, [x0, -80]
        ldr     q2, [x0, -64]
        ldr     q1, [x0, -48]
        ldr     q0, [x0, -32]
        str     q0, [sp, 64]    // <------- spilling!
        ldr     q0, [x0, -16]
        str     q31, [x1]
        str     q30, [x1, 16]
        str     q29, [x1, 32]
        str     q28, [x1, 48]
        str     q27, [x1, 64]
        str     q26, [x1, 80]
        str     q25, [x1, 96]
        str     q24, [x1, 112]
        str     q23, [x1, 128]
        str     q22, [x1, 144]
        str     q21, [x1, 160]
        str     q20, [x1, 176]
        str     q19, [x1, 192]
        str     q18, [x1, 208]
        str     q17, [x1, 224]
        str     q16, [x1, 240]
        str     q15, [x1, 256]
        add     x1, x1, 528
        str     q14, [x1, -256]
        str     q13, [x1, -240]
        str     q12, [x1, -224]
        str     q11, [x1, -208]
        str     q10, [x1, -192]
        str     q9, [x1, -176]
        str     q8, [x1, -160]
        str     q7, [x1, -144]
        str     q6, [x1, -128]
        str     q5, [x1, -112]
        str     q4, [x1, -96]
        str     q3, [x1, -80]
        str     q2, [x1, -64]
        str     q1, [x1, -48]
        ldr     q31, [sp, 64]
        str     q0, [x1, -16]
        str     q31, [x1, -32]
        bcc     .L4

It uses too many registers and ends up spilling where really it shouldn't.
It could just load and store one or two vector registers at a time and also interleave the loads and stores.
The problem is that the scheduler cannot interleave the loads and stores.
With -fsched-verbose=5 in the sched1 dump I see that there is an unexpected dependency between the stores and the last load in the load section:

;;   --- Region Dependences --- b 5 bb 0
;;      insn  code    bb   dep  prio  cost   reservation
;;      ----  ----    --   ---  ----  ----   -----------
<more loads>
;;      161  1016     5     0     8     6   ca57_load_model : 212m 208 201 171n
;;      163  1016     5     0     8     6   ca57_load_model : 212m 208 202 171n
;;      165  1016     5     0     8     6   ca57_load_model : 212m 208 203 171n
;;      167  1016     5     0     8     6   ca57_load_model : 212m 208 204 171n
;;      169  1016     5     0     7     6   ca57_load_model : 212m 208 205 171n
;;      171  1016     5    32     7     6   ca57_load_model : 212m 208 206 205nm 204nm 203nm 202nm 201nm 200nm 199nm 198nm 197nm 196nm 195nm 194nm 193nm 192nm 191nm 190nm 189nm 188nm 187nm 186nm 185nm 184nm 183nm 182nm 181nm 180nm 179nm 178nm 177nm 176nm 175nm 174nm
;;      174  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      175  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      176  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      177  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      178  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      179  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      180  1016     5     2     2     0   ca57_store_model : 212 209 205n
;;      181  1016     5     2     2     0   ca57_store_model : 212 209 205n
<more stores>

Note how the last vector load (insn 171) says that all the stores following it depend on it, so the scheduler cannot move any of the stores ahead of it. This hurts the live range reduction algorithms and leads to suboptimal register allocation.

I think the vectoriser tagged the loading part of the section with some alias set that makes the stores conflict with it, even though this is the versioned part of the loop that assumes the source and destination don't alias.

If I mark the p and dst pointers as restrict then this problem doesn't appear. But I think that even without restrict the problem shouldn't occur, since the vectorised version of the loop already assumes the pointers don't alias.
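
For illustration, here is a minimal sketch of the restrict variant mentioned above. The attachment is not reproduced here, so the function name copy_chunks, the CHUNK_BYTES constant and the byte-wise inner loop are assumptions made for this example. Only the pointer names p and dst and the 528-byte stride per iteration (33 x 16-byte q-register loads and stores in the loop above) come from the report. I have not checked that this exact code produces the assembly shown.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached testcase (assumption, not the
   real attachment): each outer iteration copies one 528-byte chunk,
   matching the "add x0, x0, 528" stride in the loop above. */
#define CHUNK_BYTES 528

/* `restrict` promises the compiler that dst and p never overlap, so no
   runtime alias check (loop versioning) is needed to vectorise the copy. */
void copy_chunks(unsigned char *restrict dst,
                 const unsigned char *restrict p,
                 size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Copy one chunk byte by byte; the vectoriser is expected to
           turn this into 16-byte vector loads and stores. */
        for (size_t j = 0; j < CHUNK_BYTES; j++)
            dst[j] = p[j];
        dst += CHUNK_BYTES;
        p += CHUNK_BYTES;
    }
}
```

Dropping both restrict qualifiers gives the non-restrict form, where the compiler has to version the loop on a runtime overlap check. Comparing the sched1 dumps of the two forms at -O3 should show whether the extra load-to-store dependencies are still present.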