https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79826
Bug ID: 79826
Summary: Unnecessary spills in vectorised loop version
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
CC: amker at gcc dot gnu.org, rguenth at gcc dot gnu.org
Target Milestone: ---
Created attachment 40877
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40877&action=edit
Testcase
The attached testcase is a sort of unrolled memcpy.
The vectorised version of the loop has suboptimal register allocation (-O3 on
aarch64):
.L4:
ldr q31, [x0]
add x2, x2, 1
cmp x2, x8
ldr q30, [x0, 16]
ldr q29, [x0, 32]
ldr q28, [x0, 48]
ldr q27, [x0, 64]
ldr q26, [x0, 80]
ldr q25, [x0, 96]
ldr q24, [x0, 112]
ldr q23, [x0, 128]
ldr q22, [x0, 144]
ldr q21, [x0, 160]
ldr q20, [x0, 176]
ldr q19, [x0, 192]
ldr q18, [x0, 208]
ldr q17, [x0, 224]
ldr q16, [x0, 240]
ldr q15, [x0, 256]
add x0, x0, 528
ldr q14, [x0, -256]
ldr q13, [x0, -240]
ldr q12, [x0, -224]
ldr q11, [x0, -208]
ldr q10, [x0, -192]
ldr q9, [x0, -176]
ldr q8, [x0, -160]
ldr q7, [x0, -144]
ldr q6, [x0, -128]
ldr q5, [x0, -112]
ldr q4, [x0, -96]
ldr q3, [x0, -80]
ldr q2, [x0, -64]
ldr q1, [x0, -48]
ldr q0, [x0, -32]
str q0, [sp, 64] //<------- spilling!
ldr q0, [x0, -16]
str q31, [x1]
str q30, [x1, 16]
str q29, [x1, 32]
str q28, [x1, 48]
str q27, [x1, 64]
str q26, [x1, 80]
str q25, [x1, 96]
str q24, [x1, 112]
str q23, [x1, 128]
str q22, [x1, 144]
str q21, [x1, 160]
str q20, [x1, 176]
str q19, [x1, 192]
str q18, [x1, 208]
str q17, [x1, 224]
str q16, [x1, 240]
str q15, [x1, 256]
add x1, x1, 528
str q14, [x1, -256]
str q13, [x1, -240]
str q12, [x1, -224]
str q11, [x1, -208]
str q10, [x1, -192]
str q9, [x1, -176]
str q8, [x1, -160]
str q7, [x1, -144]
str q6, [x1, -128]
str q5, [x1, -112]
str q4, [x1, -96]
str q3, [x1, -80]
str q2, [x1, -64]
str q1, [x1, -48]
ldr q31, [sp, 64]
str q0, [x1, -16]
str q31, [x1, -32]
bcc .L4
It loads all 33 vectors (528 bytes) before storing any of them, so with only 32
vector registers available it ends up spilling where really it shouldn't. It
could just load and store one or two vector registers at a time and also
interleave the loads and stores.
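As a rough illustration of the schedule shape meant here, the C sketch below
copies one 16-byte chunk at a time, so each value is stored right after it is
loaded and only one temporary is live at any point. The function name
copy_interleaved and the CHUNK constant are hypothetical and not taken from the
attached testcase, and nothing guarantees that a compiler maps each memcpy pair
to a single ldr/str pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 16 };  /* bytes in one 128-bit q register */

/* Hypothetical helper, not the attached testcase.
   Copies nchunks * CHUNK bytes; src and dst must not overlap. */
void copy_interleaved(unsigned char *dst, const unsigned char *src,
                      size_t nchunks)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nchunks; i++) {
        unsigned char tmp[CHUNK];
        /* Load one chunk (ideally one "ldr q0, [x0, ...]") ... */
        memcpy(tmp, src + i * CHUNK, CHUNK);
        /* ... and store it straight away (ideally "str q0, [x1, ...]"),
           so the temporary is dead before the next load. */
        memcpy(dst + i * CHUNK, tmp, CHUNK);
    }
}
```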
The problem is that the scheduler cannot interleave the loads and stores.
With -fsched-verbose=5 in the sched1 dump I see that there is an unexpected
dependency between the stores and the last load in the load section:
;; --- Region Dependences --- b 5 bb 0
;; insn code bb dep prio cost reservation
;; ---- ---- -- --- ---- ---- -----------
<more loads>
;; 161 1016 5 0 8 6 ca57_load_model : 212m 208 201 171n
;; 163 1016 5 0 8 6 ca57_load_model : 212m 208 202 171n
;; 165 1016 5 0 8 6 ca57_load_model : 212m 208 203 171n
;; 167 1016 5 0 8 6 ca57_load_model : 212m 208 204 171n
;; 169 1016 5 0 7 6 ca57_load_model : 212m 208 205 171n
;; 171 1016 5 32 7 6 ca57_load_model : 212m 208 206 205nm 204nm 203nm 202nm 201nm 200nm 199nm 198nm 197nm 196nm 195nm 194nm 193nm 192nm 191nm 190nm 189nm 188nm 187nm 186nm 185nm 184nm 183nm 182nm 181nm 180nm 179nm 178nm 177nm 176nm 175nm 174nm
;; 174 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 175 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 176 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 177 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 178 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 179 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 180 1016 5 2 2 0 ca57_store_model : 212 209 205n
;; 181 1016 5 2 2 0 ca57_store_model : 212 209 205n
<more stores>
Note how the last vector load (insn 171) says that all the stores following it
depend on it, so the scheduler cannot reorder any of the stores past it, which
hurts the live range reduction algorithms and leads to suboptimal register
allocation.
I think the vectoriser tagged the loading part of the section with some alias
set that makes the stores conflict with it, even though this is the versioned
part of the loop that assumes the source and destination don't alias.
If I mark the p and dst pointers as restrict then this problem doesn't appear.
But I think that even without restrict it shouldn't exhibit this problem, since
the vectorised version already assumes the pointers don't alias.
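For reference, the restrict workaround looks roughly like the sketch below.
This is not the attached testcase: the function name copy_block, the element
type and the unroll factor of 4 are assumptions made for illustration. Only the
pointer names p and dst come from the report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the attached testcase (not the real one).
   restrict promises the compiler that dst and p never overlap, so no
   store to dst can change a value later loaded through p. */
void copy_block(unsigned long long *restrict dst,
                const unsigned long long *restrict p, size_t n)
{
    assert(n % 4 == 0);  /* the sketch only handles whole unrolled groups */
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = p[i];
        dst[i + 1] = p[i + 1];
        dst[i + 2] = p[i + 2];
        dst[i + 3] = p[i + 3];
    }
}
```

Dropping both restrict qualifiers gives the form this report is about, where
the vectoriser has to version the loop on a runtime overlap check instead.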