https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79826

            Bug ID: 79826
           Summary: Unnecessary spills in vectorised loop version
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
                CC: amker at gcc dot gnu.org, rguenth at gcc dot gnu.org
  Target Milestone: ---

Created attachment 40877
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40877&action=edit
Testcase

The attached testcase is a sort of unrolled memcpy.
The vectorised version of the loop has suboptimal register allocation (-O3 on
aarch64):
.L4:
        ldr     q31, [x0]
        add     x2, x2, 1
        cmp     x2, x8
        ldr     q30, [x0, 16]
        ldr     q29, [x0, 32]
        ldr     q28, [x0, 48]
        ldr     q27, [x0, 64]
        ldr     q26, [x0, 80]
        ldr     q25, [x0, 96]
        ldr     q24, [x0, 112]
        ldr     q23, [x0, 128]
        ldr     q22, [x0, 144]
        ldr     q21, [x0, 160]
        ldr     q20, [x0, 176]
        ldr     q19, [x0, 192]
        ldr     q18, [x0, 208]
        ldr     q17, [x0, 224]
        ldr     q16, [x0, 240]
        ldr     q15, [x0, 256]
        add     x0, x0, 528
        ldr     q14, [x0, -256]
        ldr     q13, [x0, -240]
        ldr     q12, [x0, -224]
        ldr     q11, [x0, -208]
        ldr     q10, [x0, -192]
        ldr     q9, [x0, -176]
        ldr     q8, [x0, -160]
        ldr     q7, [x0, -144]
        ldr     q6, [x0, -128]
        ldr     q5, [x0, -112]
        ldr     q4, [x0, -96]
        ldr     q3, [x0, -80]
        ldr     q2, [x0, -64]
        ldr     q1, [x0, -48]
        ldr     q0, [x0, -32]
        str     q0, [sp, 64] //<------- spilling!
        ldr     q0, [x0, -16]
        str     q31, [x1]
        str     q30, [x1, 16]
        str     q29, [x1, 32]
        str     q28, [x1, 48]
        str     q27, [x1, 64]
        str     q26, [x1, 80]
        str     q25, [x1, 96]
        str     q24, [x1, 112]
        str     q23, [x1, 128]
        str     q22, [x1, 144]
        str     q21, [x1, 160]
        str     q20, [x1, 176]
        str     q19, [x1, 192]
        str     q18, [x1, 208]
        str     q17, [x1, 224]
        str     q16, [x1, 240]
        str     q15, [x1, 256]
        add     x1, x1, 528
        str     q14, [x1, -256]
        str     q13, [x1, -240]
        str     q12, [x1, -224]
        str     q11, [x1, -208]
        str     q10, [x1, -192]
        str     q9, [x1, -176]
        str     q8, [x1, -160]
        str     q7, [x1, -144]
        str     q6, [x1, -128]
        str     q5, [x1, -112]
        str     q4, [x1, -96]
        str     q3, [x1, -80]
        str     q2, [x1, -64]
        str     q1, [x1, -48]
        ldr     q31, [sp, 64]
        str     q0, [x1, -16]
        str     q31, [x1, -32]
        bcc     .L4

It uses too many registers and ends up spilling where really it shouldn't. It
could just load and store one or two vector registers at a time and also
interleave the loads and stores.
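
As a rough illustration of the shape meant here, the following C sketch copies
one 16-byte chunk at a time and stores it straight after loading it, so only a
single temporary is live at any point. It is not the attached testcase and not
compiler output; the names copy_interleaved and chunk16 are made up for this
example, and chunk16 merely stands in for one q register.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte unit standing in for one 128-bit vector register. */
typedef struct { unsigned char b[16]; } chunk16;

/* Copy nchunks 16-byte chunks, pairing each load with its store so that
   only one temporary is live at a time. */
void copy_interleaved(unsigned char *dst, const unsigned char *src,
                      size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        chunk16 t;
        memcpy(&t, src + 16 * i, sizeof t);   /* load one chunk */
        memcpy(dst + 16 * i, &t, sizeof t);   /* store it immediately */
    }
}
```

In the loop from the report, the 33 loads are all issued before the first store,
which is what keeps 33 values live at once.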

The problem is that the scheduler cannot interleave the loads and stores.
With -fsched-verbose=5 in the sched1 dump I see that there is an unexpected
dependency between the stores and the last load in the load section:
;;   --- Region Dependences --- b 5 bb 0 
;;      insn  code    bb   dep  prio  cost   reservation
;;      ----  ----    --   ---  ----  ----   -----------
<more loads>
;;      161  1016     5     0     8     6   ca57_load_model     : 212m 208 201 171n
;;      163  1016     5     0     8     6   ca57_load_model     : 212m 208 202 171n
;;      165  1016     5     0     8     6   ca57_load_model     : 212m 208 203 171n
;;      167  1016     5     0     8     6   ca57_load_model     : 212m 208 204 171n
;;      169  1016     5     0     7     6   ca57_load_model     : 212m 208 205 171n
;;      171  1016     5    32     7     6   ca57_load_model     : 212m 208 206 205nm 204nm 203nm 202nm 201nm 200nm 199nm 198nm 197nm 196nm 195nm 194nm 193nm 192nm 191nm 190nm 189nm 188nm 187nm 186nm 185nm 184nm 183nm 182nm 181nm 180nm 179nm 178nm 177nm 176nm 175nm 174nm
;;      174  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      175  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      176  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      177  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      178  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      179  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      180  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
;;      181  1016     5     2     2     0   ca57_store_model    : 212 209 205n 
<more stores>

Note how the last vector load (insn 171) says that all the stores following it
depend on it, so the scheduler cannot reorder any of the stores past it, which
hurts the live range reduction algorithms and leads to suboptimal register
allocation.

I think the vectoriser tagged the loading part of the section with some alias
set that makes the stores conflict with it, even though this is the versioned
part of the loop that assumes the source and destinations don't alias.

If I mark the p and dst pointers as restrict then this problem doesn't appear.
But even without restrict, the vectorised version already assumes that the
pointers don't alias, so I think it shouldn't exhibit this problem.
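
For readers without the attachment, here is a minimal sketch of what the
restrict variant might look like. It is a hypothetical stand-in, not the
attached testcase: the function name copy_blocks, the element type and the
inner loop are assumptions, and only the pointer names p and dst and the 528
bytes copied per iteration (33 q-register loads of 16 bytes each) come from the
report. Whether this exact loop reproduces the same code generation has not
been checked.

```c
#include <stddef.h>

/* Bytes copied per outer iteration; matches the 528-byte stride in the
   assembly above (assumed layout, not taken from the attachment). */
#define BLOCK 528

/* Hypothetical block copy with both pointers restrict-qualified, telling
   the compiler that dst and p never overlap. */
void copy_blocks(unsigned char *restrict dst,
                 const unsigned char *restrict p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < BLOCK; j++)
            dst[i * BLOCK + j] = p[i * BLOCK + j];
}
```

Dropping the two restrict qualifiers gives the form that relies on the
vectoriser's runtime alias check and loop versioning instead.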
