Hello Vladimir,

On s390x I have seen a testcase where IRA goes ballistic and loads a value
from the stack (160(%r15)) over and over again:

[...]
  82:   e3 80 f0 a0 00 04       lg      %r8,160(%r15)    <--
  88:   e3 b0 f0 a0 00 04       lg      %r11,160(%r15)   <--
  8e:   e3 c0 f0 a0 00 04       lg      %r12,160(%r15)   <--
  94:   e3 90 f0 a0 00 04       lg      %r9,160(%r15)    <--
  9a:   e3 10 f0 a0 00 04       lg      %r1,160(%r15)    <--
  a0:   e3 30 f0 a0 00 04       lg      %r3,160(%r15)    <--
  a6:   e3 70 80 00 00 95       llh     %r7,0(%r8)
  ac:   e3 00 b0 06 00 95       llh     %r0,6(%r11)
  b2:   e3 a0 c0 08 00 95       llh     %r10,8(%r12)
  b8:   e3 80 90 0a 00 95       llh     %r8,10(%r9)
  be:   e3 50 10 02 00 95       llh     %r5,2(%r1)
  c4:   b9 04 00 42             lgr     %r4,%r2
  c8:   e3 20 30 04 00 95       llh     %r2,4(%r3)
[...]

Afterwards all six addresses are used immediately as base addresses for
multiple memory accesses. So this testcase triggers 5 unnecessary loads from
the stack (and might even cause some delay due to address generation in the
pipeline, as the bypass stack has a limited number of entries).

The smallest testcase I could create out of the existing code is

-------------- snip ------------------------------
struct dummy {
  int a;
  int b;
} d;

static unsigned short *(*func) (unsigned short *,int, int, int, int);

extern int *field;
extern int sum;
extern unsigned short *p1, *p2;

void tester(void)
{
  unsigned short blocks[256], *orgp, *refp;
  int y, z;
  int part;
  unsigned short *x;
  int apply = ((d.a && (d.b == 0 || d.b == 1)) || d.b == 0);

  if (apply)
    x = p1;
  else
    x = p2;

  orgp = blocks;

  for (y = 0; y < 3; y++)
    {
      part = 0;
      for (z = 0; z < 3; z++)
        {
          refp = func(x, 0, 1, 2, 3);
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
          part += field[*refp++ - *orgp++];
        }
      sum = part*4;
    }
}
------------- snip ------------------------

and if compiled on s390x with

  -march=z9-109 -mtune=z10 -funroll-loops --param max-unrolled-insns=100 -O3

gcc creates the sequence above. The unrolling seems to be necessary to
trigger the right amount of register pressure.

Looking at the dumps, in 186r.sched we still have memory accesses from
address r103+2*x

[...]
(insn 65 61 72 8 tester.c:34 (set (reg:SI 457)
        (zero_extend:SI (mem:HI (reg/v/f:DI 103 [ orgp ]) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 72 65 79 8 tester.c:35 (set (reg:SI 462)
        (zero_extend:SI (mem:HI (plus:DI (reg/v/f:DI 103 [ orgp ])
                    (const_int 2 [0x2])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 79 72 86 8 tester.c:36 (set (reg:SI 467)
        (zero_extend:SI (mem:HI (plus:DI (reg/v/f:DI 103 [ orgp ])
                    (const_int 4 [0x4])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 86 79 93 8 tester.c:37 (set (reg:SI 472)
        (zero_extend:SI (mem:HI (plus:DI (reg/v/f:DI 103 [ orgp ])
                    (const_int 6 [0x6])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))
[...]

and so on, which then get all the additional loads in the 187r.ira step.

[...]
(insn 322 61 65 8 tester.c:34 (set (reg:DI 12 %r12)
        (mem/c:DI (plus:DI (reg/f:DI 15 %r15)
                (const_int 160 [0xa0])) [8 %sfp+-624 S8 A64])) 62 {*movdi_64} (nil))

(insn 65 322 323 8 tester.c:34 (set (reg:SI 12 %r12)
        (zero_extend:SI (mem:HI (reg:DI 12 %r12) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 323 65 324 8 tester.c:34 (set (mem/c:SI (plus:DI (reg/f:DI 15 %r15)
                (const_int 176 [0xb0])) [8 %sfp+-608 S4 A64])
        (reg:SI 12 %r12)) 66 {*movsi_zarch} (nil))

(insn 324 323 72 8 tester.c:35 (set (reg:DI 1 %r1)
        (mem/c:DI (plus:DI (reg/f:DI 15 %r15)
                (const_int 160 [0xa0])) [8 %sfp+-624 S8 A64])) 62 {*movdi_64} (nil))

(insn 72 324 325 8 tester.c:35 (set (reg:SI 1 %r1)
        (zero_extend:SI (mem:HI (plus:DI (reg:DI 1 %r1)
                    (const_int 2 [0x2])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 325 72 326 8 tester.c:35 (set (mem/c:SI (plus:DI (reg/f:DI 15 %r15)
                (const_int 192 [0xc0])) [8 %sfp+-592 S4 A32])
        (reg:SI 1 %r1)) 66 {*movsi_zarch} (nil))

(insn 326 325 79 8 tester.c:36 (set (reg:DI 2 %r2)
        (mem/c:DI (plus:DI (reg/f:DI 15 %r15)
                (const_int 160 [0xa0])) [8 %sfp+-624 S8 A64])) 62 {*movdi_64} (nil))

(insn 79 326 327 8 tester.c:36 (set (reg:SI 2 %r2)
        (zero_extend:SI (mem:HI (plus:DI (reg:DI 2 %r2)
                    (const_int 4 [0x4])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))

(insn 327 79 328 8 tester.c:36 (set (mem/c:SI (plus:DI (reg/f:DI 15 %r15)
                (const_int 196 [0xc4])) [8 %sfp+-588 S4 A32])
        (reg:SI 2 %r2)) 66 {*movsi_zarch} (nil))

(insn 328 327 86 8 tester.c:37 (set (reg:DI 3 %r3)
        (mem/c:DI (plus:DI (reg/f:DI 15 %r15)
                (const_int 160 [0xa0])) [8 %sfp+-624 S8 A64])) 62 {*movdi_64} (nil))

(insn 86 328 329 8 tester.c:37 (set (reg:SI 3 %r3)
        (zero_extend:SI (mem:HI (plus:DI (reg:DI 3 %r3)
                    (const_int 6 [0x6])) [2 S2 A16]))) 166 {*zero_extendhisi2_extimm} (nil))
[...]

If you look at the reload pass, you see that the original register 103 is
replaced by register 589 (probably for live range splitting) _after_ the
disposition was done:

[...]
New iteration of spill/restore move
      Changing RTL for loop 1 (header bb8)
  13 vs parent 13:
      Creating newreg=589 from oldreg=103
[...]

Both 103 and 589 seem to have pretty high costs assigned, but 589 is spilled
anyway:

[...]
      changing reg in insn 265
      Assigning 589(freq=6833) a new slot 20
 Register 589 now on stack.
[...]

Ignoring that 589 is spilled to the stack, all other local decisions seem to
make sense to me.

Any idea how to improve the allocation? Does the fact that 589 is created
after the disposition was made have an effect on the spilling decision?

If you need one of the dump files, let me know.

Christian