In some cases, the work of the cse1 pass is counterproductive, as we noticed on s390x. The effect described below is present since at least 4.8.0. Note that this may not become manifest in a performance issue problem on all platforms. Also note that -O1 does not show this behaviour because the responsible code is only executed with -O2 or higher.
The core of the problem is the was cse1 sometimes handles function parameters. Roughly, the observed situation is Before cse1 start of function set pseudoreg Rp to the first argument from hardreg R2 (some code that uses Rp) set R2 to Rp After cse1: start of function set pseudoreg Rp to the first argument from hardreg R2 (some code that uses Rp) <--- The use of Rp is still present set R2 to R2 <--- cse1 has replaced Rp with R2 After that, the set pattern is removed completely, and now we have both, Rp and R2 live in the drafted code snippet. Because R2 ist still supposed to be live later on, the ira pass chooses a different hard register (R1) for Rp, and code to copy R1 back to R2 is added later. (See further down for Rtl and assembly code.) -- There seems to be code to prevent this in cse.c:hash_rtx_cb() as a comment from that function suggests: /* On some machines, we can't record any non-fixed hard register, because extending its life will cause reload problems. We consider ap, fp, sp, gp to be fixed for this purpose. ... Unfortunately this is not caused by hashing but by the code dealing with src_related in cse_insn(). When cse_insn() handles the "copy Rp to R2" instruction, it does nothing up to line 5020 and sets src_related there: /* This is the same as the destination of the insns, we want to prefer it. Copy it to src_related. The code below will then give it a negative cost. */ if (GET_CODE (dest) == code && rtx_equal_p (p->exp, dest)) src_related = dest; Eventually, the term src_related is used to replace the source expression of the set pattern. So, while the above comment may be applicable to hashed expressions that are considered for replacement, there's no such "safety net" for the expressions src_related, src_folded etc. I guess if there was, that would fix the issue. -- So, I've made an experimental hack (see attachment) and treid that. In a larger test suite, register copies could be saved in quite some places (including the test program below), but in other places new register copies were introduced, resulting in about twice as much "issues" as without the patch. Maybe the patch is just too coarse. In general I'd assume that the register allocator does a better job of assigning hard registers to pseudo registers. Is it possible to better describe when cse1 should keep its hands off pseudo registers? -- Example (complete filles in the attached .tgz file): --- foo.c --- extern long bar(long y); long foo(long x, long z) { if (x != 0) x = bar(x); return x + z; } --- foo.c --- Assembly output with three superfluous instructions: --- foo.s --- foo: stmg %r12,%r15,96(%r15) aghi %r15,-160 ltgr %r1,%r2 <--- unnecessary copy of r2 to r1 lgr %r12,%r3 je .L2 brasl %r14,bar <--- r2 th as argument, return value in r2 lgr %r1,%r2 <--- unnecessary copy of r2 to r1 .L2: agr %r1,%r12 <--- could use r2 instead of r1 lg %r4,272(%r15) lgr %r2,%r1 <--- unnecessary copy of r1 to r2 lmg %r12,%r15,256(%r15) br %r4 --- foo.s --- Rtl before cse1 pass: --- foo.c.195r.dfinit --- (insn 2 5 3 2 (set (reg/v:DI 61 [ x ]) (reg:DI 2 %r2 [ x ])) /home/vogt/foo.c:3 897 {*movdi_64} (nil)) ... (insn 7 4 8 2 (set (reg:CCZ 33 %cc) (compare:CCZ (reg/v:DI 61 [ x ]) (const_int 0 [0]))) /home/vogt/foo.c:4 846 {*tstdi_cconly2} (nil)) (jump_insn 8 7 9 2 (set (pc) (if_then_else (eq (reg:CCZ 33 %cc) (const_int 0 [0])) (label_ref 13) (pc))) /home/vogt/foo.c:4 1440 {*cjump_64} (int_list:REG_BR_PROB 6102 (nil)) -> 13) ... (insn 10 9 11 3 (set (reg:DI 2 %r2) (reg/v:DI 61 [ x ])) /home/vogt/foo.c:5 897 {*movdi_64} (nil)) (call_insn 11 10 12 3 (parallel [ (set (reg:DI 2 %r2) (call (mem:QI (symbol_ref:DI ("bar") [flags 0x41] <function_decl 0x3fff08a7c00 bar>) [0 bar S1 A8]) (const_int 0 [0]))) (clobber (reg:DI 14 %r14)) ]) /home/vogt/foo.c:5 1480 {*brasl_r} (expr_list:REG_CALL_DECL (symbol_ref:DI ("bar") [flags 0x41] <function_decl 0x3fff08a7c00 bar>) (nil)) (expr_list:DI (use (reg:DI 2 %r2)) (nil))) (insn 12 11 13 3 (set (reg/v:DI 61 [ x ]) (reg:DI 2 %r2)) /home/vogt/foo.c:5 897 {*movdi_64} (nil)) (code_label 13 12 14 4 2 "" [1 uses]) --- foo.c.195r.dfinit --- Cse1 replaces r61 with r2 in just one insn: --- foo.c.196r.cse1 --- (insn 10 9 11 3 (set (reg:DI 2 %r2) (reg:DI 2 %r2)) /home/vogt/foo.c:5 897 {*movdi_64} (expr_list:REG_DEAD (reg/v:DI 61 [ x ]) (nil))) --- foo.c.196r.cse1 --- Ciao Dominik ^_^ ^_^ -- Dominik Vogt IBM Germany
foo.tgz
Description: application/gtar-compressed
>From a7fc5b13c7f0eb59e0c0d1389d031acfae3f7c6b Mon Sep 17 00:00:00 2001 From: Dominik Vogt <v...@linux.vnet.ibm.com> Date: Mon, 12 Oct 2015 13:05:46 +0100 Subject: [PATCH] !!!attempt fix --- gcc/cse.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/gcc/cse.c b/gcc/cse.c index a9cc26a..74daf12 100644 --- a/gcc/cse.c +++ b/gcc/cse.c @@ -4600,10 +4600,16 @@ cse_insn (rtx_insn *insn) /* Set nonzero if we need to call force_const_mem on with the contents of src_folded before using it. */ int src_folded_force_flag = 0; + bool set_hard_reg_from_pseudo; dest = SET_DEST (sets[i].rtl); src = SET_SRC (sets[i].rtl); + set_hard_reg_from_pseudo = (src && REG_P (src) + && REGNO (src) >= FIRST_PSEUDO_REGISTER + && dest && REG_P (dest) + && REGNO (dest) < FIRST_PSEUDO_REGISTER); + /* If SRC is a constant that has no machine mode, hash it with the destination's machine mode. This way we can keep different modes separate. */ @@ -5168,6 +5174,13 @@ cse_insn (rtx_insn *insn) src_elt_cost = MAX_COST; } + if (set_hard_reg_from_pseudo + && trial && REG_P (trial) && REGNO (trial) == REGNO (dest)) + /* The register allocator is expected to do a better job of + replacing pseudo registers with hard registers a la + (set hardreg<n> pseudoreg<m>) -> (set hardreg<n> hardreg<n>) */ + continue; + /* Avoid creation of overlapping memory moves. */ if (MEM_P (trial) && MEM_P (SET_DEST (sets[i].rtl))) { -- 2.3.0