https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64793
Oleg Endo <olegendo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kkojima at gcc dot gnu.org

--- Comment #1 from Oleg Endo <olegendo at gcc dot gnu.org> ---
This is caused by the fake annulled conditional true branches.  Applying this:

Index: gcc/config/sh/sh.md
===================================================================
--- gcc/config/sh/sh.md	(revision 220708)
+++ gcc/config/sh/sh.md	(working copy)
@@ -593,20 +593,9 @@
   [(and (eq_attr "in_delay_slot" "yes")
	(eq_attr "type" "!pstore,prget")) (nil) (nil)])
 
-;; Say that we have annulled true branches, since this gives smaller and
-;; faster code when branches are predicted as not taken.
-
-;; ??? The non-annulled condition should really be "in_delay_slot",
-;; but insns that can be filled in non-annulled get priority over insns
-;; that can only be filled in anulled.
-
 (define_delay
-  (and (eq_attr "type" "cbranch")
-       (match_test "TARGET_SH2"))
-  ;; SH2e has a hardware bug that pretty much prohibits the use of
-  ;; annulled delay slots.
-  [(eq_attr "cond_delay_slot" "yes")
-   (and (eq_attr "cond_delay_slot" "yes")
-	(not (eq_attr "cpu" "sh2e"))) (nil)])
+  (and (eq_attr "type" "cbranch") (match_test "TARGET_SH2"))
+  [(eq_attr "cond_delay_slot" "yes") (nil) (nil)])
 
 ;; -------------------------------------------------------------------------
 ;; SImode signed integer comparisons

results in the expected code (a C sketch that reproduces this pattern is at
the end of this comment):

        mov     r5,r0
        mov.b   @(r0,r4),r1
        mov     r1,r0
        cmp/eq  #92,r0
        bt      .L3
        rts
        mov     r7,r0
        .align 1
.L3:
        rts
        mov     r6,r0

The downside is that code size increases on average.  CSiBE shows a total
increase of 3371399 -> 3372451 bytes (+1052 / +0.031204 %), even though there
are also individual code size decreases.

It also seems that this catches more cases of cbranches with delay slots
that were previously missed.  For example, in blocksort.c (fallbackSort):

before:
.L275:
        cmp/pl  r3
        bf      .L23
        mov.l   @(28,r15),r4
        mov     #0,r0
        mov.l   @(16,r15),r2

after:
.L275:
        cmp/pl  r3
        bf/s    .L23
        mov     #0,r0
        mov.l   @(28,r15),r4
        mov.l   @(16,r15),r2

The code size increase is caused by insns that now get duplicated into
multiple delay slots, such as the cmp/hi here:

before:
        bf      .L315
        ...
        bf      .L315
        ...
        bf      .L315
        ...
.L315:
        cmp/hi  r13,r12
        bra     .L308
        movt    r0

after:
        bf/s    .L322
        cmp/hi  r13,r12
        ...
        bf/s    .L322
        cmp/hi  r13,r12
        ...
        bf/s    .L322
        cmp/hi  r13,r12
        ...
.L322:
        bra     .L307
        movt    r0

In a similar way, the builtin strcmp code results in sequences such as
(a sketch is at the end of this comment):

        bt/s    .L67
        sett
        mov.b   @r1+,r2
        tst     r2,r2
        bt/s    .L67
        sett

The sh_optimize_sett_clrt pass does not eliminate the sett insn because the
T bit does not have the same value on all incoming paths, and so the sett
ends up being copied into the delay slots.

There's an old comment from r9888:

;; Say that we have annulled true branches, since this gives smaller and
;; faster code when branches are predicted as not taken.

I don't know what this comment is based on.  Branch prediction was added on
SH4A, which came a long time after that comment.  Maybe it refers to the
fact that conditional branches are faster on SH when they are not taken.
Public SH2 documentation states that taken bf/s and bt/s branches take 2
cycles, while taken bf and bt branches take 3 cycles.  In both cases the
branch insns take 1 cycle if they don't branch.  (Some rough cycle
arithmetic based on these numbers is at the end of this comment.)

Looking at other documentation (ST40-300, SH4A), it seems that using the
delay-slot variants also has a higher chance of executing the branch and
the delay-slot insn in parallel.

Kaz, if you have some time, could you please do a CSiBE runtime comparison
with/without the patch above?  I'm tempted to apply the patch above and drop
the fake annulled delay slot insns.
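
For reference, here are the sketches referenced above.  First, a C function
that should reproduce the "expected code" pattern.  This is a hypothetical
reconstruction based only on the register usage in the asm (r4..r7 hold the
first four arguments on SH), not necessarily the actual test case of this PR:

/* Hypothetical reconstruction, not this PR's verbatim test case.
   With r4 = s, r5 = i, r6 = x, r7 = y in the SH calling convention,
   and 92 being '\\', compiling this at -O2 for SH2 should give the
   cmp/eq #92 / bt sequence quoted above.  */
int
test (const char *s, int i, int x, int y)
{
  return s[i] == 92 ? x : y;
}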
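
Second, the builtin strcmp sequences come from GCC expanding strcmp inline
via the SH cmpstr patterns.  A minimal sketch that should trigger the
expansion at -O2 (the function name is made up; whether the call is actually
expanded inline depends on the usual conditions such as optimization level
and argument alignment):

/* Sketch: when GCC expands this call inline, the resulting cmp/str
   loop contains the bt/s + sett sequences quoted above.  */
int
uses_strcmp (const char *a, const char *b)
{
  return __builtin_strcmp (a, b);
}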
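
Finally, the rough cycle arithmetic, using the public SH2 numbers (taken
bt/bf: 3 cycles, taken bt/s/bf/s: 2 cycles, not taken: 1 cycle).  This
assumes the delay-slot insn completes within the 2 cycles of the taken
branch, which is how I read the documentation:

taken, no delay slot:
        bt      .L1             ! 3 cycles
        ...
.L1:    insn                    ! 1 cycle at the target -> 4 cycles total

taken, filled delay slot:
        bt/s    .L1             ! 2 cycles; the delay-slot insn
        insn                    ! executes within them -> 2 cycles total

When the branch is not taken, both forms spend 1 cycle on the branch, and
the insn costs 1 cycle whether it sits in the delay slot or after the
branch.  Note that SH has no hardware annulment, so the delay-slot insn
always executes and has to be useful, or at least harmless, on the
fall-through path as well, which is presumably why the duplicated insns
above appear.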