Hello! Vzeroupper placement patch uses MODE_EXIT to determine if vzeroupper has to be placed before function exit. However, when compiling following test
--cut here-- typedef struct objc_class *Class; typedef struct objc_object { Class class_pointer; } *id; typedef const struct objc_selector *SEL; typedef void * retval_t; typedef void * arglist_t; extern retval_t __objc_forward (id object, SEL sel, arglist_t args); double __objc_double_forward (id rcv, SEL op, ...) { void *args, *res; args = __builtin_apply_args (); res = __objc_forward (rcv, op, args); __builtin_return (res); } --cut here-- __builtin_return emits a sequence of loads from "argument area" to all possible function return registers in their widest mode (together with corresponding USE RTXes), creating: (insn 31 30 33 2 (set (reg:DI 0 ax) (mem:DI (reg/v/f:DI 60 [ res ]) [0 S8 A8])) avx-vzeroupper-27.c:23 -1 (nil)) (insn 33 31 35 2 (set (reg:XF 8 st) (mem:XF (plus:DI (reg/v/f:DI 60 [ res ]) (const_int 16 [0x10])) [0 S16 A8])) avx-vzeroupper-27.c:23 -1 (nil)) (insn 35 33 32 2 (set (reg:OI 21 xmm0) (mem:OI (plus:DI (reg/v/f:DI 60 [ res ]) (const_int 32 [0x20])) [0 S32 A8])) avx-vzeroupper-27.c:23 -1 (nil)) (insn 32 35 34 2 (use (reg:DI 0 ax)) avx-vzeroupper-27.c:23 -1 (nil)) (insn 34 32 36 2 (use (reg:XF 8 st)) avx-vzeroupper-27.c:23 -1 (nil)) (insn 36 34 44 2 (use (reg:OI 21 xmm0)) avx-vzeroupper-27.c:23 -1 (nil)) This sequence breaks assumption in mode-switching.c, that final function return register load operates in MODE_EXIT mode and triggere following code: for (j = n_entities - 1; j >= 0; j--) { int e = entity_map[j]; int mode = MODE_NEEDED (e, return_copy); if (mode != num_modes[e] && mode != MODE_EXIT (e)) break; } As discussed above, modes of loads, generated from __builtin_apply have no connection to function return mode. mode-switching.c does detect __builtin_apply situation and raises maybe_builtin_apply flag, but doesn't use it to short-circuit wrong check. In proposed patch, we detect this situation and raise force_late_switch in the same way, as SH4 does for its "late" fpscr emission. For the testcase above, following RTX sequence is generated: (insn 31 30 33 2 (set (reg:DI 0 ax) (mem:DI (reg/v/f:DI 60 [ res ]) [0 S8 A8])) avx-vzeroupper-27.c:23 63 {*movdi_internal_rex64} (nil)) (insn 33 31 35 2 (set (reg:XF 8 st) (mem:XF (plus:DI (reg/v/f:DI 60 [ res ]) (const_int 16 [0x10])) [0 S16 A8])) avx-vzeroupper-27.c:23 106 {*movxf_internal} (nil)) (insn 35 33 52 2 (set (reg:OI 21 xmm0) (mem:OI (plus:DI (reg/v/f:DI 60 [ res ]) (const_int 32 [0x20])) [0 S32 A8])) avx-vzeroupper-27.c:23 60 {*movoi_internal_avx} (expr_list:REG_DEAD (reg/v/f:DI 60 [ res ]) (nil))) (insn 52 35 49 2 (unspec_volatile [ (const_int 9 [0x9]) ] UNSPECV_VZEROUPPER) -1 (nil)) (note 49 52 32 2 NOTE_INSN_DELETED) (insn 32 49 34 2 (use (reg:DI 0 ax)) avx-vzeroupper-27.c:23 -1 (expr_list:REG_DEAD (reg:DI 0 ax) (nil))) (insn 34 32 36 2 (use (reg:XF 8 st)) avx-vzeroupper-27.c:23 -1 (expr_list:REG_DEAD (reg:XF 8 st) (nil))) (insn 36 34 44 2 (use (reg:OI 21 xmm0)) avx-vzeroupper-27.c:23 -1 (nil)) (insn 44 36 0 2 (use (reg/i:DF 21 xmm0)) avx-vzeroupper-27.c:24 -1 where mode switching correctly placed vzeroupper insn, based on MODE_EXIT mode, determined from final use in (insn 44). 2012-11-04 Vladimir Yakovlev <vladimir.b.yakov...@intel.com> Uros Bizjak <ubiz...@gmail.com> * mode-switching.c (create_pre_exit): Added code for maybe_builtin_apply case. Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu, with vzeroupper patch [1] applied. I have added SH4 maintainer for possible comments. OK for mainline SVN? [1] http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00292.html Uros. -- Index: mode-switching.c =================================================================== --- mode-switching.c (revision 193133) +++ mode-switching.c (working copy) @@ -342,6 +342,16 @@ create_pre_exit (int n_entities, int *entity_map, } if (j >= 0) { + /* __builtin_return emits a sequence of loads to all + function value registers in their widest mode, + which breaks the assumption on the mode of the + return register load. Allow this situation, so the + final mode switch will be emitted after the load. */ + if (maybe_builtin_apply + && targetm.calls.function_value_regno_p + (copy_start)) + forced_late_switch = 1; + /* For the SH4, floating point loads depend on fpscr, thus we might need to put the final mode switch after the return value copy. That is still OK,