https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104674

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, when emitting the __divmoddi4 call, expand_DIVMOD ->
ix86_expand_divmod_libfunc calls assign_386_stack_local (E_DImode, SLOT_TEMP)
to obtain a temporary stack slot for the remainder.
(mem:DI (plus:SI (frame) (const_int -8))) is what is returned and the IL looks
reasonable e.g. in vregs:
(insn 12 6 13 2 (parallel [
            (set (reg:SI 97)
                (plus:SI (reg/f:SI 19 frame)
                    (const_int -8 [0xfffffffffffffff8])))
            (clobber (reg:CC 17 flags))
        ]) 229 {*addsi_1}
     (nil))
...
(insn 19 18 20 2 (set (reg:DI 89 [ divmod_tmp_15 ])
        (reg:DI 0 ax)) 80 {*movdi_internal}
     (nil))
(insn 20 19 21 2 (set (reg:DI 90 [ divmod_tmp_15+8 ])
        (mem/c:DI (plus:SI (reg/f:SI 19 frame)
                (const_int -8 [0xfffffffffffffff8])) [0 S8 A64])) 80
{*movdi_internal}
     (nil))
...
(insn 25 24 26 2 (set (reg/v:DF 87 [ s ])
        (float:DF (reg:DI 89 [ divmod_tmp_15 ]))) "pr104674.c":8:10 214
{*floatdidf2_i387}
     (nil))
...
(insn 30 29 31 2 (set (reg:DF 98)
        (float:DF (reg:SI 104 [ divmod_tmp_15+8 ]))) "pr104674.c":9:14 207
{*floatsidf2}
     (expr_list:REG_DEAD (reg:SI 104 [ divmod_tmp_15+8 ])
        (nil)))
i.e. it first loads from the temporary slot and only afterwards does some
further operations on the results.
Later on that insn 20 becomes
(insn 67 19 21 2 (set (reg:SI 104 [ divmod_tmp_15+8 ])
        (mem/c:SI (plus:SI (reg/f:SI 19 frame)
                (const_int -8 [0xfffffffffffffff8])) [0 S4 A64])) 81
{*movsi_internal}
     (nil))
but it is still ok.
Combine propagates that memory load into a later insn though, so we have:
...
(insn 70 18 19 2 (set (reg:DI 106)
        (reg:DI 0 ax)) -1
     (expr_list:REG_DEAD (reg:DI 0 ax)
        (nil)))
...
(insn 25 24 26 2 (set (reg/v:DF 87 [ s ])
        (float:DF (reg:DI 106))) "pr104674.c":8:10 214 {*floatdidf2_i387}
     (expr_list:REG_DEAD (reg:DI 106)
        (nil)))
...
(insn 30 29 31 2 (set (reg:DF 98)
        (float:DF (mem/c:SI (plus:SI (reg/f:SI 19 frame)
                    (const_int -8 [0xfffffffffffffff8])) [0 S4 A64])))
"pr104674.c":9:14 207 {*floatsidf2}
     (nil))
i.e. effectively it extended the lifetime of the DImode SLOT_TEMP (well, the
low SImode part of it) across insn 25.
But then the split1 pass splits the:
(insn 25 24 26 2 (set (reg/v:DF 87 [ s ])
        (float:DF (reg:DI 106))) "pr104674.c":8:10 214 {*floatdidf2_i387}
     (expr_list:REG_DEAD (reg:DI 106)
        (nil)))
insn into:
(insn 72 24 26 2 (parallel [
            (set (reg/v:DF 87 [ s ])
                (float:DF (reg:DI 106)))
            (clobber (mem/c:DI (plus:SI (reg/f:SI 19 frame)
                        (const_int -8 [0xfffffffffffffff8])) [0 S8 A64]))
            (clobber (scratch:V4SI))
            (clobber (scratch:V4SI))
        ]) "pr104674.c":8:10 -1
     (nil))
and in doing so calls assign_386_stack_local (E_DImode, SLOT_TEMP), which
returns the same temporary slot, which is unfortunately live across that
instruction:
;; Avoid store forwarding (partial memory) stall penalty
;; by passing DImode value through XMM registers.  */

(define_split
  [(set (match_operand:X87MODEF 0 "register_operand")
        (float:X87MODEF (match_operand:DI 1 "register_operand")))]
  "!TARGET_64BIT && TARGET_INTER_UNIT_MOVES_TO_VEC
   && TARGET_80387 && X87_ENABLE_FLOAT (<X87MODEF:MODE>mode, DImode)
   && TARGET_SSE2 && optimize_function_for_speed_p (cfun)
   && can_create_pseudo_p ()"
  [(const_int 0)]
{
  emit_insn (gen_floatdi<mode>2_i387_with_xmm
             (operands[0], operands[1],
              assign_386_stack_local (DImode, SLOT_TEMP)));
  DONE;
})

From what I can see, SLOT_TEMP is used in:
i386.md:             assign_386_stack_local (DImode, SLOT_TEMP)));
i386.md:             assign_386_stack_local (DImode, SLOT_TEMP)));
sync.md:             assign_386_stack_local (DImode, SLOT_TEMP)));
sync.md:             assign_386_stack_local (DImode, SLOT_TEMP)));
i386-expand.cc:      target = assign_386_stack_local (SImode, SLOT_TEMP);
i386-expand.cc:      target = assign_386_stack_local (SImode, SLOT_TEMP);
i386-expand.cc:  rtx rem = assign_386_stack_local (mode, SLOT_TEMP);
and except for this define_split, all other uses are either in some builtin's
expansion or in a define_expand. Those look good, but for this define_split I
think nothing guarantees that SLOT_TEMP isn't live across the insn being
split, so we need to use a different SLOT_* kind there.
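
To make the hazard described above concrete, here is a small standalone C
model of it. This is only a sketch and not GCC code: assign_slot, toy_divmod,
toy_float_di, run and the SLOT_SPLIT_ONLY kind are hypothetical names invented
for illustration. The one assumption taken from the comment is that asking for
the same slot kind hands back the same storage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slot kinds.  Only SLOT_TEMP is named in the comment;
   SLOT_SPLIT_ONLY stands in for "a different SLOT_* kind".  */
enum slot_kind { SLOT_TEMP, SLOT_SPLIT_ONLY, MAX_SLOTS };

static int64_t slots[MAX_SLOTS];

/* Toy stand-in for assign_386_stack_local: the same kind always
   yields the same storage, so two users of one kind alias.  */
static int64_t *assign_slot(enum slot_kind kind)
{
    return &slots[kind];
}

/* Models the divmod libcall: the quotient is returned and the
   remainder is written to the caller-provided slot.  */
static int64_t toy_divmod(int64_t a, int64_t b, int64_t *rem)
{
    assert(b != 0);
    *rem = a % b;
    return a / b;
}

/* Models the split integer-to-float conversion, which passes its
   input through (and therefore clobbers) a scratch slot.  */
static double toy_float_di(int64_t v, enum slot_kind scratch_kind)
{
    int64_t *scratch = assign_slot(scratch_kind);
    *scratch = v;
    return (double) *scratch;
}

/* The remainder is read only after the conversion, mirroring the
   load that combine moved past insn 25.  */
static int64_t run(enum slot_kind split_kind)
{
    int64_t *rem = assign_slot(SLOT_TEMP);
    int64_t quot = toy_divmod(47, 10, rem);
    double s = toy_float_di(quot, split_kind);
    (void) s;
    return *rem;
}
```

In this model run (SLOT_TEMP) returns 4, the quotient that overwrote the
remainder, while run (SLOT_SPLIT_ONLY) returns the expected remainder 7.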