> On Apr 21, 2022, at 2:09 AM, Richard Biener <richard.guent...@gmail.com> > wrote: > > On Wed, Apr 20, 2022 at 6:02 PM Qing Zhao <qing.z...@oracle.com> wrote: >> >> >> >>> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guent...@gmail.com> >>> wrote: >>> >>> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.z...@oracle.com> wrote: >>>> >>>> >>>> >>>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guent...@gmail.com> >>>>> wrote: >>>>> >>>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.z...@oracle.com> wrote: >>>>>> >>>>>> Hi, Richard, >>>>>> >>>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t >>>>>> fixed this one yet, I was distracted by other tasks then just forgot >>>>>> this one….) >>>>>> >>>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener >>>>>>> <richard.guent...@gmail.com> wrote: >>>>>>> >>>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches >>>>>>> <gcc-patches@gcc.gnu.org> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <ja...@redhat.com> wrote: >>>>>>>>> >>>>>>>>> Hi! >>>>>>>>> >>>>>>>>> For IBM double double I've added in PR95450 and PR99648 verification >>>>>>>>> that >>>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a >>>>>>>>> REAL_CST >>>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and >>>>>>>>> verify it is the same. >>>>>>>>> This is because our real.c support isn't able to represent all valid >>>>>>>>> values >>>>>>>>> of IBM double double which has variable precision. >>>>>>>>> In PR104522, it has been noted that we have similar problem with the >>>>>>>>> Intel/Motorola extended XFmode formats, our internal representation >>>>>>>>> isn't >>>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and >>>>>>>>> unnormal >>>>>>>>> values. >>>>>>>>> So, the following patch is an attempt to extend that verification to >>>>>>>>> all >>>>>>>>> floats. >>>>>>>>> Unfortunately, it wasn't that straightforward, because the >>>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles >>>>>>>>> needs to >>>>>>>>> discover what bits are padding and does that by interpreting memory of >>>>>>>>> all 1s. That is actually a valid supported value, a qNaN with >>>>>>>>> negative >>>>>>>>> sign with all mantissa bits set, but the verification includes also >>>>>>>>> the >>>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure >>>>>>>>> out) >>>>>>>>> and so fails the comparison check and so we ICE. >>>>>>>>> The patch fixes that case by moving that verification from >>>>>>>>> native_interpret_real to its caller, so that clear_padding_type can >>>>>>>>> call native_interpret_real and avoid that extra check. >>>>>>>>> >>>>>>>>> With this, the only thing that regresses in the testsuite is >>>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times >>>>>>>>> long\\t-16843010 5 >>>>>>>>> because it decides to use a pattern that has non-zero bits in the >>>>>>>>> padding >>>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents >>>>>>>>> folding >>>>>>>>> a SUBREG into a constant. We emit (the testcase is -O0 but we emit >>>>>>>>> worse >>>>>>>>> code at all opt levels) something like: >>>>>>>>> movabsq $-72340172838076674, %rax >>>>>>>>> movabsq $-72340172838076674, %rdx >>>>>>>>> movq %rax, -48(%rbp) >>>>>>>>> movq %rdx, -40(%rbp) >>>>>>>>> fldt -48(%rbp) >>>>>>>>> fstpt -32(%rbp) >>>>>>>>> instead of >>>>>>>>> fldt .LC2(%rip) >>>>>>>>> fstpt -32(%rbp) >>>>>>>>> ... >>>>>>>>> .LC2: >>>>>>>>> .long -16843010 >>>>>>>>> .long -16843010 >>>>>>>>> .long 65278 >>>>>>>>> .long 0 >>>>>>>>> Note, neither of those sequences actually stores the padding bits, >>>>>>>>> fstpt >>>>>>>>> simply doesn't touch them. >>>>>>>>> For vars with clear_padding_real_needs_padding_p types that are >>>>>>>>> allocated >>>>>>>>> to memory at expansion time, I'd say much better would be to do the >>>>>>>>> stores >>>>>>>>> using integral modes rather than XFmode, so do that: >>>>>>>>> movabsq $-72340172838076674, %rax >>>>>>>>> movq %rax, -32(%rbp) >>>>>>>>> movq %rax, -24(%rbp) >>>>>>>>> directly. That is the only way to ensure the padding bits are >>>>>>>>> initialized >>>>>>>>> (or expand __builtin_clear_padding, but then you initialize >>>>>>>>> separately the >>>>>>>>> value bits and padding bits). >>>>>>>>> >>>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as >>>>>>>>> mentioned >>>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved. >>>>>>>> >>>>>>>> Thanks, I will try to fix this testing case in a later patch. >>>>>>> >>>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as >>>>>>> implemented makes any sense for non-integral types. >>>>>>> We end up with >>>>>>> initializing a register (SSA name) with >>>>>>> >>>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe) >>>>>>> >>>>>>> as we go building a TImode constant (we verified we have a TImode SET!) >>>>>>> but then >>>>>>> >>>>>>> /* Pun the LHS to make sure its type has constant size >>>>>>> unless it is an SSA name where that's already known. */ >>>>>>> if (TREE_CODE (lhs) != SSA_NAME) >>>>>>> lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs); >>>>>>> else >>>>>>> init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init); >>>>>>> ... >>>>>>> expand_assignment (lhs, init, false); >>>>>>> >>>>>>> and generally registers do not have any padding. This weird expansion >>>>>>> then causes us to spill the TImode constant and reload the XFmode value, >>>>>>> which is definitely not optimal here. >>>>>>> >>>>>>> One approach to avoid the worse code generation would be to use mode >>>>>>> specific patterns for registers (like using a NaN or a target specific >>>>>>> value that >>>>>>> can be loaded cheaply), >>>>>> >>>>>> You mean that using “mode specific patterns” ONLY for registers? >>>>>> Can we use “mode specific patterns” consistently for both registers and >>>>>> memory? >>>>> >>>>> The difference is that registers don't have padding while memory >>>>> possibly does, also >>>>> for aggregates using different patterns is too expensive IMHO. For >>>>> registers the extra >>>>> complication with generic patterns is that efficient initialization >>>>> (without going through memory) >>>>> should be a priority IMHO. >>>>> >>>>> And for stack memory I still think that initializing the full >>>>> allocated frame (including padding >>>>> between variables!) is the best approach. >>>>> >>>>>> LLVM use different patterns for different types (Integral, Float, >>>>>> pointer type, etc…) in order to >>>>>> Catch bugs easily for different types. >>>>>> >>>>>> The beginning versions of this patch tried to do similar as LLVM, but >>>>>> later we gave up this and >>>>>> Use the same pattern for all different types. >>>>>> >>>>>> If now, we want to use “mode specific patterns” for registers, I am >>>>>> wondering whether it’s >>>>>> Better to just use “mode specific patterns” consistently for both >>>>>> registers and memory? >>>>>> >>>>>>> another would be to simply fall back to zero >>>>>>> initialization >>>>>>> when we fail to constant fold the initializer like with >>>>>>> >>>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc >>>>>>> index 8b1733e20c4..a4b995f71e4 100644 >>>>>>> --- a/gcc/internal-fn.cc >>>>>>> +++ b/gcc/internal-fn.cc >>>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt) >>>>>>> if (TREE_CODE (lhs) != SSA_NAME) >>>>>>> lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs); >>>>>>> else >>>>>>> - init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), >>>>>>> init); >>>>>>> + { >>>>>>> + init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), >>>>>>> init); >>>>>>> + if (!CONSTANT_CLASS_P (init)) >>>>>>> + init = build_zero_cst (TREE_TYPE (lhs)); >>>>>>> + } >>>>>>> } >>>>>>> else >>>>>>> /* Use zero-init also for variable-length sizes. */ >>>>>>> >>>>>>> note that still does not address the issue noted by Jakub that we do not >>>>>>> initialize padding this way - but of course that's because we expand a >>>>>>> special assignment from .DEFERRED_INIT and are not initializing >>>>>>> the storage allocated for the variable on the stack (at -O0) by RTL >>>>>>> expansion. Ideally .DEFERRED_INIT "expansion" for stack variables >>>>>>> would take place there, simply filling the allocated frame with the >>>>>>> pattern or zero. That would probably mean that RTL expansion >>>>>>> should scoop up .DEFERRED_INITs for variables it decides not to >>>>>>> expand to pseudos and not leave that to stmt expansion. >>>>>> >>>>>> Need to study this a little bit more... >>>> >>>> >>>> So, Is what you mean in the above a complete rewrite of the current >>>> “expand_DEFERRED_INIT”: >>>> >>>> Instead of simply using “expand_builtin_memset” for “variables that are >>>> not in register” and “expand_assignment” >>>> for “variables that are in register”, RTL expansion should directly >>>> expand this call in a lower level: >>>> >>>> i.e, >>>> >>>> tree lhs = gimple_call_lhs (stmt); >>>> … >>>> >>>> rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); >>>> >>>> If (MEM_P (target)) // the variable is allocated on stack >>>> { >>>> // filling the allocated frame with pattern or zero. How to do it?? >>>> } >>>> else // the variable is in pseudo register. >>>> { >>>> rtx init_rtx = …; >>>> emit_move_insn (target, init_rtx) >>>> } >>>> >>>> >>>> Is the above a correct understanding? >>> >>> No, I would simply expand the automatic in-memory var .DEFERRED_INIT >>> calls to nothing >>> but instead process their effect at the time we do >>> expand_one_stack_var, which means at >>> CFG expansion time scan for the .DEFERRED_INITs and record whether and how >>> to >>> initialize the variables. We can then group the variables accordingly >>> and block-initialize >>> the stack space also covering padding inbetween variables. >> >> So, the above only covers the variables that will be allocated in stack? > > Yes. > >> For the variables that are in pseudo registers, we still need to simply >> expand the initialization to a move insn? > > Yes, also for stack VLAs.
Why for stack VLAs? > >> And for the variables in pseudo registers, later after register allocation, >> some of them will be spilled into stack, too. >> Specially for long double/complex double variables that might have padding, >> shall we specially handle them during that time? > > Ideally we'd never spill uninitialized variables but definitely the > auto-init can cause the need to spill that very > value. That means that auto-init of registers should happen after RA? All the initialization of registers should happen after RA? Only the ones with “long double/complex double” type that might have padding? > One would need to see what > IRA/LRA do to uninitialized pseudo uses. Computing may uninit uses > after RA should be possible and > hopefully all ISAs allow initializing hard registers without requiring > a scratch register (heh - I'm almost sure > there will be exceptions …) Need more study here…. Qing > > Richard. > >> Qing >>> >>> Richard. >>> >>>> Thanks. >>>> >>>> Qing >>>> >>>>>> >>>>>>> >>>>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened >>>>>>> PR105259 so we can revisit the above for GCC 13. >>>>>> >>>>>> I can further study this bug and try to come up with a good solution in >>>>>> gcc13. >>>>>> >>>>>> Thank you! >>>>>> >>>>>> Qing >>>>>>> >>>>>>> Richard. >>>>>>> >>>>>>>> Qing