Re: Possible issue with ARC gcc 4.8
On Monday 06 July 2015 12:04 PM, Richard Biener wrote: >> The point being ARC ISA provides a neat feature where core only considers >> lower 5 >> > bits of bitpos operands. Thus we can make such behaviour not only >> > deterministic in >> > the context of ARC, but also optimal, eliding the need for doing specific >> > masking/clamping to 5 bits. > There is SHIFT_COUNT_TRUNCATED which allows you to combine > b & 31 with the shift value if you instead write a << (b & 31). Awesome, this is what we need for ARC then along with the user code change to add the masking so the combiner will nullify the masking. > > Of course a << 63 is still undefined behavior regardless of target behavior. Sure Thx, -Vineet
Making __builtin_signbit type-generic
Hi, Many of the floating point-related builtins are type-generic, including __builtin_{isfinite,isinf_sign,isinf,isnan,isnormal,isgreater,islesser,isunordered}. However, __builtin_signbit is not. It would make life easier for the implementation of IEEE in Fortran if it were, and probably for some other stuff too (PR 36757). I don’t know where to start and how to achieve that, though. Could someone who knows this middle-end FP stuff help? Thanks, FX
Re: Making __builtin_signbit type-generic
On Mon, Jul 6, 2015 at 11:49 AM, FX wrote: > Many of the floating point-related builtins are type-generic, including > __builtin_{isfinite,isinf_sign,isinf,isnan,isnormal,isgreater,islesser,isunordered}. > However, __builtin_signbit is not. It would make life easier for the > implementation of IEEE in Fortran if it were, and probably for some other > stuff too (PR 36757). > > I don’t know where to start and how to achieve that, though. Could someone > who knows this middle-end FP stuff help? Please look at builtins.def, grep for TYPEGENERIC. Uros.
Re: Making __builtin_signbit type-generic
> Please look at builtins.def, grep for TYPEGENERIC. Sure, I can define it as TYPEGENERIC in the builtins.def, like this: Index: builtins.def === --- builtins.def(revision 225434) +++ builtins.def(working copy) @@ -489,7 +489,7 @@ DEF_C99_BUILTIN(BUILT_IN_SCALBLN DEF_C99_BUILTIN(BUILT_IN_SCALBN, "scalbn", BT_FN_DOUBLE_DOUBLE_INT, ATTR_MATHFN_FPROUNDING_ERRNO) DEF_C99_BUILTIN(BUILT_IN_SCALBNF, "scalbnf", BT_FN_FLOAT_FLOAT_INT, ATTR_MATHFN_FPROUNDING_ERRNO) DEF_C99_BUILTIN(BUILT_IN_SCALBNL, "scalbnl", BT_FN_LONGDOUBLE_LONGDOUBLE_INT, ATTR_MATHFN_FPROUNDING_ERRNO) -DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBIT, "signbit", BT_FN_INT_DOUBLE, ATTR_CONST_NOTHROW_LEAF_LIST) +DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBIT, "signbit", BT_FN_INT_VAR, ATTR_CONST_NOTHROW_TYPEGENERIC_LEAF) DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITF, "signbitf", BT_FN_INT_FLOAT, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITL, "signbitl", BT_FN_INT_LONGDOUBLE, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITD32, "signbitd32", BT_FN_INT_DFLOAT32, ATTR_CONST_NOTHROW_LEAF_LIST) But does that mean the middle-end will handle it completely? FX
Uninitialized registers handling in the REE pass
Hello, The attached reproducer[1] seems to trigger a code generation issue at least on x86_64-linux: $ gnatmake -q p -O3 -gnatn $ ./p raised PROGRAM_ERROR : p.adb:9 explicit raise The bottom line is that when Success is False in Q.Get (q.adb, around line 40) its value is clobbered during return. If we build with -fno-ree, we can see in the assembly code (near the return point for q__get) the following sequence: movzwl (%rax), %epb ... somelabel: ... movzwl %bp, %ebp ... salq$16, %rax orq %rbp, %rax However, without the -fno-ree switch the instruction: movzwl %bp, %ebp is merged with its definition (the first movzwl instruction) by the REE pass. This is wrong since the definition does _not_ dominate this zero-extension: the first movzwl instruction can be by-passed through "somelabel". I started to investigate this issue: the REE pass is currently not aware of this by-pass code path because it is reached by no proper definition (it brings an uninitialized %rbp register to the zero-extension). This happens in ree.c:get_defs; in our case (insn=second movzwl, reg=BP) DF_REF_CHAIN contains only one definition: the movzwl instruction. Given the "somelabel" code path, I would rather expect DF_REF_CHAIN to hold a NULL reference to materialize the lack of initialization in one path. I found the DF_LR_IN/DF_LR_OUT macros from df.h: they provide information about uninitialized variables but the associated comment says they should not be used ("This intolerance should eventually be fixed."). Besides, they provide a basic-block-level information whereas we are rather interested in instruction-level information in REE. I was thinking that REE may not be the only optimization pass needing this kind of information but I found no other one, so I'm not sure how this ought to be handled. Can anybody confirm my intuition about the NULL reference in DF_REF_CHAIN? I'm willing to implement it but I prefer first to be sure this is the right approach. Thanks in advance! [1] It's in Ada. I tried to translate it into C but any change in register allocation makes the issue disappear... -- Pierre-Marie de Rodat with Q; use Q; procedure P is B : Boolean; E : Enum; begin Get ("four", E, B); if B = True then raise Program_Error; end if; Get ("three", E, B); if B = False then raise Program_Error; end if; declare A : Enum_Boolean_Array (One .. E) := (others => True); begin Set (A); end; end; with Ada.Characters.Handling; with Ada.Containers; with Ada.Containers.Indefinite_Hashed_Maps; with Ada.Strings.Hash; package body Q is type Enum_Name is array (Enum) of access constant String; Enum_Name_Table : constant Enum_Name := ( One => new String'("one"), Two => new String'("two"), Three => new String'("three")); package String_To_Enum_Map is new Ada.Containers.Indefinite_Hashed_Maps (Key_Type => String, Element_Type => Enum, Hash => Ada.Strings.Hash, Equivalent_Keys => "="); function Fill_Hashed_Map return String_To_Enum_Map.Map is Res : String_To_Enum_Map.Map; use String_To_Enum_Map; begin for I in Enum_Name_Table'Range loop declare Kind : constant String := Enum_Name_Table (I).all; begin Res.Insert(Key => Kind, New_Item => I); end; end loop; return Res; end; String_To_Enum : constant String_To_Enum_Map.Map := Fill_Hashed_Map; procedure Get (Kind : String; Result : out Enum; Success : out Boolean) is X : constant String := Ada.Characters.Handling.To_Lower (Kind); use String_To_Enum_Map; Curs : constant Cursor := String_To_Enum.Find (X); begin Success := Curs /= No_Element; if Success then Result := Element(Curs); end if; end; procedure Set (A : Enum_Boolean_Array) is null; end Q; package Q is type Enum is (One, Two, Three); for Enum'Size use 16; type Enum_Boolean_Array is array (Enum range <>) of Boolean; procedure Get (Kind : String; Result : out Enum; Success : out Boolean); procedure Set (A : Enum_Boolean_Array); end Q;
Re: Allocation of hotness of data structure with respect to the top of stack.
On 07/05/2015 05:11 AM, Ajit Kumar Agarwal wrote: All: I am wondering allocation of hot data structure closer to the top of the stack increases the performance of the application. The data structure are identified as hot and cold data structure and all the data structures are sorted in decreasing order of The hotness and the hot data structure will be allocated closer to the top of the stack. The load and store on accessing with respect to allocation of data structure on stack will be faster with allocation of hot Data structure closer to the top of the stack. Based on the above the code is generated with respect to load and store with the correct offset of the stack allocated on the decreasing order of hotness. You might want to look at this paper from an old gcc summit conference. Basically they were trying to reorder stack slots to minimize offsets in reg+d addressing for hte SH port. It should touch on a number of common issues/goals. ftp://gcc.gnu.org/pub/gcc/summit/2003/Optimal%20Stack%20Slot%20Assignment.pdf I can't recall if they ever tried to submit that work for inclusion. Jeff
Re: rl78 vs cse vs memory_address_addr_space
On 07/01/2015 10:14 PM, DJ Delorie wrote: In this bit of code in explow.c: /* By passing constant addresses through registers we get a chance to cse them. */ if (! cse_not_expected && CONSTANT_P (x) && CONSTANT_ADDRESS_P (x)) x = force_reg (address_mode, x); On the rl78 it results in code that's a bit too complex for later passes to be optimized fully. Is there any way to indicate that the above force_reg() is bad for a particular target? I believe this used to be conditional on -fforce-mem or -fforce-reg or some such option that we deprecated long ago. It'd be helpful if you could be more specific about what can't be handled. combine for example was extended to handle larger chains of insns not terribly long ago. jeff
Re: rl78 vs cse vs memory_address_addr_space
Given a test case like this: typedef struct { unsigned char no0 :1; unsigned char no1 :1; unsigned char no2 :1; unsigned char no3 :1; unsigned char no4 :1; unsigned char no5 :1; unsigned char no6 :1; unsigned char no7 :1; } __BITS8; #define SFR0_bit (*(volatile union __BITS9 *)0x0) #define SFREN SFR0_bit.no4 foo() { SFREN = 1U; SFREN = 0U; } (i.e. any code that sets/clears one bit in a volatile memory-mapped area, which the rl78 has instructions for) Before: (insn 5 2 7 2 (set (reg/f:HI 43) (const_int 240 [0xf0])) test.c:24 7 {*movhi_virt} (nil)) (insn 7 5 8 2 (set (reg:QI 45 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16])) test.c:24 5 {movqi_virt} (nil)) (insn 8 7 9 2 (set (reg:QI 46) (ior:QI (reg:QI 45 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (const_int 16 [0x10]))) test.c:24 19 {*iorqi3_virt} (expr_list:REG_DEAD (reg:QI 45 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (nil))) (insn 9 8 12 2 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16]) (reg:QI 46)) test.c:24 5 {movqi_virt} (expr_list:REG_DEAD (reg:QI 46) (nil))) (insn 12 9 13 2 (set (reg:QI 49 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16])) test.c:26 5 {movqi_virt} (nil)) (insn 13 12 14 2 (set (reg:QI 50) (and:QI (reg:QI 49 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (const_int -17 [0xffef]))) test.c:26 18 {*andqi3_virt} (expr_list:REG_DEAD (reg:QI 49 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ]) (nil))) (insn 14 13 0 2 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16]) (reg:QI 50)) test.c:26 5 {movqi_virt} (expr_list:REG_DEAD (reg:QI 50) (expr_list:REG_DEAD (reg/f:HI 43) (nil Combine gets as far as this: Trying 5 -> 9: Failed to match this instruction: (parallel [ (set (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16]) (ior:QI (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16]) (const_int 16 [0x10]))) (set (reg/f:HI 43) (const_int 240 [0xf0])) ]) (the set is left behind because it's used for the second assignment) Both of those insns in the parallel are valid rl78 insns. I tried adding that parallel as a define-and-split but combine doesn't split it at the point where it inserts it, so it doesn't work right. If it reduced those four instructions to the two in the parallel, but without the parallel, it would probably work too. We end up with code like this: movwr8, #240 ; 5*movhi_real/4 [length = 4] movwax, r8 ; 19 *movhi_real/5 [length = 4] movwhl, ax ; 21 *movhi_real/6 [length = 4] set1[hl].4 ; 9*iorqi3_real/1 [length = 4] clr1[hl].4 ; 14 *andqi3_real/1 [length = 4] but what we want is this: set1!240.4 ; 9*iorqi3_real/1 [length = 4] clr1!240.4 ; 14 *andqi3_real/1 [length = 4] ( !240 means (mem (const_int 240)) ) (if there's only one such operation in a function, it combines properly, likely because the address is not needed after the insn it can combine, unlike the parallel above) The common addresses are separated at least before lowering to RTL; as the initial expansion has: ;; MEM[(volatile union un_per0 *)240B].BIT.no4 ={v} 1; (insn 5 4 7 (set (reg/f:HI 43) (const_int 240 [0xf0])) test.c:24 -1 (nil)) (insn 7 5 8 (set (reg:QI 45) (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16])) test.c:24 -1 (nil)) (insn 8 7 9 (set (reg:QI 46) (ior:QI (reg:QI 45) (const_int 16 [0x10]))) test.c:24 -1 (nil)) (insn 9 8 0 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 *)240B].BIT.no4+0 S1 A16]) (reg:QI 46)) test.c:24 -1 (nil)) Yes, I know gcc doesn't like combining volatile accesses into one insn, but the rl78 backend (my copy at least) has predicates that allow it, because it's safe on rl78. Also, if I take out the "volatile" yet put some sort of barrier (like a volatile asm) between the two assignments, it still fails, in the same manner.
Re: rl78 vs cse vs memory_address_addr_space
On Mon, Jul 06, 2015 at 04:45:35PM -0400, DJ Delorie wrote: > Combine gets as far as this: > > Trying 5 -> 9: > Failed to match this instruction: > (parallel [ > (set (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union > un_per0 *)240B].BIT.no4+0 S1 A16]) > (ior:QI (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union > un_per0 *)240B].BIT.no4+0 S1 A16]) > (const_int 16 [0x10]))) > (set (reg/f:HI 43) > (const_int 240 [0xf0])) > ]) > > (the set is left behind because it's used for the second assignment) > > Both of those insns in the parallel are valid rl78 insns. I tried > adding that parallel as a define-and-split but combine doesn't split > it at the point where it inserts it, so it doesn't work right. If it > reduced those four instructions to the two in the parallel, but > without the parallel, it would probably work too. Did you try just a define_split instead? Ugly, but it should work I think. Segher
Does GCC generate LDRD/STRD (Register) forms?
Hello, Does GCC generate LDRD/STRD (Register) forms [A8.8.74/A8.8.211 per ARMv7-A & ARMv7-R ARM]? Based on various attempts to write code to get GCC to generate a sample form, and subsequently inspecting the code I see in config/arm/arm.c/output_move_double () & arm.md [GCC 4.9.2], I think that these register based forms of LDRD/STRD are not generated, but I thought it might be a good idea to ask on the list, just in case. Thanks, Anmol.
Re: Does GCC generate LDRD/STRD (Register) forms?
On Tue, Jul 7, 2015 at 10:05 AM, Anmol Paralkar (anmparal) wrote: > Hello, > > Does GCC generate LDRD/STRD (Register) forms [A8.8.74/A8.8.211 per ARMv7-A > & ARMv7-R ARM]? > > Based on various attempts to write code to get GCC to generate a sample > form, and subsequently inspecting the code I see in > config/arm/arm.c/output_move_double () & arm.md [GCC 4.9.2], I think that > these register based forms of LDRD/STRD are > not generated, but I thought it might be a good idea to ask on the list, > just in case. Register based LDRD is harder than immediate version. ARM doesn't support [base + reg + offset] addressing mode, so address computation of the second memory reference is scattered both in and out of memory reference. To identify such opportunities, one needs to trace registers in address expression the memory access instruction and does some kind of value computation and re-association. Thanks, bin > > Thanks, > Anmol. >
Re: rl78 vs cse vs memory_address_addr_space
> Did you try just a define_split instead? Ugly, but it should work I think. It doesn't seem to be able to match a define_split :-(