Re: Possible issue with ARC gcc 4.8

2015-07-06 Thread Vineet Gupta
On Monday 06 July 2015 12:04 PM, Richard Biener wrote:
>> The point being ARC ISA provides a neat feature where core only considers 
>> lower 5
>> > bits of bitpos operands. Thus we can make such behaviour not only 
>> > deterministic in
>> > the context of ARC, but also optimal, eliding the need for doing specific
>> > masking/clamping to 5 bits.
> There is SHIFT_COUNT_TRUNCATED which allows you to combine
> b & 31 with the shift value if you instead write a << (b & 31).

Awesome, this is what we need for ARC then along with the user code change to 
add
the masking so the combiner will nullify the masking.

>
> Of course a << 63 is still undefined behavior regardless of target behavior.

Sure

Thx,
-Vineet



Making __builtin_signbit type-generic

2015-07-06 Thread FX
Hi,

Many of the floating point-related builtins are type-generic, including 
__builtin_{isfinite,isinf_sign,isinf,isnan,isnormal,isgreater,islesser,isunordered}.
 However, __builtin_signbit is not. It would make life easier for the 
implementation of IEEE in Fortran if it were, and probably for some other stuff 
too (PR 36757).

I don’t know where to start and how to achieve that, though. Could someone who 
knows this middle-end FP stuff help?

Thanks,
FX

Re: Making __builtin_signbit type-generic

2015-07-06 Thread Uros Bizjak
On Mon, Jul 6, 2015 at 11:49 AM, FX  wrote:

> Many of the floating point-related builtins are type-generic, including 
> __builtin_{isfinite,isinf_sign,isinf,isnan,isnormal,isgreater,islesser,isunordered}.
>  However, __builtin_signbit is not. It would make life easier for the 
> implementation of IEEE in Fortran if it were, and probably for some other 
> stuff too (PR 36757).
>
> I don’t know where to start and how to achieve that, though. Could someone 
> who knows this middle-end FP stuff help?

Please look at builtins.def, grep for TYPEGENERIC.

Uros.


Re: Making __builtin_signbit type-generic

2015-07-06 Thread FX
> Please look at builtins.def, grep for TYPEGENERIC.

Sure, I can define it as TYPEGENERIC in the builtins.def, like this:

Index: builtins.def
===
--- builtins.def(revision 225434)
+++ builtins.def(working copy)
@@ -489,7 +489,7 @@ DEF_C99_BUILTIN(BUILT_IN_SCALBLN
 DEF_C99_BUILTIN(BUILT_IN_SCALBN, "scalbn", BT_FN_DOUBLE_DOUBLE_INT, 
ATTR_MATHFN_FPROUNDING_ERRNO)
 DEF_C99_BUILTIN(BUILT_IN_SCALBNF, "scalbnf", BT_FN_FLOAT_FLOAT_INT, 
ATTR_MATHFN_FPROUNDING_ERRNO)
 DEF_C99_BUILTIN(BUILT_IN_SCALBNL, "scalbnl", 
BT_FN_LONGDOUBLE_LONGDOUBLE_INT, ATTR_MATHFN_FPROUNDING_ERRNO)
-DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBIT, "signbit", BT_FN_INT_DOUBLE, 
ATTR_CONST_NOTHROW_LEAF_LIST)
+DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBIT, "signbit", BT_FN_INT_VAR, 
ATTR_CONST_NOTHROW_TYPEGENERIC_LEAF)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITF, "signbitf", BT_FN_INT_FLOAT, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITL, "signbitl", BT_FN_INT_LONGDOUBLE, 
ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_SIGNBITD32, "signbitd32", BT_FN_INT_DFLOAT32, 
ATTR_CONST_NOTHROW_LEAF_LIST)


But does that mean the middle-end will handle it completely?

FX

Uninitialized registers handling in the REE pass

2015-07-06 Thread Pierre-Marie de Rodat

Hello,

The attached reproducer[1] seems to trigger a code generation issue at 
least on x86_64-linux:


$ gnatmake -q p -O3 -gnatn
$ ./p

raised PROGRAM_ERROR : p.adb:9 explicit raise

The bottom line is that when Success is False in Q.Get (q.adb, around 
line 40) its value is clobbered during return. If we build with 
-fno-ree, we can see in the assembly code (near the return point for 
q__get) the following sequence:


movzwl  (%rax), %epb
...
somelabel:
...
movzwl  %bp, %ebp
...
salq$16, %rax
orq %rbp, %rax

However, without the -fno-ree switch the instruction:

movzwl  %bp, %ebp

is merged with its definition (the first movzwl instruction) by the REE 
pass. This is wrong since the definition does _not_ dominate this 
zero-extension: the first movzwl instruction can be by-passed through 
"somelabel".


I started to investigate this issue: the REE pass is currently not aware 
of this by-pass code path because it is reached by no proper definition 
(it brings an uninitialized %rbp register to the zero-extension). This 
happens in ree.c:get_defs; in our case (insn=second movzwl, reg=BP) 
DF_REF_CHAIN contains only one definition: the movzwl instruction.


Given the "somelabel" code path, I would rather expect DF_REF_CHAIN to 
hold a NULL reference to materialize the lack of initialization in one 
path. I found the DF_LR_IN/DF_LR_OUT macros from df.h: they provide 
information about uninitialized variables but the associated comment 
says they should not be used ("This intolerance should eventually be 
fixed."). Besides, they provide a basic-block-level information whereas 
we are rather interested in instruction-level information in REE.


I was thinking that REE may not be the only optimization pass needing 
this kind of information but I found no other one, so I'm not sure how 
this ought to be handled. Can anybody confirm my intuition about the 
NULL reference in DF_REF_CHAIN? I'm willing to implement it but I prefer 
first to be sure this is the right approach.


Thanks in advance!


[1] It's in Ada. I tried to translate it into C but any change in 
register allocation makes the issue disappear...


--
Pierre-Marie de Rodat
with Q; use Q;

procedure P is
  B : Boolean;
  E : Enum;
begin
  Get ("four", E, B);
  if B = True then
raise Program_Error;
  end if;
  Get ("three", E, B);
  if B = False then
raise Program_Error;
  end if;
  declare
A : Enum_Boolean_Array (One .. E) := (others => True);
  begin
Set (A);
  end;
end;
with Ada.Characters.Handling;
with Ada.Containers;
with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;

package body Q is

   type Enum_Name is array (Enum) of access constant String;

   Enum_Name_Table : constant Enum_Name := (
  One   => new String'("one"),
  Two   => new String'("two"),
  Three => new String'("three"));

   package String_To_Enum_Map is new Ada.Containers.Indefinite_Hashed_Maps
  (Key_Type => String, Element_Type => Enum,
   Hash => Ada.Strings.Hash, Equivalent_Keys => "=");

   function Fill_Hashed_Map return String_To_Enum_Map.Map is
  Res : String_To_Enum_Map.Map;
  use String_To_Enum_Map;
   begin
  for I in Enum_Name_Table'Range loop
 declare
Kind : constant String := Enum_Name_Table (I).all;
 begin
Res.Insert(Key => Kind, New_Item => I);
 end;
  end loop;
  return Res;
   end;

   String_To_Enum : constant String_To_Enum_Map.Map := Fill_Hashed_Map;

   procedure Get (Kind : String; Result : out Enum; Success : out Boolean) is
  X : constant String := Ada.Characters.Handling.To_Lower (Kind);
  use String_To_Enum_Map;
  Curs : constant Cursor := String_To_Enum.Find (X);
   begin
  Success := Curs /= No_Element;
  if Success then
 Result := Element(Curs);
  end if;
   end;

   procedure Set (A : Enum_Boolean_Array) is null;

end Q;
package Q is

   type Enum is (One, Two, Three);
   for Enum'Size use 16;

   type Enum_Boolean_Array is array (Enum range <>) of Boolean;

   procedure Get (Kind : String; Result : out Enum; Success : out Boolean);

   procedure Set (A : Enum_Boolean_Array);

end Q;


Re: Allocation of hotness of data structure with respect to the top of stack.

2015-07-06 Thread Jeff Law

On 07/05/2015 05:11 AM, Ajit Kumar Agarwal wrote:

All:

I am wondering allocation of hot data structure closer to the top of
the stack increases the performance of the application. The data
structure are identified as hot and cold data structure and all the
data structures are sorted in decreasing order of The hotness and the
hot data structure will be allocated closer to the top of the stack.

The load and store on accessing with respect to allocation of data
structure on stack will be faster with allocation of hot Data
structure closer to the top of the stack.

Based on the above the code is generated with respect to load and
store with the correct offset of the stack allocated on the
decreasing order of hotness.
You might want to look at this paper from an old gcc summit conference. 
 Basically they were trying to reorder stack slots to minimize offsets 
in reg+d addressing for hte SH port.  It should touch on a number of 
common issues/goals.



ftp://gcc.gnu.org/pub/gcc/summit/2003/Optimal%20Stack%20Slot%20Assignment.pdf


I can't recall if they ever tried to submit that work for inclusion.


Jeff


Re: rl78 vs cse vs memory_address_addr_space

2015-07-06 Thread Jeff Law

On 07/01/2015 10:14 PM, DJ Delorie wrote:

In this bit of code in explow.c:

   /* By passing constant addresses through registers
  we get a chance to cse them.  */
   if (! cse_not_expected && CONSTANT_P (x) && CONSTANT_ADDRESS_P (x))
 x = force_reg (address_mode, x);

On the rl78 it results in code that's a bit too complex for later
passes to be optimized fully.  Is there any way to indicate that the
above force_reg() is bad for a particular target?
I believe this used to be conditional on -fforce-mem or -fforce-reg or 
some such option that we deprecated long ago.


It'd be helpful if you could be more specific about what can't be 
handled.  combine for example was extended to handle larger chains of 
insns not terribly long ago.


jeff



Re: rl78 vs cse vs memory_address_addr_space

2015-07-06 Thread DJ Delorie

Given a test case like this:

typedef struct {
   unsigned char no0 :1;
   unsigned char no1 :1;
   unsigned char no2 :1;
   unsigned char no3 :1;
   unsigned char no4 :1;
   unsigned char no5 :1;
   unsigned char no6 :1;
   unsigned char no7 :1;
} __BITS8;
#define SFR0_bit (*(volatile union __BITS9 *)0x0)
#define SFREN SFR0_bit.no4

foo() {
   SFREN = 1U;
   SFREN = 0U;
}


(i.e. any code that sets/clears one bit in a volatile memory-mapped
area, which the rl78 has instructions for)

Before:

(insn 5 2 7 2 (set (reg/f:HI 43)
(const_int 240 [0xf0])) test.c:24 7 {*movhi_virt}
 (nil))
(insn 7 5 8 2 (set (reg:QI 45 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ])
(mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])) test.c:24 5 {movqi_virt}
 (nil))
(insn 8 7 9 2 (set (reg:QI 46)
(ior:QI (reg:QI 45 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ])
(const_int 16 [0x10]))) test.c:24 19 {*iorqi3_virt}
 (expr_list:REG_DEAD (reg:QI 45 [ MEM[(volatile union un_per0 
*)240B].BIT.no4 ])
(nil)))
(insn 9 8 12 2 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])
(reg:QI 46)) test.c:24 5 {movqi_virt}
 (expr_list:REG_DEAD (reg:QI 46)
(nil)))
(insn 12 9 13 2 (set (reg:QI 49 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ])
(mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])) test.c:26 5 {movqi_virt}
 (nil))
(insn 13 12 14 2 (set (reg:QI 50)
(and:QI (reg:QI 49 [ MEM[(volatile union un_per0 *)240B].BIT.no4 ])
(const_int -17 [0xffef]))) test.c:26 18 {*andqi3_virt}
 (expr_list:REG_DEAD (reg:QI 49 [ MEM[(volatile union un_per0 
*)240B].BIT.no4 ])
(nil)))
(insn 14 13 0 2 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])
(reg:QI 50)) test.c:26 5 {movqi_virt}
 (expr_list:REG_DEAD (reg:QI 50)
(expr_list:REG_DEAD (reg/f:HI 43)
(nil

Combine gets as far as this:

Trying 5 -> 9:
Failed to match this instruction:
(parallel [
(set (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])
(ior:QI (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union 
un_per0 *)240B].BIT.no4+0 S1 A16])
(const_int 16 [0x10])))
(set (reg/f:HI 43)
(const_int 240 [0xf0]))
])

(the set is left behind because it's used for the second assignment)

Both of those insns in the parallel are valid rl78 insns.  I tried
adding that parallel as a define-and-split but combine doesn't split
it at the point where it inserts it, so it doesn't work right.  If it
reduced those four instructions to the two in the parallel, but
without the parallel, it would probably work too.

We end up with code like this:

movwr8, #240 ; 5*movhi_real/4   [length 
= 4]
movwax, r8   ; 19   *movhi_real/5   [length 
= 4]
movwhl, ax   ; 21   *movhi_real/6   [length 
= 4]
set1[hl].4   ; 9*iorqi3_real/1  [length 
= 4]
clr1[hl].4   ; 14   *andqi3_real/1  [length 
= 4]

but what we want is this:

set1!240.4   ; 9*iorqi3_real/1  [length 
= 4]
clr1!240.4   ; 14   *andqi3_real/1  [length 
= 4]

( !240 means (mem (const_int 240)) )

(if there's only one such operation in a function, it combines
properly, likely because the address is not needed after the insn it
can combine, unlike the parallel above)

The common addresses are separated at least before lowering to RTL; as the 
initial
expansion has:

;; MEM[(volatile union un_per0 *)240B].BIT.no4 ={v} 1;

(insn 5 4 7 (set (reg/f:HI 43)
(const_int 240 [0xf0])) test.c:24 -1
 (nil))

(insn 7 5 8 (set (reg:QI 45)
(mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])) test.c:24 -1
 (nil))

(insn 8 7 9 (set (reg:QI 46)
(ior:QI (reg:QI 45)
(const_int 16 [0x10]))) test.c:24 -1
 (nil))

(insn 9 8 0 (set (mem/v/j:QI (reg/f:HI 43) [0 MEM[(volatile union un_per0 
*)240B].BIT.no4+0 S1 A16])
(reg:QI 46)) test.c:24 -1
 (nil))


Yes, I know gcc doesn't like combining volatile accesses into one
insn, but the rl78 backend (my copy at least) has predicates that
allow it, because it's safe on rl78.

Also, if I take out the "volatile" yet put some sort of barrier (like
a volatile asm) between the two assignments, it still fails, in the
same manner.


Re: rl78 vs cse vs memory_address_addr_space

2015-07-06 Thread Segher Boessenkool
On Mon, Jul 06, 2015 at 04:45:35PM -0400, DJ Delorie wrote:
> Combine gets as far as this:
> 
> Trying 5 -> 9:
> Failed to match this instruction:
> (parallel [
> (set (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union 
> un_per0 *)240B].BIT.no4+0 S1 A16])
> (ior:QI (mem/v/j:QI (const_int 240 [0xf0]) [0 MEM[(volatile union 
> un_per0 *)240B].BIT.no4+0 S1 A16])
> (const_int 16 [0x10])))
> (set (reg/f:HI 43)
> (const_int 240 [0xf0]))
> ])
> 
> (the set is left behind because it's used for the second assignment)
> 
> Both of those insns in the parallel are valid rl78 insns.  I tried
> adding that parallel as a define-and-split but combine doesn't split
> it at the point where it inserts it, so it doesn't work right.  If it
> reduced those four instructions to the two in the parallel, but
> without the parallel, it would probably work too.

Did you try just a define_split instead?  Ugly, but it should work I think.


Segher


Does GCC generate LDRD/STRD (Register) forms?

2015-07-06 Thread Anmol Paralkar (anmparal)
Hello,

Does GCC generate LDRD/STRD (Register) forms [A8.8.74/A8.8.211 per ARMv7-A
& ARMv7-R ARM]?

Based on various attempts to write code to get GCC to generate a sample
form, and subsequently inspecting the code I see in
config/arm/arm.c/output_move_double () & arm.md [GCC 4.9.2], I think that
these register based forms of LDRD/STRD are
not generated, but I thought it might be a good idea to ask on the list,
just in case.

Thanks,
Anmol.



Re: Does GCC generate LDRD/STRD (Register) forms?

2015-07-06 Thread Bin.Cheng
On Tue, Jul 7, 2015 at 10:05 AM, Anmol Paralkar (anmparal)
 wrote:
> Hello,
>
> Does GCC generate LDRD/STRD (Register) forms [A8.8.74/A8.8.211 per ARMv7-A
> & ARMv7-R ARM]?
>
> Based on various attempts to write code to get GCC to generate a sample
> form, and subsequently inspecting the code I see in
> config/arm/arm.c/output_move_double () & arm.md [GCC 4.9.2], I think that
> these register based forms of LDRD/STRD are
> not generated, but I thought it might be a good idea to ask on the list,
> just in case.
Register based LDRD is harder than immediate version.  ARM doesn't
support [base + reg + offset] addressing mode, so address computation
of the second memory reference is scattered both in and out of memory
reference.  To identify such opportunities, one needs to trace
registers in address expression the memory access instruction and does
some kind of value computation and re-association.

Thanks,
bin
>
> Thanks,
> Anmol.
>


Re: rl78 vs cse vs memory_address_addr_space

2015-07-06 Thread DJ Delorie

> Did you try just a define_split instead?  Ugly, but it should work I think.

It doesn't seem to be able to match a define_split :-(