Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Jens Maurer
On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:

>>  - the "you can add/subtract integral values" still opens you up to
>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
>> dependency, which it clearly doesn't. But language-lawyering it does,
>> since all those operations (cast to pointer, cast to integer,
>> subtracting an integer) claim to be dependency-preserving operations.

[...]

> There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
> but in that case, if the resulting pointer happens by chance to reference 
> valid memory, I believe a dependency would still be carried.
[...]

From a language lawyer standpoint, pointer arithmetic is only valid
within an array.  These examples seem to go beyond the bounds of the
array and therefore have undefined behavior.

C++ standard section 5.7 paragraph 4
"If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined."

C99 and C11
identical phrasing in 6.5.6 paragraph 8
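
As an illustration of the rule quoted above, here is a minimal sketch (not from the thread) of which pointer results are defined for an array of N elements: anything from the first element up to one past the last. The names `buf`, `in_bounds` and `span` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* hypothetical array length */

static char buf[N];

/* Returns 1 when buf + off is a defined pointer value, i.e. when it
 * points at an element of buf or one past its last element. */
static int in_bounds(ptrdiff_t off)
{
    return off >= 0 && off <= (ptrdiff_t)N;
}

/* Distance between two in-bounds positions.  Both operands and the
 * result stay inside the same array object, so this is well defined. */
static ptrdiff_t span(ptrdiff_t from, ptrdiff_t to)
{
    assert(in_bounds(from) && in_bounds(to));  /* precondition */

    const char *a = buf + from;
    const char *b = buf + to;
    return b - a;
}
```

Expressions such as "(char *)ptr - (intptr_t)ptr" would almost always fail the `in_bounds` test, which is why they fall under the "otherwise" clause.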

Jens


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-20 Thread Ilya Enkovich
On 19 May 11:22, Vladimir Makarov wrote:
> On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
> >2015-05-06 17:18 GMT+03:00 Ilya Enkovich :
> >Hi Vladimir,
> >
> >Could you please comment on this?
> >
> >
> Ilya, I think that the idea is worth trying but results might be
> mixed.  It is hard to say until you actually try it (as an example, Jan
> implemented -fpmath=both and it looks like a pretty good idea, at least
> to me, but when I checked SPEC2000 the results were not so good even
> with IRA/LRA).
> 
> Long ago I did some experiments and found that spilling into SSE
> would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
> I also found that storing several scalar values into one SSE reg and
> extracting them when you need to do some (fp) arithmetic would be
> beneficial for AMD but not for Intel CPUs.  In the literature, the more
> general approach is called a bitwise register allocator.  Actually it
> would be a pretty big IRA/LRA project from which some targets might
> benefit.

I suspect such things are not trivially done in IRA/LRA, and I want to make it
an independent optimization because its application seems to be quite narrow.

> 
> 
> As for the wrong code, it is hard for me to say anything w/o RA
> dumps.  If you send me the dump (-fira-verbose=16), I might be able to
> say more about what is going on.
> 
> 

Here are some dumps from my reproducer.  The problematic register is r108.

Thanks,
Ilya

;; Function test (test, funcdef_no=0, decl_uid=1933, cgraph_uid=0, symbol_order=0)

scanning new insn with uid = 79.
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
df_worklist_dataflow_doublequeue:n_basic_blocks 5 n_edges 6 count 5 (1)
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
Reg 119: local to bb 2 def dominates all uses has unique first use
Reg 125 uninteresting
Reg 118: local to bb 2 def dominates all uses has unique first use
Reg 126 uninteresting
Reg 127 uninteresting
Found def insn 26 for 119 to be not moveable
;; 2 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4
;;
;; Loop 1
;;  header 3, latch 3
;;  depth 1, outer 0
;;  nodes: 3
;; 2 succs { 3 4 }
;; 3 succs { 3 4 }
;; 4 succs { 1 }
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
init_insns for 117: (insn_list:REG_DEP_TRUE 22 (nil))


test

Dataflow summary:
;;  invalidated by call  0 [ax] 1 [dx] 2 [cx] 8 [st] 9 [st(1)] 10 
[st(2)] 11 [st(3)] 12 [st(4)] 13 [st(5)] 14 [st(6)] 15 [st(7)] 17 [flags] 18 
[fpsr] 19 [fpcr] 21 [xmm0] 22 [xmm1] 23 [xmm2] 24 [xmm3] 25 [xmm4] 26 [xmm5] 27 
[xmm6] 28 [xmm7] 29 [mm0] 30 [mm1] 31 [mm2] 32 [mm3] 33 [mm4] 34 [mm5] 35 [mm6] 
36 [mm7] 37 [] 38 [] 39 [] 40 [] 41 [] 42 [] 43 [] 44 [] 45 [] 46 [] 47 [] 48 
[] 49 [] 50 [] 51 [] 52 [] 53 [] 54 [] 55 [] 56 [] 57 [] 58 [] 59 [] 60 [] 61 
[] 62 [] 63 [] 64 [] 65 [] 66 [] 67 [] 68 [] 69 [] 70 [] 71 [] 72 [] 73 [] 74 
[] 75 [] 76 [] 77 [] 78 [] 79 [] 80 []
;;  hardware regs used   7 [sp] 16 [argp] 20 [frame]
;;  regular block artificial uses 6 [bp] 7 [sp] 16 [argp] 20 [frame]
;;  eh block artificial uses 6 [bp] 7 [sp] 16 [argp] 20 [frame]
;;  entry block defs 0 [ax] 1 [dx] 2 [cx] 6 [bp] 7 [sp] 16 [argp] 20 
[frame] 21 [xmm0] 22 [xmm1] 23 [xmm2] 29 [mm0] 30 [mm1] 31 [mm2]
;;  exit block uses  6 [bp] 7 [sp] 20 [frame]
;;  regs ever live   3[bx] 7[sp] 17[flags]
;;  ref usage   r0={2d} r1={2d} r2={2d} r3={1d,1u} r6={1d,4u} r7={1d,7u} 
r8={1d} r9={1d} r10={1d} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} 
r16={1d,4u,1e} r17={5d,2u} r18={1d} r19={1d} r20={1d,4u} r21={2d} r22={2d} 
r23={2d} r24={1d} r25={1d} r26={1d} r27={1d} r28={1d} r29={2d} r30={2d} 
r31={2d} r32={1d} r33={1d} r34={1d} r35={1d} r36={1d} r37={1d} r38={1d} 
r39={1d} r40={1d} r41={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} 
r47={1d} r48={1d} r49={1d} r50={1d} r51={1d} r52={1d} r53={1d} r54={1d} 
r55={1d} r56={1d} r57={1d} r58={1d} r59={1d} r60={1d} r61={1d} r62={1d} 
r63={1d} r64={1d} r65={1d} r66={1d} r67={1d} r68={1d} r69={1d} r70={1d} 
r71={1d} r72={1d} r73={1d} r74={1d} r75={1d} r76={1d} r77={1d} r78={1d} 
r79={1d} r80={1d} r107={1d,1u} r108={2d,4u} r117={2d,5u,2e} r118={1d,1u} 
r119={1d,1u} r123={2d,3u} r124={2d,3u} r125={1d,1u} r126={1d,1u} r127={1d,1u} 
r128={2d,2u} r129={2d,2u} 
;;total ref usage 160{110d,47u,3e} in 25{24 regular + 1 call} insns.
(note 21 0 24 NOTE_INSN_DELETED)
(note 24 21 79 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn/f 79 24 22 2 (parallel [
(set (reg:SI 107)
(unspec:SI [
(const_int 0 [0])
] UNSPEC_SET_GOT))
(clobber (reg:CC 17 flags))
]) 694 {set_got}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(expr_list:REG_EQUIV (unspec:SI [
(const_int 0 [0])
] UNSPEC_SET_GOT)
(expr_list:REG_CFA_FLUSH_QUEUE (nil)
(nil))

Re: optimization question

2015-05-20 Thread Richard Biener
On Mon, May 18, 2015 at 10:01 PM, mark maule  wrote:
> I have a loop which hangs when compiled with -O2, but runs fine when
> compiled with -O1.  Not sure what information is required to get an answer,
> so starting with the full src code.  I have not attempted to reduce to a
> simpler test case yet.
>
> Attachments:
>
> bs_destage.c - full source code
> bs_destage.dis.O2 - gdb disassembly of bs_destageLoop()
> bs_destage.dis+m.O2 - src annotated version of the above
>
> The function in question is bs_destageSearch().  When I compile bs_destage.c
> with -O2, it seems that the dgHandle condition at line 741 is being ignored,
> leading to an infinite loop.  I can see in the disassembly that dgHandle is
> still in the code as a 16-bit value stored at 0x32(%rsp), and a running
> 32-bit copy stored at 0x1c(%rsp).  I can also see that the 16 bit version at
> 0x32(%rsp) is being incremented at the end of the loop, but I don't see
> anywhere in the code where either version of dgHandle is being used when
> determining if the while() at 741 should be continued.
>
> I'm not very familiar with the optimizations that are done in O2 vs O1, or
> even what happens in these optimizations.
>
> So, I'm wondering if this is a bug, or a subtle valid optimization that I
> don't understand.  Any help would be appreciated.
>
> Note:  changing the declaration of dgHandle to be volatile appears to modify
> the code sufficiently that it looks like the dgHandle check is honored (have
> not tested).
>
> Thanks in advance for any help/advice.

The usual issue with this kind of behavior is out-of-bound accesses of
arrays in a loop
or invoking undefined behavior when signed integer operations wrap.


   uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];

and

  while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) &&
...
 dgDestageOut = bs_destageData.outLun[ dgHandle ];

looks like this might access outLun[BS_CFG_DRIVE_GROUPS] which is
out-of-bounds.
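
A minimal sketch of the pattern, assuming a much smaller array than the real code: `N_GROUPS`, `out_lun` and `first_busy_group` are hypothetical stand-ins for BS_CFG_DRIVE_GROUPS, outLun[] and the loop in bs_destageSearch(). It shows the loop with the bound corrected so that the index never reaches the array length.

```c
#include <assert.h>
#include <stdint.h>

#define N_GROUPS 4  /* hypothetical stand-in for BS_CFG_DRIVE_GROUPS */

static uint32_t out_lun[N_GROUPS] = { 0, 0, 7, 0 };

/* Return the index of the first group with outstanding work, or
 * N_GROUPS if there is none. */
static uint16_t first_busy_group(void)
{
    uint16_t h = 0;

    /* The bound is N_GROUPS, not N_GROUPS + 1.  With the larger bound the
     * loop could read out_lun[N_GROUPS], which is undefined; a compiler
     * may then assume h < N_GROUPS always holds and drop the bound test
     * altogether. */
    while (h < N_GROUPS && out_lun[h] == 0)
        h++;

    assert(h <= N_GROUPS);  /* h never runs past the array length */
    return h;
}
```

With the bound written this way the index test is the only thing that stops the loop at the end of the array, so the optimizer has no grounds to remove it.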

Richard.

> Mark Maule
>
> gcc version:
>
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.3/lto-wrapper
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla
> --enable-bootstrap --enable-shared --enable-threads=posix
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-gnu-unique-object
> --enable-linker-build-id --with-linker-hash-style=gnu
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto
> --enable-plugin --enable-initfini-array --disable-libgcj
> --with-isl=/builddir/build/BUILD/gcc-4.8.3-20140911/obj-x86_64-redhat-linux/isl-install
> --with-cloog=/builddir/build/BUILD/gcc-4.8.3-20140911/obj-x86_64-redhat-linux/cloog-install
> --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64
> --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
>


Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Richard Biener
On Wed, May 20, 2015 at 9:34 AM, Jens Maurer  wrote:
> On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
>> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
>
>>>  - the "you can add/subtract integral values" still opens you up to
>>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
>>> dependency, which it clearly doesn't. But language-lawyering it does,
>>> since all those operations (cast to pointer, cast to integer,
>>> subtracting an integer) claim to be dependency-preserving operations.
>
> [...]
>
>> There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
>> but in that case, if the resulting pointer happens by chance to reference
>> valid memory, I believe a dependency would still be carried.
> [...]
>
> From a language lawyer standpoint, pointer arithmetic is only valid
> within an array.  These examples seem to go beyond the bounds of the
> array and therefore have undefined behavior.
>
> C++ standard section 5.7 paragraph 4
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."
>
> C99 and C11
> identical phrasing in 6.5.6 paragraph 8

Of course you can try to circumvent that by doing
(char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).

Which (IMHO) gets you into the standard language that only makes conversion of
the exact same integer back to a pointer well-defined(?)
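
For reference, a minimal sketch of the one round trip that is clearly covered: converting a pointer to intptr_t (via void *) and converting that unchanged integer back. The names `sample` and `round_trip` are hypothetical, and whether an integer recomputed by arithmetic that cancels out counts as "the exact same integer" is the open question raised above, not something this sketch settles.

```c
#include <assert.h>
#include <stdint.h>

static char sample[2];  /* hypothetical object to point at */

/* Convert a pointer to an integer and back without changing the integer.
 * For intptr_t, the result of this exact round trip (via void *) compares
 * equal to the original pointer. */
static char *round_trip(char *p)
{
    intptr_t bits = (intptr_t)(void *)p;
    char *q = (char *)(void *)bits;

    assert(q == p);
    return q;
}
```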

Richard.

> Jens


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Will Deacon
Hi Paul,

On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> > On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds wrote:
> > So I think you're better off just saying that operations designed to
> > drop significant bits break the dependency chain, and give things like
> > "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
> > 
> > Making that just an extension of your existing "& 0" language would
> > seem to be natural.
> 
> Works for me!  I added the following bullet to the list of things
> that break dependencies:
> 
>   If a pointer is part of a dependency chain, and if the values
>   added to or subtracted from that pointer cancel the pointer
>   value so as to allow the compiler to precisely determine the
>   resulting value, then the resulting value will not be part of
>   any dependency chain.  For example, if p is part of a dependency
>   chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> 
> Seem reasonable?

Whilst I understand what you're saying (the ARM architecture makes these
sorts of distinctions when calling out dependency-based ordering), it
feels like we're dangerously close to defining the difference between a
true and a false dependency. If we want to do this in the context of the
C language specification, you run into issues because you need to evaluate
the program in order to determine data values in order to determine the
nature of the dependency.

You tackle this above by saying "to allow the compiler to precisely
determine the resulting value", but I can't see how that can be cleanly
fitted into something like the C language specification. Even if it can,
then we'd need to reword the "?:" treatment that you currently have:

  "If a pointer is part of a dependency chain, and that pointer appears
   in the entry of a ?: expression selected by the condition, then the
   chain extends to the result."

which I think requires the state of the condition to be known statically
if we only want to extend the chain from the selected expression. In the
general case, wouldn't a compiler have to assume that the chain is
extended from both?

Additionally, what about the following code?

  char *x = y ? z : z;

Does that extend a dependency chain from z to x? If so, I can imagine a
CPU breaking that in practice.
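
A minimal sketch of why this case is awkward, using hypothetical names (`target`, `select_same`): with identical arms the result is z whatever y is, so nothing forces the generated code to consult y at all. The sketch only demonstrates the value equivalence; it says nothing about what a given CPU does with the dependency.

```c
#include <assert.h>

static char target;  /* hypothetical object that z points at */

/* Both arms of the conditional are z, so the result equals z for
 * every value of y. */
static char *select_same(int y, char *z)
{
    char *x = y ? z : z;

    assert(x == z);
    return x;
}
```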

> > Humans will understand, and compiler writers won't care. They will
> > either depend on hardware semantics anyway (and argue that your
> > language is tight enough that they don't need to do anything special)
> > or they will turn the consume into an acquire (on platforms that have
> > too weak hardware).
> 
> Agreed.  Plus Core Working Group will hammer out the exact wording,
> should this approach meet their approval.

For the avoidance of doubt, I'm completely behind any attempts to tackle
this problem, but I anticipate an uphill struggle getting this text into
the C standard. Is your intention to change the carries-a-dependency
relation to encompass this change?

Cheers,

Will


Re: [c++std-parallel-1616] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 09:34:10AM +0200, Jens Maurer wrote:
> On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> > On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
> 
> >>  - the "you can add/subtract integral values" still opens you up to
> >> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> >> dependency, which it clearly doesn't. But language-lawyering it does,
> >> since all those operations (cast to pointer, cast to integer,
> >> subtracting an integer) claim to be dependency-preserving operations.
> 
> [...]
> 
> > There are some stranger examples, such as "(char *)ptr - ((intptr_t)ptr)/7",
> > but in that case, if the resulting pointer happens by chance to reference 
> > valid memory, I believe a dependency would still be carried.
> [...]
> 
> From a language lawyer standpoint, pointer arithmetic is only valid
> within an array.  These examples seem to go beyond the bounds of the
> array and therefore have undefined behavior.
> 
> C++ standard section 5.7 paragraph 4
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."
> 
> C99 and C11
> identical phrasing in 6.5.6 paragraph 8

Even better!  I added a footnote calling out these two paragraphs.

Thanx, Paul



Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 11:03:00AM +0200, Richard Biener wrote:
> On Wed, May 20, 2015 at 9:34 AM, Jens Maurer  wrote:
> > On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
> >> On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
> >
> >>>  - the "you can add/subtract integral values" still opens you up to
> >>> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> >>> dependency, which it clearly doesn't. But language-lawyering it does,
> >>> since all those operations (cast to pointer, cast to integer,
> >>> subtracting an integer) claim to be dependency-preserving operations.
> >
> > [...]
> >
> >> There are some stranger examples, such as "(char *)ptr - 
> >> ((intptr_t)ptr)/7",
> >> but in that case, if the resulting pointer happens by chance to reference
> >> valid memory, I believe a dependency would still be carried.
> > [...]
> >
> > From a language lawyer standpoint, pointer arithmetic is only valid
> > within an array.  These examples seem to go beyond the bounds of the
> > array and therefore have undefined behavior.
> >
> > C++ standard section 5.7 paragraph 4
> > "If both the pointer operand and the result point to elements of the
> > same array object, or one past the last element of the array object,
> > the evaluation shall not produce an overflow; otherwise, the behavior
> > is undefined."
> >
> > C99 and C11
> > identical phrasing in 6.5.6 paragraph 8
> 
> Of course you can try to circumvent that by doing
> (char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
> (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).
> 
> Which (IMHO) gets you into the standard language that only makes conversion of
> the exact same integer back to a pointer well-defined(?)

I am feeling good about leaving the restriction and calling out
the two paragraphs in a footnote, then.  ;-)

Thanx, Paul



Re: optimization question

2015-05-20 Thread mark maule



On 5/20/2015 3:27 AM, Richard Biener wrote:
> On Mon, May 18, 2015 at 10:01 PM, mark maule  wrote:
>> I have a loop which hangs when compiled with -O2, but runs fine when
>> compiled with -O1.  Not sure what information is required to get an answer,
>> so starting with the full src code.  I have not attempted to reduce to a
>> simpler test case yet.
>>
>> Attachments:
>>
>> bs_destage.c - full source code
>> bs_destage.dis.O2 - gdb disassembly of bs_destageLoop()
>> bs_destage.dis+m.O2 - src annotated version of the above
>>
>> The function in question is bs_destageSearch().  When I compile bs_destage.c
>> with -O2, it seems that the dgHandle condition at line 741 is being ignored,
>> leading to an infinite loop.  I can see in the disassembly that dgHandle is
>> still in the code as a 16-bit value stored at 0x32(%rsp), and a running
>> 32-bit copy stored at 0x1c(%rsp).  I can also see that the 16 bit version at
>> 0x32(%rsp) is being incremented at the end of the loop, but I don't see
>> anywhere in the code where either version of dgHandle is being used when
>> determining if the while() at 741 should be continued.
>>
>> I'm not very familiar with the optimizations that are done in O2 vs O1, or
>> even what happens in these optimizations.
>>
>> So, I'm wondering if this is a bug, or a subtle valid optimization that I
>> don't understand.  Any help would be appreciated.
>>
>> Note:  changing the declaration of dgHandle to be volatile appears to modify
>> the code sufficiently that it looks like the dgHandle check is honored (have
>> not tested).
>>
>> Thanks in advance for any help/advice.
>
> The usual issue with this kind of behavior is out-of-bound accesses of
> arrays in a loop
> or invoking undefined behavior when signed integer operations wrap.
>
>    uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];
>
> and
>
>    while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) &&
> ...
>   dgDestageOut = bs_destageData.outLun[ dgHandle ];
>
> looks like this might access outLun[BS_CFG_DRIVE_GROUPS] which is
> out-of-bounds.
>
> Richard.


You are correct, and when I change outLun[] to be size 
BS_CFG_DRIVE_GROUPS+1, the generated asm looks like it will account for 
dgHandle in the while() loop.  I will pass this back to our development 
team to get a proper fix.


Now, the follow-on:  Something in the compiler/optimizer recognized this
out of bounds situation - should a warning have been emitted? Or are
there ambiguities which make generating a warning here inappropriate?


And an additional question:  It still seems wrong to omit the dgHandle 
check from the while() condition vs. leaving it in and letting the code 
access beyond the end of the array.  Is this one of those areas where if 
there's a bug in the code all bets are off and your mileage may vary?


Thanks to everyone who helped me out here.

Mark



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> > > On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds wrote:
> > > So I think you're better off just saying that operations designed to
> > > drop significant bits break the dependency chain, and give things like
> > > "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
> > > 
> > > Making that just an extension of your existing "& 0" language would
> > > seem to be natural.
> > 
> > Works for me!  I added the following bullet to the list of things
> > that break dependencies:
> > 
> > If a pointer is part of a dependency chain, and if the values
> > added to or subtracted from that pointer cancel the pointer
> > value so as to allow the compiler to precisely determine the
> > resulting value, then the resulting value will not be part of
> > any dependency chain.  For example, if p is part of a dependency
> > chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > 
> > Seem reasonable?
> 
> Whilst I understand what you're saying (the ARM architecture makes these
> sorts of distinctions when calling out dependency-based ordering), it
> feels like we're dangerously close to defining the difference between a
> true and a false dependency. If we want to do this in the context of the
> C language specification, you run into issues because you need to evaluate
> the program in order to determine data values in order to determine the
> nature of the dependency.

Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(&gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency.  So I am not seeing any particular danger.
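
A compilable sketch of the same idea, under stated assumptions: `storage`, `gp`, `gq` and `cancelled` are hypothetical definitions added so the fragment stands alone, and the subtraction is parenthesized so that the cancellation happens in integer arithmetic rather than as out-of-bounds pointer arithmetic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static char storage[2];                   /* hypothetical backing objects */
static _Atomic(char *) gp = &storage[0];
static char *gq = &storage[1];

static char *cancelled(void)
{
    char *p = atomic_load_explicit(&gp, memory_order_consume);

    /* (intptr_t)p - (intptr_t)p is 0 for every p, so q is always gq and
     * an optimizer is free to compute it without looking at p. */
    char *q = gq + ((intptr_t)p - (intptr_t)p);

    assert(q == gq);
    return q;
}
```

Because the result is known without reference to p, the proposed wording treats q as outside any dependency chain headed by the consume load.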

> You tackle this above by saying "to allow the compiler to precisely
> determine the resulting value", but I can't see how that can be cleanly
> fitted into something like the C language specification.

I am sure that there will be significant rework from where this document
is to language appropriate for the standard.  Which is why I am glad
that Jens is taking an interest in this, as he is particularly good at
producing standards language.

>  Even if it can,
> then we'd need to reword the "?:" treatment that you currently have:
> 
>   "If a pointer is part of a dependency chain, and that pointer appears
>    in the entry of a ?: expression selected by the condition, then the
>    chain extends to the result."
> 
> which I think requires the state of the condition to be known statically
> if we only want to extend the chain from the selected expression. In the
> general case, wouldn't a compiler have to assume that the chain is
> extended from both?

In practice, yes, if the compiler cannot determine which expression is
selected, it must arrange for the dependency to be carried from either,
depending on the run-time value of the condition.  But you would have
to work pretty hard to create code that did not carry the dependencies
as required, no?

> Additionally, what about the following code?
> 
>   char *x = y ? z : z;
> 
> Does that extend a dependency chain from z to x? If so, I can imagine a
> CPU breaking that in practice.

I am not seeing this.  I would expect the compiler to optimize to
something like this:

char *x = z;

How does this avoid carrying the dependency?  Or are you saying that
ARM loses the dependency via a store to memory and a later reload?
That would be a bit surprising...

> > > Humans will understand, and compiler writers won't care. They will
> > > either depend on hardware semantics anyway (and argue that your
> > > language is tight enough that they don't need to do anything special)
> > > or they will turn the consume into an acquire (on platforms that have
> > > too weak hardware).
> > 
> > Agreed.  Plus Core Working Group will hammer out the exact wording,
> > should this approach meet their approval.
> 
> For the avoidance of doubt, I'm completely behind any attempts to tackle
> this problem, but I anticipate an uphill struggle getting this text into
> the C standard. Is your intention to change the carries-a-dependency
> relation to encompass this change?

I completely agree that this won't be easy, but this is the task at hand.
And yes, the intent is to change carries-a-dependency, given that the
current wording isn't helping anything.  ;-)

Thanx, Paul



Re: optimization question

2015-05-20 Thread Andrew Haley
On 05/20/2015 01:04 PM, mark maule wrote:
> Is this one of those areas where if 
> there's a bug in the code all bets are off and your mileage may vary?

Yes.  Do not access beyond the end of an array: daemons may fly out
of your nose. [1]

Andrew.

[1] 
https://groups.google.com/forum/?hl=en#!msg/comp.std.c/ycpVKxTZkgw/S2hHdTbv4d8J



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney  wrote:

> > Additionally, what about the following code?
> > 
> >   char *x = y ? z : z;
> > 
> > Does that extend a dependency chain from z to x? If so, I can imagine a
> > CPU breaking that in practice.
> 
> I am not seeing this.  I would expect the compiler to optimize to
> something like this:
> 
>   char *x = z;

Why?  What if y has a potential side-effect (say it makes a function call)?

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:18:37PM +0100, David Howells wrote:
> Paul E. McKenney  wrote:
> 
> > > Additionally, what about the following code?
> > > 
> > >   char *x = y ? z : z;
> > > 
> > > Does that extend a dependency chain from z to x? If so, I can imagine a
> > > CPU breaking that in practice.
> > 
> > I am not seeing this.  I would expect the compiler to optimize to
> > something like this:
> > 
> > char *x = z;
> 
> Why?  What if y has a potential side-effect (say it makes a function call)?

I was thinking of "y" as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;
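
A minimal sketch of that transformation, assuming a hypothetical condition `y()` whose side effect is to bump a counter (`calls`, `target`, `with_conditional` and `reduced` are likewise made up for illustration). Both forms evaluate y() exactly once and yield z.

```c
#include <assert.h>

static int calls;      /* counts evaluations of the condition */
static char target;    /* hypothetical object that z points at */

/* Hypothetical condition with a side effect. */
static int y(void)
{
    return ++calls & 1;
}

/* The form from the question: y() must still be evaluated. */
static char *with_conditional(char *z)
{
    return y() ? z : z;
}

/* The form a compiler could reduce it to: evaluate y() for its side
 * effect, then copy z. */
static char *reduced(char *z)
{
    char *x;

    (void)y();
    x = z;
    assert(x == z);
    return x;
}
```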

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney  wrote:

> I was thinking of "y" as a simple variable, but if it is something more
> complex, then the compiler could do this, right?
> 
>   char *x;
> 
>   y;
>   x = z;

Yeah.  I presume it has to maintain the ordering, though.

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Ramana Radhakrishnan



On 20/05/15 14:37, David Howells wrote:
> Paul E. McKenney  wrote:
>
>> I was thinking of "y" as a simple variable, but if it is something more
>> complex, then the compiler could do this, right?
>>
>>   char *x;
>>
>>   y;
>>   x = z;
>
> Yeah.  I presume it has to maintain the ordering, though.

The scheduler for e.g. is free to reorder if it can prove there is no
dependence (or indeed side-effects for y) between insns produced for y
and `x = z'.

regards
Ramana

> David



Re: [c++std-parallel-1624] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:37:05PM +0100, David Howells wrote:
> Paul E. McKenney  wrote:
> 
> > I was thinking of "y" as a simple variable, but if it is something more
> > complex, then the compiler could do this, right?
> > 
> > char *x;
> > 
> > y;
> > x = z;
> 
> Yeah.  I presume it has to maintain the ordering, though.

Agreed.  Unless of course y writes to x or some such.

Given that there is already code in the Linux kernel relying on
dependencies being carried through stores to local variables,
this should not be a problem.

Or am I missing something?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
> 
> 
> On 20/05/15 14:37, David Howells wrote:
> >Paul E. McKenney  wrote:
> >
> >>I was thinking of "y" as a simple variable, but if it is something more
> >>complex, then the compiler could do this, right?
> >>
> >>char *x;
> >>
> >>y;
> >>x = z;
> >
> >Yeah.  I presume it has to maintain the ordering, though.
> 
> The scheduler for e.g. is free to reorder if it can prove there is
> no dependence (or indeed side-effects for y) between insns produced
> for y and `x = z'.

So for example, if y is independent of z, the compiler can do the
following:

char *x;

x = z;
y;

But the dependency ordering is still maintained from z to x, so this
is not a problem.

Or am I missing something subtle here?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Ramana Radhakrishnan



On 20/05/15 15:03, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
>>
>> On 20/05/15 14:37, David Howells wrote:
>>> Paul E. McKenney  wrote:
>>>
>>>> I was thinking of "y" as a simple variable, but if it is something more
>>>> complex, then the compiler could do this, right?
>>>>
>>>>   char *x;
>>>>
>>>>   y;
>>>>   x = z;
>>>
>>> Yeah.  I presume it has to maintain the ordering, though.
>>
>> The scheduler for e.g. is free to reorder if it can prove there is
>> no dependence (or indeed side-effects for y) between insns produced
>> for y and `x = z'.
>
> So for example, if y is independent of z, the compiler can do the
> following:
>
>   char *x;
>
>   x = z;
>   y;
>
> But the dependency ordering is still maintained from z to x, so this
> is not a problem.

Well, reads if any of x (assuming x was initialized elsewhere) would
need to happen before x got assigned to z.

I understood the original "maintain the ordering" as between the
statements `x = z' and `y'.

> Or am I missing something subtle here?

No, it sounds like we are on the same page here.

regards
Ramana

> Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 03:15:48PM +0100, Ramana Radhakrishnan wrote:
> 
> 
> On 20/05/15 15:03, Paul E. McKenney wrote:
> >On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
> >>
> >>
> >>On 20/05/15 14:37, David Howells wrote:
> >>>Paul E. McKenney  wrote:
> >>>
> >>>>I was thinking of "y" as a simple variable, but if it is something more
> >>>>complex, then the compiler could do this, right?
> >>>>
> >>>>  char *x;
> >>>>
> >>>>  y;
> >>>>  x = z;
> >>>
> >>>Yeah.  I presume it has to maintain the ordering, though.
> >>
> >>The scheduler for e.g. is free to reorder if it can prove there is
> >>no dependence (or indeed side-effects for y) between insns produced
> >>for y and `x = z'.
> >
> >So for example, if y is independent of z, the compiler can do the
> >following:
> >
> > char *x;
> >
> > x = z;
> > y;
> >
> >But the dependency ordering is still maintained from z to x, so this
> >is not a problem.
> 
> 
> Well, reads if any of x (assuming x was initialized elsewhere) would
> need to happen before x got assigned to z.

Agreed, there needs to be a memory_order_consume load up there somewhere.
(AKA rcu_dereference().)
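
A minimal C11 sketch of that shape, using plain atomics as a stand-in for rcu_dereference(): the names `node`, `slot`, `published`, `publish` and `read_value` are hypothetical. The consume load heads the chain, and the copy into the local x followed by the dereference continues it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { int value; };               /* hypothetical payload */

static struct node slot;
static _Atomic(struct node *) published;  /* starts out NULL */

/* Writer: fill in the node, then publish it with release semantics. */
static void publish(int v)
{
    slot.value = v;
    atomic_store_explicit(&published, &slot, memory_order_release);
}

/* Reader: the consume load heads the dependency chain; copying the
 * pointer into the local x and dereferencing x continues it. */
static int read_value(void)
{
    struct node *z = atomic_load_explicit(&published, memory_order_consume);
    struct node *x = z;

    if (x == NULL)
        return -1;          /* nothing published yet */
    assert(x == &slot);
    return x->value;
}
```

As noted earlier in the thread, an implementation is free to treat the consume load as an acquire, in which case the ordering holds regardless of how the dependency is tracked.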

> I understood the original "maintain the ordering" as between the
> statements `x = z' and `y'.

Ah, I was assuming between x and z.  David, what was your intent?  ;-)

> >Or am I missing something subtle here?
> 
> No, it sounds like we are on the same page here.

Whew!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney  wrote:

> Ah, I was assuming between x and z.  David, what was your intent?  ;-)

Clarification.

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Will Deacon
On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > >   If a pointer is part of a dependency chain, and if the values
> > >   added to or subtracted from that pointer cancel the pointer
> > >   value so as to allow the compiler to precisely determine the
> > >   resulting value, then the resulting value will not be part of
> > >   any dependency chain.  For example, if p is part of a dependency
> > >   chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > > 
> > > Seem reasonable?
> > 
> > Whilst I understand what you're saying (the ARM architecture makes these
> > sorts of distinctions when calling out dependency-based ordering), it
> > feels like we're dangerously close to defining the difference between a
> > true and a false dependency. If we want to do this in the context of the
> > C language specification, you run into issues because you need to evaluate
> > the program in order to determine data values in order to determine the
> > nature of the dependency.
> 
> Indeed, something like this does -not- carry a dependency from the
> memory_order_consume load to q:
> 
> >   char *p, *q;
> 
>   p = atomic_load_explicit(&gp, memory_order_consume);
>   q = gq + (intptr_t)p - (intptr_t)p;
> 
> If this was compiled with -O0, ARM and Power might well carry a
> dependency, but given any optimization, the assembly language would have
> no hint of any such dependency.  So I am not seeing any particular danger.

The above is a welcome relaxation over C11, since ARM doesn't even give
you ordering based off false data dependencies. My concern is more to do
with how this can be specified precisely without prohibiting honest compiler
and hardware optimisations.
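
To make the cancellation concrete, here is a minimal sketch in C. The
helper name cancel_dependency() is hypothetical and does not come from the
thread. It does the arithmetic on uintptr_t rather than on a char *, on the
assumption that this sidesteps the out-of-bounds pointer arithmetic raised
elsewhere in the discussion. The point is only that the result is gq for
every value of p, so an optimizer can drop p from the computation entirely.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread).  It mirrors
 *     q = gq + (intptr_t)p - (intptr_t)p;
 * but works on uintptr_t, where wraparound is well defined. */
char *cancel_dependency(char *gq, const char *p)
{
    uintptr_t bits = (uintptr_t)(const void *)p;
    uintptr_t base = (uintptr_t)(void *)gq;

    /* The two uses of p cancel, so the result is gq whatever p holds. */
    return (char *)(void *)(base + bits - bits);
}
```

Because the returned value never varies with p, there is nothing left in
the generated code for the hardware to order against.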

Out of interest, how do you tackle examples (4) and (5) of (assuming the
reads are promoted to consume loads)?:

  http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html

my understanding is that you permit both outcomes (I appreciate you're
not directly tackling out-of-thin-air, but treatment of dependencies
is heavily related).

> > You tackle this above by saying "to allow the compiler to precisely
> > determine the resulting value", but I can't see how that can be cleanly
> > fitted into something like the C language specification.
> 
> I am sure that there will be significant rework from where this document
> is to language appropriate for the standard.  Which is why I am glad
> that Jens is taking an interest in this, as he is particularly good at
> producing standards language.

Ok. I'm curious to see how that comes along.

> >  Even if it can,
> > then we'd need to reword the "?:" treatment that you currently have:
> > 
> >   "If a pointer is part of a dependency chain, and that pointer appears
> >in the entry of a ?: expression selected by the condition, then the
> >chain extends to the result."
> > 
> > which I think requires the state of the condition to be known statically
> > if we only want to extend the chain from the selected expression. In the
> > general case, wouldn't a compiler have to assume that the chain is
> > extended from both?
> 
> In practice, yes, if the compiler cannot determine which expression is
> selected, it must arrange for the dependency to be carried from either,
> depending on the run-time value of the condition.  But you would have
> to work pretty hard to create code that did not carry the dependencies
> as required, no?

I'm not sure... you'd require the compiler to perform static analysis of
loops to determine the state of the machine when they exit (if they exit!)
in order to show whether or not a dependency is carried to subsequent
operations. If it can't prove otherwise, it would have to assume that a
dependency *is* carried, and it's not clear to me how it would use this
information to restrict any subsequent dependency removing optimisations.

I guess that's one for the GCC folks.

> > Additionally, what about the following code?
> > 
> >   char *x = y ? z : z;
> > 
> > Does that extend a dependency chain from z to x? If so, I can imagine a
> > CPU breaking that in practice.
> 
> I am not seeing this.  I would expect the compiler to optimize to
> something like this:
> 
>   char *x = z;
> 
> How does this avoid carrying the dependency?  Or are you saying that
> ARM loses the dependency via a store to memory and a later reload?
> That would be a bit surprising...

I was thinking that the compiler would have to preserve the conditional
structure so that the dependency chain could be tracked correctly, but
if it can just assume that the dependency is carried regardless of y then
I agree that it doesn't matter for this code. All the CPU could do is
remove the conditional hazard.

> > > > Humans will understand, and compiler writers won't care. They will
> > > > either depend on hardware semant

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Andrew Haley
On 05/20/2015 04:46 PM, Will Deacon wrote:
> I'm not sure... you'd require the compiler to perform static analysis of
> loops to determine the state of the machine when they exit (if they exit!)
> in order to show whether or not a dependency is carried to subsequent
> operations. If it can't prove otherwise, it would have to assume that a
> dependency *is* carried, and it's not clear to me how it would use this
> information to restrict any subsequent dependency removing optimisations.

It'd just convert consume to acquire.

Andrew.
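
As an illustration of what promoting consume to acquire means at the source
level, here is a minimal sketch in C11. The names struct foo, gp, publish(),
read_consume() and read_acquire() are all hypothetical. The two readers
differ only in the memory order: the acquire version orders every later
access after the load, so it stays correct even when the compiler cannot
track the dependency chain, at the cost of a barrier or stronger load on
weakly ordered machines.

```c
#include <stdatomic.h>
#include <stddef.h>

struct foo {
    int a;
};

/* Hypothetical global pointer published by an updater. */
static _Atomic(struct foo *) gp;

/* Updater: initialise the structure, then publish it with release. */
void publish(struct foo *p, int value)
{
    p->a = value;
    atomic_store_explicit(&gp, p, memory_order_release);
}

/* Reader as written: relies on the dependency from the load to p->a. */
int read_consume(void)
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

    return p ? p->a : -1;
}

/* Reader after promotion: acquire orders all later accesses. */
int read_acquire(void)
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_acquire);

    return p ? p->a : -1;
}
```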



Is there a way to adjust alignment of DImode and DFmode?

2015-05-20 Thread H.J. Lu
By default, alignment of DImode and DFmode is set to 8 bytes.
Intel MCU psABI specifies alignment of DImode and DFmode
to be 4 bytes. I'd like to make get_mode_alignment to return
32 bits for DImode and DFmode.   Is there a way to adjust alignment
of DImode and DFmode via ADJUST_ALIGNMENT?

-- 
H.J.


Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-20 Thread Paul_Koning

> On May 20, 2015, at 1:00 PM, H.J. Lu  wrote:
> 
> By default, alignment of DImode and DFmode is set to 8 bytes.

When did that change?  I know it was 4 in the past, unless you specifically 
passed a compile switch to make it 8.

paul



Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-20 Thread Jakub Jelinek
On Wed, May 20, 2015 at 05:19:28PM +, paul_kon...@dell.com wrote:
> 
> > On May 20, 2015, at 1:00 PM, H.J. Lu  wrote:
> > 
> > By default, alignment of DImode and DFmode is set to 8 bytes.
> 
> When did that change?  I know it was 4 in the past, unless you specifically 
> passed a compile switch to make it 8.

For i?86 that is only field alignment (i.e. inside of structs).

Jakub
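
A quick way to see field alignment on a given target is to measure where a
double lands inside a struct. The sketch below is only a probe: struct probe
and double_field_alignment() are hypothetical names. My understanding is
that it yields 4 on 32-bit x86 with the default ABI and 8 on x86-64, but
those values are an assumption rather than something stated in this thread.

```c
#include <stddef.h>

/* Hypothetical probe: a char followed by a double, so the offset of d
 * equals the alignment the ABI applies to a double struct field. */
struct probe {
    char c;
    double d;
};

size_t double_field_alignment(void)
{
    return offsetof(struct probe, d);
}
```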


Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-20 Thread Paul_Koning

> On May 20, 2015, at 1:22 PM, Jakub Jelinek  wrote:
> 
> On Wed, May 20, 2015 at 05:19:28PM +, paul_kon...@dell.com wrote:
>> 
>>> On May 20, 2015, at 1:00 PM, H.J. Lu  wrote:
>>> 
>>> By default, alignment of DImode and DFmode is set to 8 bytes.
>> 
>> When did that change?  I know it was 4 in the past, unless you specifically 
>> passed a compile switch to make it 8.
> 
> For i?86 that is only field alignment (i.e. inside of structs).

I missed that.  Thanks.

paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> > > On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > > > If a pointer is part of a dependency chain, and if the values
> > > > added to or subtracted from that pointer cancel the pointer
> > > > value so as to allow the compiler to precisely determine the
> > > > resulting value, then the resulting value will not be part of
> > > > any dependency chain.  For example, if p is part of a dependency
> > > > chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > > > 
> > > > Seem reasonable?
> > > 
> > > Whilst I understand what you're saying (the ARM architecture makes these
> > > sorts of distinctions when calling out dependency-based ordering), it
> > > feels like we're dangerously close to defining the difference between a
> > > true and a false dependency. If we want to do this in the context of the
> > > C language specification, you run into issues because you need to evaluate
> > > the program in order to determine data values in order to determine the
> > > nature of the dependency.
> > 
> > Indeed, something like this does -not- carry a dependency from the
> > memory_order_consume load to q:
> > 
> > > char *p, *q;
> > 
> > p = atomic_load_explicit(&gp, memory_order_consume);
> > q = gq + (intptr_t)p - (intptr_t)p;
> > 
> > If this was compiled with -O0, ARM and Power might well carry a
> > dependency, but given any optimization, the assembly language would have
> > no hint of any such dependency.  So I am not seeing any particular danger.
> 
> The above is a welcome relaxation over C11, since ARM doesn't even give
> you ordering based off false data dependencies. My concern is more to do
> with how this can be specified precisely without prohibiting honest compiler
> and hardware optimisations.

That last is the challenge.  I believe that I am pretty close, but I am
sure that additional adjustment will be required.  Especially given that
we also need the memory model to be amenable to formal analysis.

> Out of interest, how do you tackle examples (4) and (5) of (assuming the
> reads are promoted to consume loads)?:
> 
>   http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
> 
> my understanding is that you permit both outcomes (I appreciate you're
> not directly tackling out-of-thin-air, but treatment of dependencies
> is heavily related).

Let's see...  #4 is as follows, given promotion to memory_order_consume
and (I am guessing) memory_order_relaxed:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(&x, 42, memory_order_relaxed);
else
  atomic_store_explicit(&x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with
the memory_order_consume load because both branches assign the same
value to "x".  This means that the compiler is within its rights to
optimize this into the following:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(&y, memory_order_consume);
atomic_store_explicit(&x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements,
so both the compiler and the CPU are within their rights to optimize
further as follows:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
atomic_store_explicit(&x, 42, memory_order_relaxed);
r2 = atomic_load_explicit(&y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware
is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
fine, as far as I am concerned.  Or at least something that can be
lived with.
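
The transformation of the second thread can be sketched as two ordinary C
functions. The function names are hypothetical, and the sketch shows only
single-threaded equivalence, not the concurrent outcome: both versions leave
42 in x and return whatever was in y, which is why merging the identical
arms and then moving the store above the load is invisible to this thread.

```c
#include <stdatomic.h>

static atomic_int x, y;

/* Second thread of example #4 as written: both arms store 42 to x. */
int thread2_as_written(void)
{
    int r2 = atomic_load_explicit(&y, memory_order_consume);

    if (r2 == 42)
        atomic_store_explicit(&x, 42, memory_order_relaxed);
    else
        atomic_store_explicit(&x, 42, memory_order_relaxed);
    return r2;
}

/* After merging the identical arms and hoisting the store above the load. */
int thread2_reordered(void)
{
    atomic_store_explicit(&x, 42, memory_order_relaxed);
    return atomic_load_explicit(&y, memory_order_consume);
}
```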

On to #5:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(&x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered.  The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_cons

Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:54:51PM +0100, Andrew Haley wrote:
> On 05/20/2015 04:46 PM, Will Deacon wrote:
> > I'm not sure... you'd require the compiler to perform static analysis of
> > loops to determine the state of the machine when they exit (if they exit!)
> > in order to show whether or not a dependency is carried to subsequent
> > operations. If it can't prove otherwise, it would have to assume that a
> > dependency *is* carried, and it's not clear to me how it would use this
> > information to restrict any subsequent dependency removing optimisations.
> 
> It'd just convert consume to acquire.

It should not need to, actually.

Thanx, Paul



Re: optimization question

2015-05-20 Thread Martin Uecker

 mark maule :
> On 5/20/2015 3:27 AM, Richard Biener wrote:
> > On Mon, May 18, 2015 at 10:01 PM, mark maule  wrote:

> > The usual issue with this kind of behavior is out-of-bound accesses of
> > arrays in a loop
> > or invoking undefined behavior when signed integer operations wrap.
> >
> >
> > uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];
> >
> > and
> >
> >while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) &&
> > ...
> >   dgDestageOut = bs_destageData.outLun[ dgHandle ];
> >
> > looks like this might access outLun[BS_CFG_DRIVE_GROUPS] which is
> > out-of-bounds.
> >
> > Richard.
> 
> You are correct, and when I change outLun[] to be size 
> BS_CFG_DRIVE_GROUPS+1, the generated asm looks like it will account for 
> dgHandle in the while() loop.  I will pass this back to our development 
> team to get a proper fix.
> 
> Now, the followon:  Something in the compiler/optimizer recognized this 
> out of bounds situation - should a warning have been emitted? Or are 
> there ambiguities which make a warning generation here inappropriate?

Yes, ideally a compiler should emit a warning. C compilers traditionally
were not very good at this, but it turns out very recent versions of GCC
can do this:

test.c:14:23: warning: iteration 10u invokes undefined behavior 
[-Waggressive-loop-optimizations]
  dgDestageOut = outLun[ dgHandle ];
   ^
test.c:11:13: note: containing loop
   while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )


For this simplified test case:

#include <stdint.h>

#define BS_CFG_DRIVE_GROUPS 10
uint32_t dgDestageLimit = 0;
uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];

void test(void)
{
  int dgHandle = 0;

  while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )
  {
    uint32_t dgDestageOut;
    dgDestageOut = outLun[ dgHandle ];
    dgHandle++;
  }
}
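
For comparison, a loop that stays in bounds might look like the sketch
below. It is one possible shape, assuming the intent is to visit each
element of outLun exactly once; the fix actually needed in the original
code base is not known here. The function sum_out_lun() is hypothetical
and sums the entries only so that the loads are actually used.

```c
#include <stddef.h>
#include <stdint.h>

#define BS_CFG_DRIVE_GROUPS 10

uint32_t outLun[BS_CFG_DRIVE_GROUPS];

/* Hypothetical variant of test(): sums the array so the loads are used. */
uint32_t sum_out_lun(void)
{
    uint32_t total = 0;

    /* The bound is the element count, so the last index read is
     * BS_CFG_DRIVE_GROUPS - 1 and no access leaves the array. */
    for (size_t dgHandle = 0; dgHandle < BS_CFG_DRIVE_GROUPS; dgHandle++)
        total += outLun[dgHandle];
    return total;
}
```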


Martin


gcc-4.9-20150520 is now available

2015-05-20 Thread gccadmin
Snapshot gcc-4.9-20150520 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.9-20150520/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.9 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_9-branch 
revision 223463

You'll find:

 gcc-4.9-20150520.tar.bz2 Complete GCC

  MD5=f878eb14e8decde762c8aef8c1e231c7
  SHA1=75124e08cb6bb04c35f0f5f5e94d2bf669da4369

Diffs from 4.9-20150513 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-20 Thread Vladimir Makarov



On 20/05/15 04:17 AM, Ilya Enkovich wrote:
> On 19 May 11:22, Vladimir Makarov wrote:
>> On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
>>> 2015-05-06 17:18 GMT+03:00 Ilya Enkovich :
>>> Hi Vladimir,
>>>
>>> Could you please comment on this?
>>>
>> Ilya, I think that the idea is worth trying but results might be
>> mixed.  It is hard to say until you actually try it (as an example, Jan
>> implemented -fpmath=both and it looks like a pretty good idea, at least
>> to me, but when I checked SPEC2000 the results were not so good even
>> with IRA/LRA).
>>
>> Long ago I did some experiments and found that spilling into SSE
>> would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
>> I also found that storing several scalar values into one SSE reg and
>> extracting them when you need to do some (fp) arithmetic would be
>> beneficial for AMD but not for Intel CPUs.  In the literature, the more
>> general approach is called a bitwise register allocator.  Actually it
>> would be a pretty big IRA/LRA project from which some targets might
>> benefit.
>
> I suspect such things are not trivially done in IRA/LRA and want to make it
> an independent optimization because its application seems to be quite narrow.

Yes, that is true.  The complications and implementation complexity will
probably be very high in this project and positive results are not
certain.  So the project might have little value.

>> As for the wrong code, it is hard for me to say anything w/o RA
>> dumps.  If you send me the dump (-fira-verbose=16), I might say more
>> about what is going on.
>
> Here are some dumps from my reproducer.  The problematic register is r108.

Thanks.  To me it looks like an inheritance bug.  It is really hard to
fix the bug w/o the source code.  Could you send me your patch so that
I can debug RA with it and investigate further?