Spurious register spill with volatile function argument

2016-03-26 Thread Michael Clark
It seems I had misused volatile. Removing 'volatile' from the function argument 
of test_0 eliminates the spill through the stack.

I added volatile because I was trying to avoid the compiler optimising away the 
call to test_0 (as it has no side effects), but it appears that volatile was 
unnecessary here and a misuse of the qualifier (which is intended to indicate 
that storage may change outside the control of the compiler). However, it is an 
interesting case, as register arguments don't have storage.

GCC and Clang folk, any ideas on why there is a stack spill for a volatile 
register argument passed in esi? Does volatile force the argument to have 
storage allocated on the stack? Is this a corner case in the C standard? In the 
x86_64 calling convention this argument only has a register, so technically it 
can't change outside the control of the C "virtual machine", and volatile has a 
vague meaning here. This looks like a case of interpreting the C standard in 
such a way as to make sure that a volatile argument "can be changed" outside 
the control of the C "virtual machine" by explicitly giving it a storage 
location on the stack. I think volatile scalar arguments are a special case, 
and the volatile qualifier shouldn't widen the scope beyond the register unless 
the value actually *needs* storage to spill. This is not a volatile 
stack-scoped variable unless the C standard treats ABI register parameters as 
actually having 'storage', so this is questionable. Maybe I should have gotten 
a warning, or the volatile type qualifier on a scalar register argument should 
have been ignored.

volatile for scalar function arguments seems to mean: “make this volatile and 
subject to change outside of the compiler” rather than being a qualifier for 
its storage (which is a register).

# gcc
test_0:
 mov DWORD PTR [rsp-4], esi
 mov ecx, DWORD PTR [rsp-4]
 mov eax, edi
 cdq
 idiv ecx
 mov eax, edx
 ret

# clang
test_0:
mov dword ptr [rsp - 4], esi
xor edx, edx
mov eax, edi
div dword ptr [rsp - 4]
mov eax, edx
ret

/* Test program compiled on x86_64 with: cc -O3 -fomit-frame-pointer 
-masm=intel -S test.c -o test.S  */

#include <stdio.h>
#include <limits.h>

static const int p = 8191;
static const int s = 13;

int __attribute__ ((noinline)) test_0(unsigned int k, volatile int p)
{
 return k % p;
}

int __attribute__ ((noinline)) test_1(unsigned int k)
{
 return k % p;
}

int __attribute__ ((noinline)) test_2(unsigned int k)
{
 int i = (k&p) + (k>>s);
 i = (i&p) + (i>>s);
 if (i>=p) i -= p;
 return i;
}

int main()
{
 test_0(1, 8191); /* control */
 for (int i = INT_MIN; i < INT_MAX; i++) {
 int r1 = test_1(i), r2 = test_2(i);
 if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
 }
}
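
For what it's worth, one alternative I could have used to keep the call to 
test_0 from being optimised away, instead of qualifying the parameter, is a 
volatile sink object (a sketch, untested; the 'sink' variable is not part of 
the original test, and test_0's second parameter would no longer be volatile). 
The store to a volatile object is an observable side effect, so it should keep 
the call live:

volatile int sink;

int main()
{
 sink = test_0(1, 8191); /* control; the volatile store keeps the call live */
 for (int i = INT_MIN; i < INT_MAX; i++) {
  int r1 = test_1(i), r2 = test_2(i);
  if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
 }
}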

> On 27 Mar 2016, at 2:32 PM, Andrew Waterman  wrote:
> 
> It would be good to figure out how to get rid of the spurious register spills.
> 
> The strength reduction optimization isn't always profitable on Rocket,
> as it increases instruction count and code size.  The divider has an
> early out and for small numbers is quite fast.
> 
> On Fri, Mar 25, 2016 at 5:43 PM, Michael Clark  wrote:
>> Now considering I have no idea how many cycles it takes for an integer 
>> divide on the Rocket so the optimisation may not be a win.
>> 
>> Trying to read MuDiv in multiplier.scala, and will at some point run some 
>> timings in the cycle-accurate simulator.
>> 
>> In either case, the spurious stack moves emitted by GCC are curious...
>> 
>>> On 26 Mar 2016, at 9:42 AM, Michael Clark  wrote:
>>> 
>>> Hi All,
>>> 
>>> I have found an interesting case where an optimisation is not being applied 
>>> by GCC on RISC-V. And also some strange assembly output from GCC on RISC-V.
>>> 
>>> Both GCC and Clang appear to optimise division by a constant Mersenne prime 
>>> on x86_64 however GCC on RISC-V is not applying this optimisation.
>>> 
>>> See test program and assembly output for these platforms:
>>> 
>>> * GCC -O3 on RISC-V
>>> * GCC -O3 on x86_64
>>> * LLVM/Clang -O3 on x86_64
>>> 
>>> Another strange observation is GCC on RISC-V is moving a1 to a5 via a stack 
>>> store followed by a stack load. Odd? GCC 5 also seems to be doing odd stuff 
>>> with stack ‘moves' on x86_64, moving esi to ecx via the stack (I think 
>>> recent x86 micro-architecture treats tip of the stack like an extended 
>>> register file so this may only have a small penalty on x86).
>>> 
>>> See GCC on RISC-V is emitting this:
>>> 
>>> test_0:
>>>  add sp,sp,-16
>>

Re: coding question

2018-12-11 Thread Michael Clark
Hi Cesar,

> On 12/12/2018, at 10:31 AM, Moreno, Cesar A  wrote:
> 
> 
> 
> HOW do I malloc or new an array of complex numbers in GNU C++ code?
> MALLOC does not work, how do I use MALLOC correctly or NEW?
> 
> I made a struct called complex (for a complex number with realp and imagp)
> and I want to make an array of 128 locations (where each location includes
> a realp and imagp)

Use vector and an indexed loop over the container, and don't hard-code a 
constant for the cardinality of a set; the compiler knows how to do this 
safely, e.g.:

struct ℂ { float r; float i; };
std::vector<ℂ> v = { { 1, 0 }, { 0, 1 } };
for (size_t i = 0; i < v.size(); i++) {
   printf("%zu: (%f, %f)\n", i, v[i].r, v[i].i);
}

I haven't compiled this so it might have some coding bugs; also your local type 
aliases may differ from mine.
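
And to answer the literal malloc/new question: a minimal sketch (untested; 
'complex' and 'tdin' are the names from the question quoted below) using new[] 
rather than malloc, which needs no cast and value-initialises the elements:

#include <cstdio>

struct complex { float realp; float imagp; };

int main()
{
    complex *tdin = new complex[128](); // value-initialised to 0.0
    for (int k = 0; k < 128; k++)
        std::printf("%f and %f\n", tdin[k].realp, tdin[k].imagp);
    delete [] tdin;
    return 0;
}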

Michael

> Struct complex
> {
> Float  realp;
> Float imagp;
> }
> 
> //  tdin  is a pointer to an array of complex numbers
> complex*  tdin;
> 
> 
> //  form the array with  MALLOC  or whatever works in  GNU C++
> tdin =  (complex*)  malloc ( 128 * sizeof (complex) );
> 
> 
> then I can print the contents of the array with
> 
> for (int k;  k<128; k++)
> {  cout << tdin[k].realp  <<  " and "  <<  tdin[k].imagp  << "endl";
> 



x86 branches vs conditional moves

2017-07-07 Thread Michael Clark
Hi,

Curious about this codegen:

- https://godbolt.org/g/5XxP5S

Why does gcc branch on _Bool but emit a conditional move for an integer? Can it 
emit cmovne instead of branching? I am also curious where one would change this, 
to learn about GCC internals.

It's not a bug, but it is a performance issue [1].

I was just curious under which conditions the ternary operator is lowered to 
cmov on x86 and found this difference in lowering.

Michael

[1] https://github.com/xiadz/cmov


#include 

extern int a;
extern int b;
extern int c;
extern _Bool C;

int select_int()
{
return c ? a : b;
}

_Bool select_bool()
{
return C ? a : b;
}

_Bool a_bool()
{
return 2;
}

select_int():
mov eax, DWORD PTR c[rip]
test eax, eax
mov eax, DWORD PTR a[rip]
cmove   eax, DWORD PTR b[rip]
ret
select_bool():
cmp BYTE PTR C[rip], 0
jne .L8
mov eax, DWORD PTR b[rip]
test eax, eax
setne   al
ret
.L8:
mov edx, DWORD PTR a[rip]
test edx, edx
setne   al
ret
a_bool():
mov eax, 1
ret
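
For reference, a hand-written sketch (untested, and not something gcc currently 
emits) of the cmov form asked about for select_bool, mirroring the cmove 
sequence gcc already emits for select_int:

select_bool():
mov eax, DWORD PTR a[rip]
cmp BYTE PTR C[rip], 0
cmove   eax, DWORD PTR b[rip]
test eax, eax
setne   al
ret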

Redundant loads for bitfield accesses

2017-08-16 Thread Michael Clark
Hi,

Is there any reason why three loads are issued for these bit-field accesses, 
given that two of the loads are bytes and one is a half, and the compiler 
appears to know the structure is aligned on a halfword boundary? Secondly, the 
riscv code uses a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with 
-Os the riscv code size is the same, but the schedule is less than optimal, 
i.e. the third load is issued much later.

- https://cx.rv8.io/g/2YDLTA

code:

struct foo {
  unsigned int a : 5;
  unsigned int b : 5;
  unsigned int c : 5;
};

unsigned int proc_foo(struct foo *p)
{
return p->a + p->b + p->c;
}

riscv asm:

proc_foo(foo*):
  lhu a3,0(a0)
  lbu a4,0(a0)
  lbu a5,1(a0)
  srliw a3,a3,5
  andi a0,a4,31
  srli a5,a5,2
  andi a4,a3,31
  addw a0,a0,a4
  andi a5,a5,31
  add a0,a0,a5
  ret

x86_64 asm:

proc_foo(foo*):
  movzx edx, BYTE PTR [rdi]
  movzx eax, WORD PTR [rdi]
  mov ecx, edx
  shr ax, 5
  and eax, 31
  and ecx, 31
  lea edx, [rcx+rax]
  movzx eax, BYTE PTR [rdi+1]
  shr al, 2
  and eax, 31
  add eax, edx
  ret

hand coded riscv asm:

proc_foo(foo*):
  lhu a1,0(a0)
  srli a2,a1,5
  srli a3,a1,10
  andi a0,a1,31
  andi a2,a2,31
  andi a3,a3,31
  add a0,a0,a2
  add a0,a0,a3
  ret
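
For comparison, a C sketch of the same single-load extraction (untested; 
proc_foo_manual is a made-up name, and it assumes GCC's little-endian LSB-first 
bit-field layout for struct foo above, which is the layout the asm above relies 
on):

#include <string.h>

unsigned int proc_foo_manual(const struct foo *p)
{
  unsigned short w;                /* a, b and c all live in the low 15 bits */
  memcpy(&w, p, sizeof w);         /* one halfword load */
  return (w & 31) + ((w >> 5) & 31) + ((w >> 10) & 31);
}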

Michael

Re: Redundant loads for bitfield accesses

2017-08-16 Thread Michael Clark
Here’s a more extreme example:

- https://cx.rv8.io/g/2HWQje

The bitfield type is unsigned int, so one or two 32-bit loads should suffice 
(depending on register pressure). GCC issues an lw at some point in the asm.

struct foo {
  unsigned int a : 3;
  unsigned int b : 3;
  unsigned int c : 3;
  unsigned int d : 3;
  unsigned int e : 3;
  unsigned int f : 3;
  unsigned int g : 3;
  unsigned int h : 3;
  unsigned int i : 3;
  unsigned int j : 3;
};

unsigned int proc_foo(struct foo *p)
{
return p->a + p->b + p->c + p->d + p->d + p->e + p->f + p->g + p->h + p->i 
+ p->j;
}

> On 17 Aug 2017, at 10:29 AM, Michael Clark  wrote:
> 
> Hi,
> 
> Is there any reason for 3 loads being issued for these bitfield accesses, 
> given two of the loads are bytes, and one is a half; the compiler appears to 
> know the structure is aligned at a half word boundary. Secondly, the riscv 
> code is using a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with 
> -Os the riscv code size is the same, but the schedule is less than optimal. 
> i.e. the 3rd load is issued much later.
> 
> - https://cx.rv8.io/g/2YDLTA
> 
> code:
> 
>   struct foo {
> unsigned int a : 5;
> unsigned int b : 5;
> unsigned int c : 5;
>   };
> 
>   unsigned int proc_foo(struct foo *p)
>   {
>   return p->a + p->b + p->c;
>   }
> 
> riscv asm:
> 
>   proc_foo(foo*):
> lhu a3,0(a0)
> lbu a4,0(a0)
> lbu a5,1(a0)
> srliw a3,a3,5
> andi a0,a4,31
> srli a5,a5,2
> andi a4,a3,31
> addw a0,a0,a4
> andi a5,a5,31
> add a0,a0,a5
> ret
> 
> x86_64 asm:
> 
>   proc_foo(foo*):
> movzx edx, BYTE PTR [rdi]
> movzx eax, WORD PTR [rdi]
> mov ecx, edx
> shr ax, 5
> and eax, 31
> and ecx, 31
> lea edx, [rcx+rax]
> movzx eax, BYTE PTR [rdi+1]
> shr al, 2
> and eax, 31
> add eax, edx
> ret
> 
> hand coded riscv asm:
> 
>   proc_foo(foo*):
> lhu a1,0(a0)
> srli a2,a1,5
> srli a3,a1,10
> andi a0,a1,31
> andi a2,a2,31
> andi a3,a3,31
> add a0,a0,a2
> add a0,a0,a3
> ret
> 
> Michael



Re: Redundant loads for bitfield accesses

2017-08-16 Thread Michael Clark

> On 17 Aug 2017, at 10:41 AM, Andrew Pinski  wrote:
> 
> On Wed, Aug 16, 2017 at 3:29 PM, Michael Clark  wrote:
>> Hi,
>> 
>> Is there any reason for 3 loads being issued for these bitfield accesses, 
>> given two of the loads are bytes, and one is a half; the compiler appears to 
>> know the structure is aligned at a half word boundary. Secondly, the riscv 
>> code is using a mixture of 32-bit and 64-bit adds and shifts. Thirdly, with 
>> -Os the riscv code size is the same, but the schedule is less than optimal. 
>> i.e. the 3rd load is issued much later.
> 
> 
> Well one thing is most likely SLOW_BYTE_ACCESS is set to 0.  This
> forces byte access for bit-field accesses.  The macro is misnamed now
> as it only controls bit-field accesses right now (and one thing in
> dojump dealing with comparisons with and and a constant but that might
> be dead code).  This should allow for you to get the code in hand
> written form.
> I suspect SLOW_BYTE_ACCESS support should be removed and be assumed to
> be 1 but I have not time to look into each backend to see if it is
> correct to do or not.  Maybe it is wrong for AVR.

Thanks, that’s interesting.

So I should try compiling the riscv backend with SLOW_BYTE_ACCESS = 1? Less 
risk than making a change to x86.

This is clearly distinct from slow unaligned access. It seems odd that -O3 
doesn't coalesce the loads even when byte access is cheap, as one would expect 
the cost of the additional loads to outweigh the fact that byte accesses are 
not slow, unless something weird is happening with the cost model for loads of 
different widths.

x86 could be helped here too. I guess subsequent loads will be served from L1, 
but that's not really an excuse for this codegen when the element is 32-bit 
aligned (unsigned int).

Bit-field struct member sign extension pattern results in redundant

2017-08-17 Thread Michael Clark
Sorry I had to send again as my Apple mailer is munging emails. I’ve disabled 
RTF.


This one is quite interesting:

- https://cx.rv8.io/g/WXWMTG

It's another target independent bug. x86 is using an LEA followed by SAR trick 
with a 3-bit shift. Surely SHL 27, SAR 27 would suffice. In any case RISC-V 
seems like a nice target to try to fix this codegen for, as it's less risk than 
attempting a fix in x86 ;-)

- https://github.com/riscv/riscv-gcc/issues/89

code:

template <typename T, unsigned B>
inline T signextend(const T x)
{
struct {T x:B;} s;
return s.x = x;
}

int sx5(int x) {
return signextend<int,5>(x);
}

riscv asm:

sx5(int):
  slliw a0,a0,3
  slliw a0,a0,24
  sraiw a0,a0,24
  sraiw a0,a0,3
  ret

hand coded riscv asm

sx5(int):
  slliw a0,a0,27
  sraiw a0,a0,27
  ret

x86 asm:

sx5(int):
  lea eax, [0+rdi*8]
  sar al, 3
  movsx eax, al
  ret

hand coded x86 asm (no worse because the sar depends on the lea)

sx5(int):
  shl edi, 27
  sar edi, 27
  movsx eax, dl
  ret

Re: Bit-field struct member sign extension pattern results in redundant

2017-08-17 Thread Michael Clark

> On 18 Aug 2017, at 10:41 AM, Andrew Pinski  wrote:
> 
> On Thu, Aug 17, 2017 at 3:29 PM, Michael Clark  wrote:
>> Sorry I had to send again as my Apple mailer is munging emails. I’ve 
>> disabled RTF.
>> 
>> 
>> This one is quite interesting:
>> 
>> - https://cx.rv8.io/g/WXWMTG
>> 
>> It’s another target independent bug. x86 is using some LEA followed by SAR 
>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case 
>> RISC-V seems like a nice target to try to fix this codegen for, as its less 
>> risk than attempting a fix in x86 ;-)
>> 
>> - https://github.com/riscv/riscv-gcc/issues/89
>> 
>> code:
>> 
>>template 
>>inline T signextend(const T x)
>>{
>>struct {T x:B;} s;
>>return s.x = x;
>>}
>> 
>>int sx5(int x) {
>>return signextend(x);
>>}
> 
> on AARCH64 I get:
>sbfiz   w0, w0, 3, 5
>asr w0, w0, 3
>ret

So it is a bug on arm too? And can it be done with one sbfiz instruction? 
(assuming I'm understanding sbfiz correctly from my first reading) e.g.

sbfiz   w0, w0, 0, 2
ret

> Which is:
> (insn 7 6 8 2 (set (reg:SI 80)
>(sign_extend:SI (ashift:QI (reg:QI 0 x0 [ x+3 ])
>(const_int 3 [0x3] t.cc:5 557 {*extendsi_ashlqi}
> (expr_list:REG_DEAD (reg:SI 0 x0 [ x ])
>(nil)))
> 
> (insn 14 9 17 2 (set (reg/i:SI 0 x0)
>(ashiftrt:SI (reg:SI 80)
>(const_int 3 [0x3]))) t.cc:10 530 {*aarch64_ashr_sisd_or_int_si3}
> (expr_list:REG_DEAD (reg:SI 80)
>(nil)))
> 
> What I suspect is we get a QImode register in there and that is what
> is happening to both x86 and riscv.

Curious. Thanks.

I still need to learn more about gcc pattern matching and lowering.

I actually have this exact code pattern, using a struct member to do arbitrary 
width sign extension, as it's the first method for arbitrary bit-width 
extension one finds on Google and on the Stanford Bit Twiddling Hacks page. I 
understand I'm better off changing my pattern to an explicit shift left and 
shift right, given I'm now aware this is unambiguously the optimal lowering for 
ISAs with 1-cycle shifters and no field extract instructions; however, given 
others are likely to stumble on the Stanford page, I wouldn't be surprised if 
I'm not the only one using this pattern:

- https://graphics.stanford.edu/~seander/bithacks.html#FixedSignExtend
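
For reference, here is a sketch of the explicit-shift variant I mean (untested; 
signextend_shift is a made-up name, and it assumes a 32-bit int plus GCC-like 
implementation-defined behaviour for the unsigned-to-int conversion and the 
arithmetic right shift of negative values):

template <unsigned B>
inline int signextend_shift(int x)
{
    /* move the field's sign bit up to bit 31, then arithmetic-shift it back */
    return (int)((unsigned)x << (32 - B)) >> (32 - B);
}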

The x86 sequence is kind of a trick and it's no better off in instruction count 
by using two shifts. I'm also annoyed that BZHI in BMI2 is an arbitrary width 
zero extend rather than a bit extend. I'd like a BEXT-like instruction that 
copies a bit at offset n, but given it is only a two-instruction saving it 
should also do a right shift so it can be used for signed and unsigned field 
extract (with a bit-copy or bit-zero operand bit).

Thanks,
Michael.



Re: Bit-field struct member sign extension pattern results in redundant

2017-08-17 Thread Michael Clark

> On 18 Aug 2017, at 11:13 AM, Michael Clark  wrote:
> 
> So it is a bug on arm too? and can be done with one sbfiz instruction? 
> (assuming I’m understand sbfiz from my first reading) e.g.
> 
>   sbfiz   w0, w0, 0, 2
>ret

Getting my 3’s and 5’s swapped. Confused by gcc.

sbfiz   w0, w0, 0, 4

>> Which is:
>> (insn 7 6 8 2 (set (reg:SI 80)
>>   (sign_extend:SI (ashift:QI (reg:QI 0 x0 [ x+3 ])
>>   (const_int 3 [0x3] t.cc:5 557 {*extendsi_ashlqi}
>>(expr_list:REG_DEAD (reg:SI 0 x0 [ x ])
>>   (nil)))
>> 
>> (insn 14 9 17 2 (set (reg/i:SI 0 x0)
>>   (ashiftrt:SI (reg:SI 80)
>>   (const_int 3 [0x3]))) t.cc:10 530 {*aarch64_ashr_sisd_or_int_si3}
>>(expr_list:REG_DEAD (reg:SI 80)
>>   (nil)))
>> 
>> What I suspect is we get a QImode register in there and that is what
>> is happening to both x86 and riscv.
> 



Re: Bit-field struct member sign extension pattern results in redundant

2017-08-18 Thread Michael Clark

> On 18 Aug 2017, at 10:41 PM, Gabriel Paubert  wrote:
> 
> On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
>> Sorry I had to send again as my Apple mailer is munging emails. I’ve 
>> disabled RTF.
>> 
>> 
>> This one is quite interesting:
>> 
>> - https://cx.rv8.io/g/WXWMTG
>> 
>> It’s another target independent bug. x86 is using some LEA followed by SAR 
>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case 
>> RISC-V seems like a nice target to try to fix this codegen for, as its less 
>> risk than attempting a fix in x86 ;-)
>> 
>> - https://github.com/riscv/riscv-gcc/issues/89
>> 
>> code:
>> 
>>  template 
>>  inline T signextend(const T x)
>>  {
>>  struct {T x:B;} s;
>>  return s.x = x;
>>  }
>> 
>>  int sx5(int x) {
>>  return signextend(x);
>>  }
>> 
>> riscv asm:
>> 
>>  sx5(int):
>>slliw a0,a0,3
>>slliw a0,a0,24
>>sraiw a0,a0,24
>>sraiw a0,a0,3
>>ret
>> 
>> hand coded riscv asm
>> 
>>  sx5(int):
>>slliw a0,a0,27
>>sraiw a0,a0,27
>>ret
>> 
>> x86 asm:
>> 
>>  sx5(int):
>>lea eax, [0+rdi*8]
>>sar al, 3
>>movsx eax, al
>>ret
>> 
>> hand coded x86 asm (no worse because the sar depends on the lea)
>> 
>>  sx5(int):
>>shl edi, 27
>>sar edi, 27
>>movsx eax, dl
> 
> Huh? dl is not a subreg of edi!
> 
> s/edi/edx/ and it may work.
> 
> dil can also be used, but only on 64 bit.

Sorry I meant dil on x86-64. I was sure that it was possible to extend into 
another register. I have not done much i386 asm so I am unaware of the 
constraints. Can the source and dest registers for movsx not differ on i386? I 
thought they could.

In any case, the plot thickens…

I believe we have bugs on both RISC-V and Aarch64.

I found that it at least appears to be transitioning to a char or short, as the 
break is at 24 or 16 depending on the width, and widths over 16 work as one 
would expect.

Here is an updated test program: https://cx.rv8.io/g/M9ewNf

template <typename T, unsigned B>
inline T signextend(const T x)
{
  struct {T x:B;} s;
  return s.x = x;
}

int sx3(int x) { return signextend<int,3>(x); }
int sx5(int x) { return signextend<int,5>(x); }
int sx11(int x) { return signextend<int,11>(x); }
int sx14(int x) { return signextend<int,14>(x); }
int sx19(int x) { return signextend<int,19>(x); }

I filed a bug on riscv-gcc but I think it is target independent code given 
there appears to be an issue on Aarch64. AFAICT, Aarch64 should generate a 
single sbfx for all of the test functions.

- https://github.com/riscv/riscv-gcc/issues/89

Should I file a bug on GCC bugzilla given it looks to be target independent?

On RISC-V the codegen is much more obviously wrong, but essentially the same 
thing is happening on Aarch64, just with one additional instruction instead of 
two.

sx3(int):
  slliw a0,a0,5
  slliw a0,a0,24
  sraiw a0,a0,24
  sraiw a0,a0,5
  ret
sx5(int):
  slliw a0,a0,3
  slliw a0,a0,24
  sraiw a0,a0,24
  sraiw a0,a0,3
  ret
sx11(int):
  slliw a0,a0,5
  slliw a0,a0,16
  sraiw a0,a0,16
  sraiw a0,a0,5
  ret
sx14(int):
  slliw a0,a0,2
  slliw a0,a0,16
  sraiw a0,a0,16
  sraiw a0,a0,2
  ret
sx19(int):
  slliw a0,a0,13
  sraiw a0,a0,13
  ret




Re: Bit-field struct member sign extension pattern results in redundant

2017-08-18 Thread Michael Clark

> On 18 Aug 2017, at 10:56 PM, Michael Clark  wrote:
> 
>> 
>> On 18 Aug 2017, at 10:41 PM, Gabriel Paubert  wrote:
>> 
>> On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
>>> Sorry I had to send again as my Apple mailer is munging emails. I’ve 
>>> disabled RTF.
>>> 
>>> 
>>> This one is quite interesting:
>>> 
>>> - https://cx.rv8.io/g/WXWMTG
>>> 
>>> It’s another target independent bug. x86 is using some LEA followed by SAR 
>>> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case 
>>> RISC-V seems like a nice target to try to fix this codegen for, as its less 
>>> risk than attempting a fix in x86 ;-)
>>> 
>>> - https://github.com/riscv/riscv-gcc/issues/89
>>> 
>>> code:
>>> 
>>> template 
>>> inline T signextend(const T x)
>>> {
>>> struct {T x:B;} s;
>>> return s.x = x;
>>> }
>>> 
>>> int sx5(int x) {
>>> return signextend(x);
>>> }
>>> 
>>> riscv asm:
>>> 
>>> sx5(int):
>>>   slliw a0,a0,3
>>>   slliw a0,a0,24
>>>   sraiw a0,a0,24
>>>   sraiw a0,a0,3
>>>   ret
>>> 
>>> hand coded riscv asm
>>> 
>>> sx5(int):
>>>   slliw a0,a0,27
>>>   sraiw a0,a0,27
>>>   ret
>>> 
>>> x86 asm:
>>> 
>>> sx5(int):
>>>   lea eax, [0+rdi*8]
>>>   sar al, 3
>>>   movsx eax, al
>>>   ret
>>> 
>>> hand coded x86 asm (no worse because the sar depends on the lea)
>>> 
>>> sx5(int):
>>>   shl edi, 27
>>>   sar edi, 27
>>>   movsx eax, dl
>> 
>> Huh? dl is not a subreg of edi!
>> 
>> s/edi/edx/ and it may work.
>> 
>> dil can also be used, but only on 64 bit.

In any case it is more straightforward given we are just dealing with an int 
bitfield member here, so a regular move would suffice. The mov gets retired in 
rename (movsx and movzx can both be retired in rename, I think). The same 
pattern could be used for all 5 function examples, with predictable 
performance, versus 5 different solutions for the 5 different bitfield 
positions (it's actually very impressive that the pattern matching on x86 finds 
5 unique ways to lower this code pattern). Pardon my Intel syntax. e.g.

sx5(int):
 shl edi, 27
 sar edi, 27
 mov eax, edi

However I’m more interested in what the fix would be for RISC-V and Aarch64. 
It’s less obvious how to fix it on x86-64 because there are too many different 
ways. On RISC-V there is only one obvious way and that is to not artificially 
narrow the type when we are extracting from an int aligned struct bit field 
member to an int. Likewise on Aarch64, one sbfiz instruction.

> Sorry I meant dil on x86-64. I was sure that it was possible to extend into 
> another register. I have not done much i386 asm so I am unaware of the 
> constraints. Can the source and dest registers for movsx not differ on i386? 
> I thought they could.
> 
> In any case, the plot thickens…
> 
> I believe we have bugs on both RISC-V and Aarch64.
> 
> I found that it at least appears like it is transitioning to a char or short 
> as the break is at 24 and 16 depending on the width, and values over 16 work 
> as one would expect.
> 
> Here is an updated test program: https://cx.rv8.io/g/M9ewNf
> 
>   template 
>   inline T signextend(const T x)
>   {
> struct {T x:B;} s;
> return s.x = x;
>   }
> 
>   int sx3(int x) { return signextend(x); }
>   int sx5(int x) { return signextend(x); }
>   int sx11(int x) { return signextend(x); }
>   int sx14(int x) { return signextend(x); }
>   int sx19(int x) { return signextend(x); }
> 
> I filed a bug on riscv-gcc but I think it is target independent code given 
> there appears to be an issue on Aarch64. AFAICT, Aarch64 should generate a 
> single sbfx for all of the test functions.
> 
> - https://github.com/riscv/riscv-gcc/issues/89
> 
> Should I file a bug on GCC bugzilla given it looks to be target independent?
> 
> On RISC-V, the codegen is much more obviously wrong, but essentially the same 
> thing is happening on Aarch64 but there is only one additional instruction 
> instead of two.
> 
>   sx3(int):
> slliw a0,a0,5
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,5
> ret
>   sx5(int):
> slliw a0,a0,3
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,3
> ret
>   sx11(int):
> slliw a0,a0,5
> slliw a0,a0,16
> sraiw a0,a0,16
> sraiw a0,a0,5
> ret
>   sx14(int):
> slliw a0,a0,2
> slliw a0,a0,16
> sraiw a0,a0,16
> sraiw a0,a0,2
> ret
>   sx19(int):
> slliw a0,a0,13
> sraiw a0,a0,13
> ret



Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark
Dear GCC folk,

I have to say that GCC's -Os caught me by surprise after several years using 
Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and a half 
I have been working on RISC-V development and have been exclusively using GCC 
for RISC-V builds, and initially I was using -Os. After performing a 
qualitative/quantitative assessment I don't believe GCC's current -Os is 
particularly useful, at least for my needs, as it doesn't provide a 
commensurate saving in size given the sometimes quite huge drop in performance.

I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
frustration thread, as I think Apple’s documentation which presumably documents 
Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps using -O2 as a 
starting point) with the idea that the current -Os is renamed to -Oz.

-Oz
   (APPLE ONLY) Optimize for size, regardless of performance. -Oz
   enables the same optimization flags that -Os uses, but -Oz also
   enables other optimizations intended solely to reduce code size.
   In particular, instructions that encode into fewer bytes are
   preferred over longer instructions that execute in fewer cycles.
   -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
   -Oz employs the same inlining limits and avoids string instructions
   just like -Os.

-Os
   Optimize for size, but not at the expense of speed. -Os enables all
   -O2 optimizations that do not typically increase code size.
   However, instructions are chosen for best performance, regardless
   of size. To optimize solely for size on Darwin, use -Oz (APPLE
   ONLY).

I have recently been working on a benchmark suite to test a RISC-V JIT engine. 
I have performed all testing using GCC 7.1 as the baseline compiler, and during 
the process I have collected several performance metrics, some that are neutral 
to the JIT runtime environment. In particular I have made performance 
comparisons between -Os and -O3 on x86, along with capturing executable file 
sizes, dynamic retired instruction and micro-op counts for x86, dynamic retired 
instruction counts for RISC-V as well as dynamic register and instruction usage 
histograms for RISC-V, for both -Os and -O3.

See the Optimisation section for a charted performance comparison between -O3 
and -Os. There are dozens of other plots that show the differences between -Os 
and -O3.

- https://rv8.io/bench

The Geomean on x86 shows a 19% performance hit for -Os vs -O3. The Geomean of 
course smooths over some pathological cases where -Os performance is severely 
degraded versus -O3 without significant, or commensurate, savings in size.

I don’t currently have -O2 in my results however it seems like I should add -O2 
to the benchmark suite. If you take a look at the web page you’ll see that 
there is already a huge amount of data given we have captured dynamic register 
frequencies and dynamic instruction frequencies for -Os and -O3. The tables and 
charts are all generated by scripts so if there is interest I could add -O2. I 
can also pretty easily perform runs with new compiler versions as everything is 
completely automated. The biggest factor is that it currently takes 4 hours for 
a full run as we run all of the benchmarks in a simulator to capture dynamic 
register usage and dynamic instruction usage.

After looking at the results, one has to question the utility of -Os in its 
present form, and indeed question how it is actually used in practice, given 
the proportion of savings in executable size. After my assessment I would not 
recommend that anyone use -Os, because its savings in size are not 
proportionate to the loss in performance. I feel discouraged from using it 
after looking at the results. I really don't believe -Os makes the right 
trade-offs, given that reducing icache pressure via smaller code can indeed 
lead to better performance.

I also wonder whether the -O2 level optimisations may be a good starting point 
for a more useful -Os, and how one would proceed towards selecting 
optimisations to add back to -Os to increase its usability; or alternatively, 
rename the current -Os to -Oz and make -Os an alias for -O2. A profile similar 
to -O2 would probably produce less shock for anyone who does quantitative 
performance analysis of -Os.

In fact there are some interesting issues for the RISC-V backend given the 
assembler performs RVC compression and GCC doesn’t really see the size of 
emitted instructions. It would be an interesting backend to investigate 
improving -Os presuming that a backend can opt in to various optimisations for 
a given optimisation level. RISC-V would gain most of its size and runtime 
icache pressure reduction improvements by getting the highest frequency 
registers allocated within the 8 register set that is accessi

Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark

> On 26 Aug 2017, at 8:39 PM, Andrew Pinski  wrote:
> 
> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark  wrote:
>> Dear GCC folk,
>> I have to say that’s GCC’s -Os caught me by surprise after several years 
>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year 
>> and a half I have been working on RISC-V development and have been 
>> exclusively using GCC for RISC-V builds, and initially I was using -Os. 
>> After performing a qualitative/quantitative assessment I don’t believe GCC’s 
>> current -Os is particularly useful, at least for my needs as it doesn’t 
>> provide a commensurate saving in size given the sometimes quite huge drop in 
>> performance.
>> 
>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
>> frustration thread, as I think Apple’s documentation which presumably 
>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
>> using -O2 as a starting point) with the idea that the current -Os is renamed 
>> to -Oz.
>> 
>>-Oz
>>   (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>   enables the same optimization flags that -Os uses, but -Oz also
>>   enables other optimizations intended solely to reduce code 
>> size.
>>   In particular, instructions that encode into fewer bytes are
>>   preferred over longer instructions that execute in fewer 
>> cycles.
>>   -Oz on Darwin is very similar to -Os in FSF distributions of 
>> GCC.
>>   -Oz employs the same inlining limits and avoids string 
>> instructions
>>   just like -Os.
>> 
>>-Os
>>   Optimize for size, but not at the expense of speed. -Os 
>> enables all
>>   -O2 optimizations that do not typically increase code size.
>>   However, instructions are chosen for best performance, 
>> regardless
>>   of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>   ONLY).
>> 
>> I have recently  been working on a benchmark suite to test a RISC-V JIT 
>> engine. I have performed all testing using GCC 7.1 as the baseline compiler, 
>> and during the process I have collected several performance metrics, some 
>> that are neutral to the JIT runtime environment. In particular I have made 
>> performance comparisons between -Os and -O3 on x86, along with capturing 
>> executable file sizes, dynamic retired instruction and micro-op counts for 
>> x86, dynamic retired instruction counts for RISC-V as well as dynamic 
>> register and instruction usage histograms for RISC-V, for both -Os and -O3.
>> 
>> See the Optimisation section for a charted performance comparison between 
>> -O3 and -Os. There are dozens of other plots that show the differences 
>> between -Os and -O3.
>> 
>>- https://rv8.io/bench
>> 
>> The Geomean on x86 shows a 19% performance hit for -Os vs -O3 on x86. The 
>> Geomean of course smooths over some pathological cases where -Os performance 
>> is severely degraded versus -O3 but not with significant, or commensurate 
>> savings in size.
> 
> 
> First let me put into some perspective on -Os usage and some history:
> 1) -Os is not useful for non-embedded users
> 2) the embedded folks really need the smallest code possible and
> usually will be willing to afford the performance hit
> 3) -Os was a mistake for Apple to use in the first place; they used it
> and then GCC got better for PowerPC to use the string instructions
> which is why -Oz was added :)
> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> 
> Comparing -O3 to -Os is not totally fair on x86 due to the many
> different instructions and encodings.
> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> big issue.
> I soon have a need to keep overall (bare-metal) application size down
> to just 256k.
> Micro-controllers are places where -Os matters the most.

Fair points.

- Size at all cost is useful for the embedded case where there is a restricted 
footprint.
- It’s fair to compare on RISC-V which has the RVC compressed ISA extension, 
which is conceptually similar to Thumb-2
- Understand renaming -Os to -Oz would cause a few downstream issues for those 
who expect size at all costs.
- There is an achievable use-case for good RVC compression and good performance 
on RISC-V

However the question remains: what options does one choose for size, but not 
size at the expense of speed? -O2 and an -mtune?

I’m probably interested in an -O2 with an -mtune that 

Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark
FYI - I’ve updated the stats to include -O2 in addition to -O3 and -Os:

- https://rv8.io/bench#optimisation

There are 57 plots and 31 tables. It’s quite a bit of data. It will be quite 
interesting to run these on new gcc releases to monitor changes.

The Geomean for -O2 is 0.98 of -O3 on x86-64. I probably need to add some 
tables that show file sizes per architecture side by side, versus the current 
grouping which is by optimisation level to allow comparisons between 
architectures. If I pivot the data, we can add file size ratios by optimisation 
level per architecture. Note: these are relatively small benchmark programs, 
however the stats are still interesting. I’m most interested in RISC-V register 
allocation at present.

-O2 does pretty well on file size compared to -O3, on all architectures. At a 
glance, the -O2 file sizes are slightly larger than the -Os file sizes but the 
performance increase is considerably more. I could perhaps show ratios of 
performance vs size between -O2 and -Os.

> On 26 Aug 2017, at 10:05 PM, Michael Clark  wrote:
> 
>> 
>> On 26 Aug 2017, at 8:39 PM, Andrew Pinski  wrote:
>> 
>> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark  wrote:
>>> Dear GCC folk,
>>> I have to say that’s GCC’s -Os caught me by surprise after several years 
>>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year 
>>> and a half I have been working on RISC-V development and have been 
>>> exclusively using GCC for RISC-V builds, and initially I was using -Os. 
>>> After performing a qualitative/quantitative assessment I don’t believe 
>>> GCC’s current -Os is particularly useful, at least for my needs as it 
>>> doesn’t provide a commensurate saving in size given the sometimes quite 
>>> huge drop in performance.
>>> 
>>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
>>> frustration thread, as I think Apple’s documentation which presumably 
>>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
>>> using -O2 as a starting point) with the idea that the current -Os is 
>>> renamed to -Oz.
>>> 
>>>   -Oz
>>>  (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>>  enables the same optimization flags that -Os uses, but -Oz also
>>>  enables other optimizations intended solely to reduce code 
>>> size.
>>>  In particular, instructions that encode into fewer bytes are
>>>  preferred over longer instructions that execute in fewer 
>>> cycles.
>>>  -Oz on Darwin is very similar to -Os in FSF distributions of 
>>> GCC.
>>>  -Oz employs the same inlining limits and avoids string 
>>> instructions
>>>  just like -Os.
>>> 
>>>   -Os
>>>  Optimize for size, but not at the expense of speed. -Os 
>>> enables all
>>>  -O2 optimizations that do not typically increase code size.
>>>  However, instructions are chosen for best performance, 
>>> regardless
>>>  of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>>  ONLY).
>>> 
>>> I have recently  been working on a benchmark suite to test a RISC-V JIT 
>>> engine. I have performed all testing using GCC 7.1 as the baseline 
>>> compiler, and during the process I have collected several performance 
>>> metrics, some that are neutral to the JIT runtime environment. In 
>>> particular I have made performance comparisons between -Os and -O3 on x86, 
>>> along with capturing executable file sizes, dynamic retired instruction and 
>>> micro-op counts for x86, dynamic retired instruction counts for RISC-V as 
>>> well as dynamic register and instruction usage histograms for RISC-V, for 
>>> both -Os and -O3.
>>> 
>>> See the Optimisation section for a charted performance comparison between 
>>> -O3 and -Os. There are dozens of other plots that show the differences 
>>> between -Os and -O3.
>>> 
>>>   - https://rv8.io/bench
>>> 
>>> The Geomean on x86 shows a 19% performance hit for -Os vs -O3 on x86. The 
>>> Geomean of course smooths over some pathological cases where -Os 
>>> performance is severely degraded versus -O3 but not with significant, or 
>>> commensurate savings in size.
>> 
>> 
>> First let me put into some perspective on -Os usage and some history:
>> 1) -Os is not useful for non-embedded users
>> 2) the embedded

Redundant sign-extension instructions on RISC-V

2017-08-29 Thread Michael Clark
Dear GCC folk,


# Issue Background

We’re investigating an issue with redundant sign-extension instructions being 
emitted with the riscv backend. Firstly I would like to state that riscv is 
possibly a unique backend with respect to its canonical sign-extended register 
form due to the following:

- POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit 
operations on RV64 is sign-extended not zero-extended.

- PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to 
signed DI mode values holding SI mode subregs

- RISC-V does not have register aliases for these different modes, rather 
different instructions are selected to operate on different register widths.

- The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, etc.

- Some instructions such as logical ops only have full word width variants 
(AND, OR, XOR) but these instructions maintain canonical form as there is no 
horizontal movement of bits.


# Issue Details

During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define 
TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to 
wider modes. This was done to ensure ABI correctness so that registers holding 
narrower modes always contain canonical sign extended values.

The result of this is that sign_extend operations are inserted but later 
eliminated in ree/simplify_rtx. In testcase 3 the sign_extend is correctly 
eliminated, and indeed most 32-bit instructions are handled correctly. This is 
what the pattern looks like for a typical 32-bit instruction after expand:

;;
;; Full RTL generated for this function:
;;
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ])
   (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1
(nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SI 74)
   (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0)
   (const_int 24 [0x18]))) "lshift.c":1 -1
(nil))
(insn 7 6 8 2 (set (reg:DI 75)
   (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1
(nil))
(insn 8 7 12 2 (set (reg:DI 72 [  ])
   (reg:DI 75)) "lshift.c":1 -1
(nil))
(insn 12 8 13 2 (set (reg/i:DI 10 a0)
   (reg:DI 72 [  ])) "lshift.c":1 -1
(nil))
(insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1
(nil))


# Problem test cases

- testcase 1 - open coded bswap - redundant sign extension instructions
- testcase 2 - open coded right shift - redundant sign extension instructions
- testcase 3 - open coded left shift - control / correct behaviour

It appears that the downside to the PROMOTE_MODE transition is that several 
redundant sign-extension operations are cropping up under specific 
circumstances. One of the first small isolators is testcase 1 (below), which is 
an open-coded bswap. You’ll notice the SEXT.W emitted before the return. This 
was then further isolated to an even simpler testcase 2 which is a single right 
shift which also emits SEXT.W before the return. A 32-bit left shift or other 
32-bit operations correctly have the redundant sign_extend eliminated.
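
The testcases themselves are not attached here, but hypothetical 
reconstructions of what is described above would look something like the 
following (the left shift matches the rs1/lshift.c RTL dump above; the shift 
amount in the right-shift variant is a guess):

/* testcase 3 (control): 32-bit left shift, redundant sign_extend eliminated */
int lshift(int rs1) { return rs1 << 24; }

/* testcase 2: 32-bit logical right shift, a redundant sext.w remains */
unsigned int rshift(unsigned int rs1) { return rs1 >> 24; }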

We have isolated the code that prevents the right shift from having its 
redundant sign extension eliminated, and it is target independent code in 
simplify-rtx. Code in simplify_unary_operation_1 assumes zero extension is more 
efficient, likely an assumption based on the fact that most platforms define 
POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero-extended 
form rather than sign-extended form. The redundant sign extension appears to be 
due to an optimisation in simplify-rtx that generates a zero extend for right 
shifts where the shift amount is non-zero, based on the assumption that zero 
extend is "better". This is not true on platforms that define 
POINTERS_EXTEND_UNSIGNED to false, i.e. that naturally promote subregister 
width operations to canonical sign-extended form internally.


# Partial resolution of test case 2

We have prepared a patch that disables the zero extend optimisation on 
platforms that define POINTERS_EXTEND_UNSIGNED to false. It fixes the issue in 
testcase 2. We believe this fix is very low risk because a right shift by a 
non-zero amount will always have a zero sign bit, thus either sign or zero 
extension can be used (this is in the comment); however sign-extension is 
"better" on RISC-V and likely all other platforms with 
POINTERS_EXTEND_UNSIGNED false. Platforms like x86, Aarch64 and most others 
define POINTERS_EXTEND_UNSIGNED to 1 so would not be affected by this change.

Please review this candidate patch to fix the codegen issue for testcase 2.


diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index ce632ae..25dd70f 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, 
machine_mode mode, rtx op)
  /* (sign_extend:M (lshiftrt:N  (const_int I))) is better as
 (zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0.  */
  if (GET_

Re: Redundant sign-extension instructions on RISC-V

2017-08-29 Thread Michael Clark

> On 30 Aug 2017, at 12:36 PM, Michael Clark  wrote:
> 
> Dear GCC folk,
> 
> 
> # Issue Background
> 
> We’re investigating an issue with redundant sign-extension instructions being 
> emitted with the riscv backend. Firstly I would like to state that riscv is 
> possibly a unique backend with respect to its canonical sign-extended 
> register form due to the following:
> 
> - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit 
> operations on RV64 is sign-extended not zero-extended.
> 
> - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to 
> signed DI mode values holding SI mode subregs
> 
> - RISC-V does not have register aliases for these different modes, rather 
> different instructions are selected to operate on different register widths.
> 
> - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, etc.
> 
> - Some instructions such as logical ops only have full word width variants 
> (AND, OR, XOR) but these instructions maintain canonical form as there is no 
> horizontal movement of bits.
> 
> 
> # Issue Details
> 
> During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define 
> TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to 
> wider modes. This was done to ensure ABI correctness so that registers 
> holding narrower modes always contain canonical sign extended values.
> 
> The result of this is that sign_extend operations are inserted but later 
> eliminated in ree/simplyfy_rtx. In testcase 3 the sign_extend is correctly 
> eliminated, and indeed most 32-bit instructions are handled correctly. This 
> is what the pattern looks like for a typical 32-bit instruction after expand:
> 
> ;;
> ;; Full RTL generated for this function:
> ;;
> (note 1 0 4 NOTE_INSN_DELETED)
> (note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ])
>   (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1
>(nil))
> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
> (insn 6 3 7 2 (set (reg:SI 74)
>   (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0)
>   (const_int 24 [0x18]))) "lshift.c":1 -1
>(nil))
> (insn 7 6 8 2 (set (reg:DI 75)
>   (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1
>(nil))
> (insn 8 7 12 2 (set (reg:DI 72 [  ])
>   (reg:DI 75)) "lshift.c":1 -1
>(nil))
> (insn 12 8 13 2 (set (reg/i:DI 10 a0)
>   (reg:DI 72 [  ])) "lshift.c":1 -1
>(nil))
> (insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1
>(nil))
> 
> 
> # Problem test cases
> 
> - testcase 1 - open coded bswap - redundant sign extension instructions
> - testcase 2 - open coded right shift - redundant sign extension instructions
> - testcase 3 - open coded left shift - control / correct behaviour
> 
> It appears that the downside to the PROMOTE_MODE transition is that several 
> redundant sign-extension operations are cropping up under specific 
> circumstances. One of the first small isolators is testcase 1 (below), which 
> is an open-coded bswap. You’ll notice the SEXT.W emitted before the return. 
> This was then further isolated to an even simpler testcase 2 which is a 
> single right shift which also emits SEXT.W before the return. A 32-bit left 
> shift or other 32-bit operations correctly have the redundant sign_extend 
> eliminated.
> 
> We have isolated the code that prevented the right shift from having its 
> redundant sign extension eliminated and it is target independent code in 
> simplify-rtx. Code in simplify_unary_operation_1 assumes zero extension is 
> more efficient, likely an assumption based on the fact that most platforms 
> define POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero 
> extended form vs sign extended form. The redundant sign extension appears to 
> be due to an optimisation in simply_rtx that generates zero extend for right 
> shifts where the value is non zero, based on the assumption that zero extend 
> is” better”. This is not true on platforms that define 
> POINTERS_EXTEND_UNSIGNED to false i.e. naturally promote subregister width 
> operations to canonical sign-extended form internally.
> 
> 
> # Partial resolution of test case 2
> 
> We have prepared a patch that disables the zero extend optimisation on 
> platforms that define POINTERS_EXTEND_UNSIGNED. It fixes the issue in 
> testcase 2. We believe this fix is very low risk because right shift for 
> values above zero will always have a zero sign bit, thus either sign or zero 
> extension can be used (this is in the comment), however sign-extension is 
> “better” on RISC-V and likely all other platforms with 
> 

Re: Redundant sign-extension instructions on RISC-V

2017-08-30 Thread Michael Clark

> On 30 Aug 2017, at 9:14 PM, Richard Biener  wrote:
> 
> On Wed, Aug 30, 2017 at 2:36 AM, Michael Clark  wrote:
>> Dear GCC folk,
>> 
>> 
>> # Issue Background
>> 
>> We’re investigating an issue with redundant sign-extension instructions 
>> being emitted with the riscv backend. Firstly I would like to state that 
>> riscv is possibly a unique backend with respect to its canonical 
>> sign-extended register form due to the following:
>> 
>> - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit 
>> operations on RV64 is sign-extended not zero-extended.
>> 
>> - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to 
>> signed DI mode values holding SI mode subregs
>> 
>> - RISC-V does not have register aliases for these different modes, rather 
>> different instructions are selected to operate on different register widths.
>> 
>> - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, 
>> etc.
>> 
>> - Some instructions such as logical ops only have full word width variants 
>> (AND, OR, XOR) but these instructions maintain canonical form as there is no 
>> horizontal movement of bits.
>> 
>> 
>> # Issue Details
>> 
>> During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to 
>> define TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote 
>> values to wider modes. This was done to ensure ABI correctness so that 
>> registers holding narrower modes always contain canonical sign extended 
>> values.
>> 
>> The result of this is that sign_extend operations are inserted but later 
>> eliminated in ree/simplyfy_rtx. In testcase 3 the sign_extend is correctly 
>> eliminated, and indeed most 32-bit instructions are handled correctly. This 
>> is what the pattern looks like for a typical 32-bit instruction after expand:
>> 
>> ;;
>> ;; Full RTL generated for this function:
>> ;;
>> (note 1 0 4 NOTE_INSN_DELETED)
>> (note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
>> (insn 2 4 3 2 (set (reg/v:DI 73 [ rs1 ])
>>   (reg:DI 10 a0 [ rs1 ])) "lshift.c":1 -1
>>(nil))
>> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
>> (insn 6 3 7 2 (set (reg:SI 74)
>>   (ashift:SI (subreg/s/u:SI (reg/v:DI 73 [ rs1 ]) 0)
>>   (const_int 24 [0x18]))) "lshift.c":1 -1
>>(nil))
>> (insn 7 6 8 2 (set (reg:DI 75)
>>   (sign_extend:DI (reg:SI 74))) "lshift.c":1 -1
>>(nil))
>> (insn 8 7 12 2 (set (reg:DI 72 [  ])
>>   (reg:DI 75)) "lshift.c":1 -1
>>(nil))
>> (insn 12 8 13 2 (set (reg/i:DI 10 a0)
>>   (reg:DI 72 [  ])) "lshift.c":1 -1
>>(nil))
>> (insn 13 12 0 2 (use (reg/i:DI 10 a0)) "lshift.c":1 -1
>>(nil))
>> 
>> 
>> # Problem test cases
>> 
>> - testcase 1 - open coded bswap - redundant sign extension instructions
>> - testcase 2 - open coded right shift - redundant sign extension instructions
>> - testcase 3 - open coded left shift - control / correct behaviour
>> 
>> It appears that the downside to the PROMOTE_MODE transition is that several 
>> redundant sign-extension operations are cropping up under specific 
>> circumstances. One of the first small isolators is testcase 1 (below), which 
>> is an open-coded bswap. You’ll notice the SEXT.W emitted before the return. 
>> This was then further isolated to an even simpler testcase 2 which is a 
>> single right shift which also emits SEXT.W before the return. A 32-bit left 
>> shift or other 32-bit operations correctly have the redundant sign_extend 
>> eliminated.
>> 
>> We have isolated the code that prevented the right shift from having its 
>> redundant sign extension eliminated and it is target independent code in 
>> simplify-rtx. Code in simplify_unary_operation_1 assumes zero extension is 
>> more efficient, likely an assumption based on the fact that most platforms 
>> define POINTERS_EXTEND_UNSIGNED to true and naturally promote words to zero 
>> extended form vs sign extended form. The redundant sign extension appears to 
>> be due to an optimisation in simply_rtx that generates zero extend for right 
>> shifts where the value is non zero, based on the assumption that zero extend 
>> is” better”. This is not true on platforms that define 
>> POINTERS_EXTEND_UNSIGNED to false i.e. naturally promote subregister width 
>> operations to canonical sign-extended form internally.
>> 
>> 
>> # Partial resolution of test case 2
>> 
>

Re: Redundant sign-extension instructions on RISC-V

2017-08-30 Thread Michael Clark

> On 30 Aug 2017, at 9:43 PM, Michael Clark  wrote:
> 
>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>>> index ce632ae..25dd70f 100644
>>> --- a/gcc/simplify-rtx.c
>>> +++ b/gcc/simplify-rtx.c
>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, 
>>> machine_mode mode, rtx op)
>>> /* (sign_extend:M (lshiftrt:N  (const_int I))) is better as
>>>(zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0.  */
>>> if (GET_CODE (op) == LSHIFTRT
>>> +#if defined(POINTERS_EXTEND_UNSIGNED)
>>> +  /* we skip this optimisation if pointers naturally extend signed */
>>> + && POINTERS_EXTEND_UNSIGNED
>>> +#endif
>>>&& CONST_INT_P (XEXP (op, 1))
>>>&& XEXP (op, 1) != const0_rtx)
>>>  return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));
>> 
>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != 
>> ptr_mode
>> check?  Note the comment says exactly the opposite as the transform...
>> 
>> I’m not even sure why this simplification is correct in the first place?!
> 
> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a proxy 
> for the property that defines whether sub width operations sign-extend to the 
> full width of the register vs zero extend. Are you taking about our added 
> comment?

Read it as:

riscv.h #define REGISTERS_EXTEND_UNSIGNED false

+#if defined(REGISTERS_EXTEND_UNSIGNED)
+  /* we skip this optimisation if registers naturally extend signed */
+ && REGISTERS_EXTEND_UNSIGNED
+#endif

The comment is the inverse because RISC-V registers naturally extend signed, so 
a hypothetical REGISTERS_EXTEND_UNSIGNED would be false; likewise GCC's 
promotions for RISC-V are signed rather than unsigned.

Re: Redundant sign-extension instructions on RISC-V

2017-08-30 Thread Michael Clark

> On 31 Aug 2017, at 7:20 AM, Matthew Fortune  
> wrote:
> 
> Jeff Law  writes:
>> On 08/30/2017 06:52 AM, Richard Biener wrote:
>>> On Wed, Aug 30, 2017 at 11:53 AM, Michael Clark  
>>> wrote:
>>>> 
>>>>> On 30 Aug 2017, at 9:43 PM, Michael Clark  wrote:
>>>>> 
>>>>>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>>>>>>> index ce632ae..25dd70f 100644
>>>>>>> --- a/gcc/simplify-rtx.c
>>>>>>> +++ b/gcc/simplify-rtx.c
>>>>>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, 
>>>>>>> machine_mode
>> mode, rtx op)
>>>>>>>/* (sign_extend:M (lshiftrt:N  (const_int I))) is better as
>>>>>>>   (zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0.  */
>>>>>>>if (GET_CODE (op) == LSHIFTRT
>>>>>>> +#if defined(POINTERS_EXTEND_UNSIGNED)
>>>>>>> +  /* we skip this optimisation if pointers naturally extend signed 
>>>>>>> */
>>>>>>> + && POINTERS_EXTEND_UNSIGNED
>>>>>>> +#endif
>>>>>>>   && CONST_INT_P (XEXP (op, 1))
>>>>>>>   && XEXP (op, 1) != const0_rtx)
>>>>>>> return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));
>>>>>> 
>>>>>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != 
>>>>>> ptr_mode
>>>>>> check?  Note the comment says exactly the opposite as the transform...
>>>>>> 
>>>>>> I’m not even sure why this simplification is correct in the first place?!
>>>>> 
>>>>> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a 
>>>>> proxy for the
>> property that defines whether sub width operations sign-extend to the full 
>> width of the
>> register vs zero extend. Are you taking about our added comment?
>>> 
>>> I'm talking about using POINTERS_EXTEND_UNSIGNED for sth that looks
>>> unrelated (and that has no easy way to be queried as you noted).
>> Right.  I was going to make the same observation.  I can't see how
>> POINTER_EXTEND_UNSIGNED plays a significant role here.
>> 
>> MIPS has similar properties and my recollection is they did some
>> interesting tricks in the target files to fold the extension back into
>> the arithmetic insns (beyond the usual LOAD_EXTEND_OP,
>> WORD_REGISTER_OPERATIONS, TRULY_NOOP_TRUNCATION, and PROMOTE_MODE stuff).
> 
> If there is a condition to add I would have expected it to be based around
> WORD_REGISTER_OPERATIONS etc too.

Okay. Thanks.

To disable the right shift zero extend optimisation we can perhaps then look 
into an expression that uses WORD_REGISTER_OPERATIONS && LOAD_EXTEND_OP(mode) 
== SIGN_EXTEND, or whatever semantically means "sub-word operations on SImode 
are sign-extended rather than zero-extended".
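
As a rough sketch (not a tested patch, and whether these macros can be used in 
exactly this form at this point would need checking), the guard might become 
something like:

  /* (sign_extend:M (lshiftrt:N X (const_int I))) is better as
     (zero_extend:M (lshiftrt:N X (const_int I))) if I is not 0,
     unless sub-word values are kept sign-extended in registers.  */
  if (GET_CODE (op) == LSHIFTRT
      && !(WORD_REGISTER_OPERATIONS
           && LOAD_EXTEND_OP (GET_MODE (op)) == SIGN_EXTEND)
      && CONST_INT_P (XEXP (op, 1))
      && XEXP (op, 1) != const0_rtx)
    return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));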

>> My recollection was they defined their key insns with match_operators
>> that allowed the sign extension to occur in the arithmetic insns.  BUt I
>> don't see any evidence of that anymore.  But I can distinctly remember
>> discussing it with Ian and Meissner eons ago and its impact on reload in
>> particular.
> 
> I see that riscv has chosen to not allow ior/and/xor with SImode as named
> patterns but instead just for combine to pick up. Given that the
> architecture has almost all the same properties as MIPS I don't follow why
> the SImode version is not allowed at expand time. MIPS relies on all SImode
> values being in a canonical sign extended form at all points and can
> therefore freely represent the dual (or rather no) mode operations, such as
> comparisons and logical operations, as both SI and DI mode. This pretty much
> solves the redundant sign extension issues. Just because the logical
> operations only exist as '64-bit' operations in the 64-bit architecture does
> not mean you can't tell GCC there are 32-bit versions as well; you just have
> to present a logical view of the architecture rather than being overly
> strict. LLVM for MIPS went through similar issues and I suspect RISCV will
> hit the same kind of issues but the same solution was used and both 32bit
> and 64-bit operations described with the same underlying instruction.
> 
> Is there an architectural difference that means riscv can't do the same
> thing?

I don’t believe so.

We were seeking guidance and in the process of discovering and figuring out how 
to best solve these issues as they cropped up.

Changing the logical op patterns to match 32-bit operations on RV64 seems like 
a sensible approach. We’ll also have to look at the SLT/SLTU (Set Less Than/Set 
Less Than Unsigned) patterns as they don’t have 32-bit versions; and verify 
they are doing the right thing with respect to modes and sign extension…

We’ll work on a revised patch and circle back…

Thanks,
Michael

Re: Redundant sign-extension instructions on RISC-V

2017-08-30 Thread Michael Clark

> On 31 Aug 2017, at 2:12 PM, Michael Clark  wrote:
> 
>> 
>> On 31 Aug 2017, at 7:20 AM, Matthew Fortune  
>> wrote:
>> 
>> Jeff Law  writes:
>>> On 08/30/2017 06:52 AM, Richard Biener wrote:
>>>> On Wed, Aug 30, 2017 at 11:53 AM, Michael Clark  
>>>> wrote:
>>>>> 
>>>>>> On 30 Aug 2017, at 9:43 PM, Michael Clark  wrote:
>>>>>> 
>>>>>>>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>>>>>>>> index ce632ae..25dd70f 100644
>>>>>>>> --- a/gcc/simplify-rtx.c
>>>>>>>> +++ b/gcc/simplify-rtx.c
>>>>>>>> @@ -1503,6 +1503,10 @@ simplify_unary_operation_1 (enum rtx_code code, 
>>>>>>>> machine_mode
>>> mode, rtx op)
>>>>>>>>   /* (sign_extend:M (lshiftrt:N  (const_int I))) is better as
>>>>>>>>  (zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0.  */
>>>>>>>>   if (GET_CODE (op) == LSHIFTRT
>>>>>>>> +#if defined(POINTERS_EXTEND_UNSIGNED)
>>>>>>>> +  /* we skip this optimisation if pointers naturally extend 
>>>>>>>> signed */
>>>>>>>> + && POINTERS_EXTEND_UNSIGNED
>>>>>>>> +#endif
>>>>>>>>  && CONST_INT_P (XEXP (op, 1))
>>>>>>>>  && XEXP (op, 1) != const0_rtx)
>>>>>>>>return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));
>>>>>>> 
>>>>>>> Is it just me or does this miss a || mode != Pmode || GET_MODE (op) != 
>>>>>>> ptr_mode
>>>>>>> check?  Note the comment says exactly the opposite as the transform...
>>>>>>> 
>>>>>>> I’m not even sure why this simplification is correct in the first 
>>>>>>> place?!
>>>>>> 
>>>>>> I hope you are not confusing my use of POINTERS_EXTEND_UNSIGNED as a 
>>>>>> proxy for the
>>> property that defines whether sub width operations sign-extend to the full 
>>> width of the
>>> register vs zero extend. Are you talking about our added comment?
>>>> 
>>>> I'm talking about using POINTERS_EXTEND_UNSIGNED for sth that looks
>>>> unrelated (and that has no easy way to be queried as you noted).
>>> Right.  I was going to make the same observation.  I can't see how
>>> POINTER_EXTEND_UNSIGNED plays a significant role here.
>>> 
>>> MIPS has similar properties and my recollection is they did some
>>> interesting tricks in the target files to fold the extension back into
>>> the arithmetic insns (beyond the usual LOAD_EXTEND_OP,
>>> WORD_REGISTER_OPERATIONS, TRULY_NOOP_TRUNCATION, and PROMOTE_MODE stuff).
>> 
>> If there is a condition to add I would have expected it to be based around
>> WORD_REGISTER_OPERATIONS etc too.
> 
> Okay. Thanks.
> 
> To disable the right-shift zero-extend optimisation we can perhaps then look 
> into an expression that uses WORD_REGISTER_OPERATIONS && LOAD_EXTEND_OP(mode) 
> == SIGN_EXTEND, or whatever semantically means "sub-word operations on SImode 
> are sign-extended rather than zero-extended".

I have a new patch that restricts the right-shift zero-extend optimisation to 
platforms that zero-extend. The sign_extend op is DImode, so we check the mode 
of the enclosed shift expression, hence XEXP (op, 0). Hopefully we have used 
the correct macros this time. The comments read a little more clearly in this 
version. I’ve amended the prelude comment to mention that the optimisation 
works for platforms that zero-extend, and to state that it is skipped for 
platforms that sign-extend.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index ce632ae..3b6e5eb 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -1501,8 +1501,14 @@ simplify_unary_operation_1 (enum rtx_code code, machine_mode mode, rtx op)
}
 
   /* (sign_extend:M (lshiftrt:N  (const_int I))) is better as
- (zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0.  */
+ (zero_extend:M (lshiftrt:N  (const_int I))) if I is not 0
+ and the platform zero extends narrower operations */
   if (GET_CODE (op) == LSHIFTRT
+#if defined(WORD_REGISTER_OPERATIONS) && defined(LOAD_EXTEND_OP)
+  /* we skip on platform that sign extend narrower operations */
+ && !(WORD_REGISTER_OPERATIONS &&
+  LOAD_EXTEND_OP(GET_MODE (XEXP (op, 0))) == SIGN_EXTEND)
+#endif
 

Re: Bit-field struct member sign extension pattern results in redundant

2017-09-04 Thread Michael Clark

> On 19 Aug 2017, at 4:10 AM, Richard Henderson  wrote:
> 
> On 08/17/2017 03:29 PM, Michael Clark wrote:
>> hand coded x86 asm (no worse because the sar depends on the lea)
>> 
>>  sx5(int):
>>shl edi, 27
>>sar edi, 27
>>movsx eax, dl
> 
> Typo in the register, but I know what you mean.  More interestingly, edi
> already has the sign-extended value, so "mov eax, edi" sufficies (saving one
> byte or perhaps allowing better register allocation).
> 
> That said, if anyone is tweaking x86-64 patterns on this, consider
> 
> sx5(int):
>   rorxeax, edi, 5
>   sar eax, 27
> 
> where the (newish) bmi2 rorx instruction allows the data to be moved into 
> place
> while rotating, avoiding the extra move entirely.

The patterns might be hard to match unless we can get the gimple expansions for 
bitfield access to use the mode of the enclosing type.

For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is 
generating two shifts on QI mode sub-registers, each with a sign-extend.

For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is 
generating two shifts on HI mode sub-registers, each with a sign-extend.

The bitfield expansion logic is selecting the narrowest type that can hold the 
bitwidth of the field, versus the bitwidth of the enclosing type and this 
appears to be in the gimple to rtl transform, possibly in expr.c.

Using the mode of the enclosing type for bitfield member access expansion would 
likely solve this problem. I suspect that with the use of caches and word sized 
register accesses, that this sort of change would be reasonable to make for 
32-bit and 64-bit targets.

I also noticed something I didn’t spot earlier. On RV32 the sign extend and 
shifts are correctly coalesced into 27 bit shifts in the combine stage. We are 
presently looking into another redundant sign extension issue on RV64 that 
could potentially be related. It could be that the shift coalescing 
optimisation doesn’t happen unless the redundant sign extensions are eliminated 
early in combine by simplify_rtx. i.e. the pattern is more complex due to 
sign_extend ops that are not eliminated.

- https://cx.rv8.io/g/2FxpNw

RV64 and Aarch64 both appear to have the issue but with different expansions 
for the shift and extend pattern due to the mode of access (QI or HI). Field 
accesses above 16 bits create SI mode accesses and generate the expected code. 
The RISC-V compiler has the #define SLOW_BYTE_ACCESS 1 patch; however, it 
appears to make no difference in this case. SLOW_BYTE_ACCESS suppresses QI mode 
and HI mode loads in some bitfield test cases when a struct is passed by 
pointer, but has no effect on this particular issue. This shows the codegen for 
the fix to the SLOW_BYTE_ACCESS issue, i.e. proof that the compiler has 
SLOW_BYTE_ACCESS defined to 1.

- https://cx.rv8.io/g/TyXnoG

A target-independent fix that would solve the issue on ARM and RISC-V would be 
to access bitfield members with the mode of the bitfield member's enclosing 
type instead of the smallest mode that can hold the bitwidth of the type. If we 
had a char or short member in the struct, I can understand the use of QI and HI 
mode, as we would need narrower accesses due to alignment issues, but in this 
case the member is an int and one would expect this to expand to SI mode 
accesses if the enclosing type is SI mode.
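
For reference, a minimal sketch of the kind of test case being discussed (the
exact source isn't included in this thread, so the struct and field names here
are assumptions):

  /* hypothetical reconstruction: signed bit-fields of an enclosing int,
     returned so the compiler must sign-extend the field value */
  struct bf {
      int a : 5;
      int b : 17;
  };

  int sx5(int v)  { struct bf s; s.a = v; return s.a; }
  int sx17(int v) { struct bf s; s.b = v; return s.b; }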


riscv64:

sx5(int):
  slliw a0,a0,3
  slliw a0,a0,24
  sraiw a0,a0,24
  sraiw a0,a0,3
  ret

sx17(int):
  slliw a0,a0,15
  sraiw a0,a0,15
  ret

riscv32:

sx5(int):
  slli a0,a0,27
  srai a0,a0,27
  ret

sx17(int):
  slli a0,a0,15
  srai a0,a0,15
  ret

aarch64:

sx5(int):
  sbfiz w0, w0, 3, 5
  asr w0, w0, 3
  ret

sx17(int):
  sbfx w0, w0, 0, 17
  ret




Re: RFC: Improving GCC8 default option settings

2017-09-12 Thread Michael Clark

> On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra  wrote:
> 
> Hi all,
> 
> At the GNU Cauldron I was inspired by several interesting talks about 
> improving
> GCC in various ways. While GCC has many great optimizations, a common theme is
> that its default settings are rather conservative. As a result users are 
> required to enable several additional optimizations by hand to get good code.
> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
> mentioned repeatedly) which GCC could/should do as well.

There are some nuances to -O2. Please consider -O2 users who wish to use it like 
Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).

Clang/LLVM has an -Os that is like -O2, so optimisations that increase code size 
can be skipped from -Os without drastically affecting performance.

This is not the case with GCC, where -Os is a size-at-all-costs optimisation 
mode. The GCC user's option for size, but not at the expense of speed, is to use -O2.

Clang   GCC
-Oz ~=  -Os
-Os ~=  -O2

So if adding optimisations to -O2 that increase code size, please consider 
adding an -O2s that maintains the compact code size of -O2. -O2 generates 
pretty compact code, as many performance optimisations tend to reduce code size; 
otherwise, add optimisations that increase code size to -O3. Adding loop 
unrolling makes sense in the Clang/LLVM context, where they have a compact 
code model with good performance, i.e. -Os. In GCC this is -O2.

So if you want to enable more optimisations at -O2, please copy -O2 
optimisations to -O2s or rename -Os to -Oz and copy -O2 optimisation defaults 
to a new -Os.

The present reality is that any project that wishes to optimize for size at all 
costs will need to run a configure test for -Oz, and then fall back to -Os, 
given the current disparity between Clang/LLVM and GCC flags here.

> Here are a few concrete proposals to improve GCC's option settings which will
> enable better code generation for most targets:
> 
> * Make -fno-math-errno the default - this mostly affects the code generated 
> for
>  sqrt, which should be treated just like floating point division and not set
>  errno by default (unless you explicitly select C89 mode).
> 
> * Make -fno-trapping-math the default - another obvious one. From the docs:
>  "Compile code assuming that floating-point operations cannot generate 
>   user-visible traps."
>  There isn't a lot of code that actually uses user-visible traps (if any -
>  many CPUs don't even support user traps as it's an optional IEEE feature). 
>  So assuming trapping math by default is way too conservative since there is
>  no obvious benefit to users. 
> 
> * Make -fno-common the default - this was originally needed for pre-ANSI C, 
> but
>  is optional in C (not sure whether it is still in C99/C11). This can
>  significantly improve code generation on targets that use anchors for globals
>  (note the linker could report a more helpful message when ancient code that
>  requires -fcommon fails to link).
> 
> * Make -fomit-frame-pointer the default - various targets already do this at
>  higher optimization levels, but this could easily be done for all targets.
>  Frame pointers haven't been needed for debugging for decades, however if 
> there
>  are still good reasons to keep it enabled with -O0 or -O1 (I can't think of 
> any
>  unless it is for last-resort backtrace when there is no unwind info at a 
> crash),
>  we could just disable the frame pointer from -O2 onwards.
> 
> These are just a few ideas to start. What do people think? I'd welcome 
> discussion
> and other proposals for similar improvements.
> 
> Wilco



Re: RFC: Improving GCC8 default option settings

2017-09-12 Thread Michael Clark

> On 13 Sep 2017, at 12:47 PM, Segher Boessenkool  
> wrote:
> 
> On Wed, Sep 13, 2017 at 09:27:22AM +1200, Michael Clark wrote:
>>> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
>>> mentioned repeatedly) which GCC could/should do as well.
>> 
>> There are some nuances to -O2. Please consider -O2 users who wish to use it 
>> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
>> 
>> Clang/LLVM has an -Os that is like -O2, so optimisations that increase 
>> code size can be skipped from -Os without drastically affecting performance.
>> 
>> This is not the case with GCC, where -Os is a size-at-all-costs optimisation 
>> mode. The GCC user's option for size, but not at the expense of speed, is to use -O2.
> 
> "Size not at the expense of speed" exists in neither compiler.  Just the
> tradeoffs are different between GCC and LLVM.  It would be a silly
> optimisation target -- it's exactly the same as just "speed"!  Unless
> “speed" means "let's make it faster, and bigger just because" ;-)

I would like to be able to quantify this with stats on a well-known benchmark 
suite, say SPECint 2006 or SPECint 2017, but in my own small benchmark suite I 
saw a disproportionate trade-off between -O2 and -Os: only a small difference 
in size, but a significant drop in performance with -Os versus -O2.

- https://rv8.io/bench#optimisation
- https://rv8.io/bench#executable-file-sizes

-O2 is 98% perf of -O3 on x86-64
-Os is 81% perf of -O3 on x86-64

-O2 saves 5% space on -O3 on x86-64
-Os saves 8% space on -Os on x86-64

17% drop in performance for 3% saving in space is not a good trade for a 
“general” size optimisation. It’s more like executable compression.

-O2 seems to be a sweet spot for size versus speed.

I could only recommend GCC’s -Os if the user is trying to squeeze something 
down to fit the last few bytes of a ROM, and -Oz seems like a more appropriate 
name for it.

-O2 is the current sweet spot in GCC, is likely closest in semantics to 
LLVM/Clang -Os, and I’d like -O2 binaries to stay lean.

I don’t think -O2 should slow down, nor should the binaries get larger. Turning 
up knobs that affect code size should be reserved for -O3 until GCC makes a 
distinction between -O2/-O2s and -Os/-Oz.

On RISC-V I believe we could shrink binaries at -O2 further with no sacrifice 
in performance, perhaps with a performance improvement by reducing icache 
bandwidth…

BTW -O2 gets great compression and performance improvements compared to -O0 ;-D 
it’s the points after -O2 where the trade offs don’t correlate.

I like -O2

My 2c.

> GCC's -Os is not "size at all costs" either; there are many options (mostly
> --params) that can decrease code size significantly.  To tune code size
> down for your particular program you have to play with options a bit.  This
> shouldn't be news to anyone.
> 
> '-Os'
> Optimize for size.  '-Os' enables all '-O2' optimizations that do
> not typically increase code size.  It also performs further
> optimizations designed to reduce code size.
> 
>> So if adding optimisations to -O2 that increase code size, please 
>> consider adding an -O2s that maintains the compact code size of -O2. -O2 
>> generates pretty compact code as many performance optimisations tend to 
>> reduce code size, or otherwise add optimisations that increase code size to 
>> -O3. Adding loop unrolling makes sense in the Clang/LLVM context where 
>> they have a compact code model with good performance i.e. -Os. In GCC this 
>> is -O2.
>> 
>> So if you want to enable more optimisations at -O2, please copy -O2 
>> optimisations to -O2s or rename -Os to -Oz and copy -O2 optimisation 
>> defaults to a new -Os.
> 
> '-O2'
> Optimize even more.  GCC performs nearly all supported
> optimizations that do not involve a space-speed tradeoff.  As
> compared to '-O', this option increases both compilation time and
> the performance of the generated code.
> 
>> The present reality is that any project that wishes to optimize for size at 
>> all costs will need to run a configure test for -Oz, and then fall back to 
>> -Os, given the current disparity between Clang/LLVM and GCC flags here.
> 
> The present reality is that any project that wishes to support both GCC and
> LLVM needs to do configure tests, because LLVM chose to do many things
> differently (sometimes unavoidably).  If GCC would change some options
> to be more like LLVM, all users only ever using GCC would be affected,
> while all other incompatibilities would remain.  Not a good tradeoff at
> all.
> 
> 
> Segher



Re: RFC: Improving GCC8 default option settings

2017-09-12 Thread Michael Clark

> On 13 Sep 2017, at 1:15 PM, Michael Clark  wrote:
> 
> - https://rv8.io/bench#optimisation
> - https://rv8.io/bench#executable-file-sizes
> 
> -O2 is 98% perf of -O3 on x86-64
> -Os is 81% perf of -O3 on x86-64
> 
> -O2 saves 5% space on -O3 on x86-64
> -Os saves 8% space on -Os on x86-64
> 
> 17% drop in performance for 3% saving in space is not a good trade for a 
> “general” size optimisation. It’s more like executable compression.

Sorry fixed typo:

-O2 is 98% perf of -O3 on x86-64
-Os is 81% perf of -O3 on x86-64

-O2 saves 5% space on -O3 on x86-64
-Os saves 8% space on -O3 on x86-64

The extra ~3% space saving for ~17% drop in performance doesn’t seem like a 
good general option for size based on the cost in performance.

Again. I really like GCC’s -O2 and hope that its binaries don’t grow in size 
nor slow down.

Re: Bit-field struct member sign extension pattern results in redundant

2017-09-13 Thread Michael Clark

> On 5 Sep 2017, at 9:35 AM, Michael Clark  wrote:
> 
>> 
>> On 19 Aug 2017, at 4:10 AM, Richard Henderson  wrote:
>> 
>> On 08/17/2017 03:29 PM, Michael Clark wrote:
>>> hand coded x86 asm (no worse because the sar depends on the lea)
>>> 
>>> sx5(int):
>>>   shl edi, 27
>>>   sar edi, 27
>>>   movsx eax, dl
>> 
>> Typo in the register, but I know what you mean.  More interestingly, edi
>> already has the sign-extended value, so "mov eax, edi" sufficies (saving one
>> byte or perhaps allowing better register allocation).
>> 
>> That said, if anyone is tweaking x86-64 patterns on this, consider
>> 
>> sx5(int):
>>  rorxeax, edi, 5
>>  sar eax, 27
>> 
>> where the (newish) bmi2 rorx instruction allows the data to be moved into 
>> place
>> while rotating, avoiding the extra move entirely.
> 
> The patterns might be hard to match unless we can get the gimple expansions 
> for bitfield access to use the mode of the enclosing type.
> 
> For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is 
> generating two shifts on QI mode sub-registers, each with a sign-extend.
> 
> For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is 
> generating two shifts on HI mode sub-registers, each with a sign-extend.
> 
> The bitfield expansion logic is selecting the narrowest type that can hold 
> the bitwidth of the field, versus the bitwidth of the enclosing type and this 
> appears to be in the gimple to rtl transform, possibly in expr.c.
> 
> Using the mode of the enclosing type for bitfield member access expansion 
> would likely solve this problem. I suspect that with the use of caches and 
> word sized register accesses, that this sort of change would be reasonable to 
> make for 32-bit and 64-bit targets.
> 
> I also noticed something I didn’t spot earlier. On RV32 the sign extend and 
> shifts are correctly coalesced into 27 bit shifts in the combine stage. We 
> are presently looking into another redundant sign extension issue on RV64 
> that could potentially be related. It could be that the shift coalescing 
> optimisation doesn’t happen unless the redundant sign extensions are 
> eliminated early in combine by simplify_rtx. i.e. the pattern is more complex 
> due to sign_extend ops that are not eliminated.
> 
> - https://cx.rv8.io/g/2FxpNw
> 
> RV64 and Aarch64 both appear to have the issue but with different expansions 
> for the shift and extend pattern due to the mode of access (QI or HI). Field 
> accesses above 16 bits create SI mode accesses and generate the expected 
> code. The RISC-V compiler has the #define SLOW_BYTE_ACCESS 1 patch however it 
> appears to make no difference in this case. SLOW_BYTE_ACCESS suppresses QI 
> mode and HI mode loads in some bitfield test cases when a struct is passed by 
> pointer but has no effect on this particular issue. This shows the codegen 
> for the fix to the SLOW_BYTE_ACCESS issue, i.e. proof that the compiler has 
> SLOW_BYTE_ACCESS defined to 1.
> 
> - https://cx.rv8.io/g/TyXnoG
> 
> A target independent fix that would solve the issue on ARM and RISC-V would 
> be to access bitfield members with the mode of the bitfield member's 
> enclosing type instead of the smallest mode that can hold the bitwidth of the 
> type. If we had a char or short member in the struct, I can understand the 
> use of QI and HI mode, as we would need narrower accesses due to alignment 
> issues, but in this case the member is an int and one would expect this to 
> expand to SI mode accesses if the enclosing type is SI mode.

It appears that on 64-bit targets, promotion to an SI mode subregister of a DI 
mode register prevents combine from coalescing the shift and sign_extend. The 
only difference I can spot between RV32, which coalesces the shift during 
combine, and RV64 and other 64-bit targets, which do not, is that the 64-bit 
targets have promoted the shift operand to an SI mode subregister of a DI mode 
register, whereas on RV32 the shift operand is simply an SI mode register. I 
suspect there is some code in simplify-rtx.c that is missing the subregister 
case, however I’m still trying to isolate the code that coalesces shift and 
sign_extend.

I’ve found a similar, perhaps smaller, but related case where a shift is not 
coalesced; however, in this case it is also not coalesced on RV32:

https://cx.rv8.io/g/fPdk2F

> riscv64:
> 
>   sx5(int):
> slliw a0,a0,3
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,3
> ret
> 
>   sx17(int):
> slliw a0,a0,15
>

Re: RFC: Improving GCC8 default option settings

2017-09-14 Thread Michael Clark

> On 14 Sep 2017, at 3:06 AM, Allan Sandfeld Jensen  wrote:
> 
> On Tuesday, 12 September 2017 23:27:22 CEST Michael Clark wrote:
>>> On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra  wrote:
>>> 
>>> Hi all,
>>> 
>>> At the GNU Cauldron I was inspired by several interesting talks about
>>> improving GCC in various ways. While GCC has many great optimizations, a
>>> common theme is that its default settings are rather conservative. As a
>>> result users are required to enable several additional optimizations by
>>> hand to get good code. Other compilers enable more optimizations at -O2
>>> (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should
>>> do as well.
>> 
>> There are some nuances to -O2. Please consider -O2 users who wish to use it
>> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
>> 
>> Clang/LLVM has an -Os that is like -O2, so optimisations that increase
>> code size can be skipped from -Os without drastically affecting
>> performance.
>> 
>> This is not the case with GCC, where -Os is a size-at-all-costs optimisation
>> mode. The GCC user's option for size, but not at the expense of speed, is to use -O2.
>> 
>> Clang    GCC
>> -Oz  ~=  -Os
>> -Os  ~=  -O2
>> 
> No. Clang's -Os is somewhat limited compared to gcc's, just like the clang 
> -Og 
> is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not in 
> clang proper.

It appears to be in mainline clang.

mclark@anarch128:~$ clang -Oz -c a.c -o a.o
mclark@anarch128:~$ clang -Ox -c a.c -o a.o
error: invalid integral value 'x' in '-Ox'
error: invalid integral value 'x' in '-Ox'
mclark@anarch128:~$ uname -a
Linux anarch128.org 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) 
x86_64 GNU/Linux
mclark@anarch128:~$ clang --version
clang version 3.8.1-24 (tags/RELEASE_381/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

I still think it would be unfortunate to lose the size/speed sweet spot of -O2 
by adding optimisations that increase code size, unless a size optimisation 
option were derived from -O2 at the point -O2 is souped up, i.e. create an 
-O2s (or rename -Os to -Oz and derive the new -Os from the current -O2).

I’m going to start looking into this to see what’s involved in making a 
patch. Distros that want a balance of size and speed might even pick it up, 
even if it is not accepted in mainline.

Re: [RFC] type promotion pass

2017-09-15 Thread Michael Clark

> On 16 Sep 2017, at 1:04 AM, David Edelsohn  wrote:
> 
> On Tue, Sep 5, 2017 at 5:26 AM, Prathamesh Kulkarni
>  wrote:
>> Hi,
>> I have attached revamped version of Kugan's original patch for type promotion
>> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00472.html)
>> rebased on r249469. The motivation of the pass is to minimize
>> generation of subregs
>> to avoid redundant zero/sign extensions by carrying out computations
>> in PROMOTE_MODE
>> as much as possible on tree-level.
>> 
>> * Working of pass
>> The pass is dominator-based, and tries to promote type of def to
>> PROMOTE_MODE in each gimple stmt. Before beginning domwalk, all the
>> default definitions are promoted to PROMOTE_MODE
>> in promote_all_ssa_defined_with_nop (). The patch adds a new tree code
>> SEXT_EXPR to represent sign-extension on tree level. CONVERT_EXPR is
>> either replaced by an explicit sign or zero extension depending on the
>> signedness of operands.
>> 
>> The core of the pass is in following two routines:
>> a) promote_ssa: This pass looks at the def of gimple_stmt, and
>> promotes type of def to promoted_type in-place if possible. If it
>> cannot promote def in-place, then
>> it transforms:
>> def = def_stmt
>> to
>> new_def = def_stmt
>> def = convert_expr new_def
>> where new_def is a clone of def, and type of def is set to promoted_type.
>> 
>> b) fixup_use: The main intent is to "fix" uses of a promoted variable
>> to preserve semantics
>> of the code, for instance if the variable is used in stmt where it's
>> original type is required.
>> Another case is when def is not promoted by promote_ssa, but some uses
>> could be promoted.
>> 
>> promote_all_stmts () is the driver function that calls fixup_use and
>> promote_ssa for each stmt
>> within the basic block. The pass relies extensively on dom and vrp to
>> remove redundancies generated by the pass and is thus enabled only if
>> vrp is enabled.
>> 
>> Issues:
>> 1] Pass ordering: type-promote pass generates too many redundancies
>> which can hamper other optimizations. One case I observed was on arm
>> when it inhibited path splitting optimization because the number of
>> stmts in basic block exceeded the value of
>> param-max-jump-thread-duplication-stmts. So I placed the pass just
>> before dom. I am not sure if this is a good approach. Maybe the pass
>> itself
>> should avoid generating redundant statements and not rely too much on
>> dom and vrp ?
>> 
>> 2] Redundant copies left after the pass: When it's safe, vrp optimzies
>> _1 = op1 sext op2
>> into
>> _1 = op1
>> 
>> which leaves redundant copies that are not optimized away at GIMPLE level.
>> I placed pass_copy_prop after vrp to eliminate these copies but not sure if 
>> that
>> is ideal. Maybe we should do this during vrp itself as this is the
>> only case that
>> generates redundant copies ?
>> 
>> 3] Backward phi's: Since we traverse in dominated order, fixup_use()
>> gets called first on a ssa_name that is an argument of a backward-phi.
>> If it promotes the type and later if promote_ssa() decides that the
>> ssa_name should not be promoted, then we have to again walk the
>> backward phi's to undo the "incorrect" promotion, which is done by
>> emitting a convert_expr back to the original type from promoted_type.
>> While I suppose it shouldn't affect correctness, it generates
>> redundant casts and was wondering if there's a better approach to
>> handle this issue.
>> 
>> * SPEC2k6 benchmarking:
>> 
>> Results of  benchmarking the patch for aarch64-linux-gnu cortex-a57:
>> 
>> Improved:
>> 401.bzip2  +2.11%
>> 459.GemsFDTD  +1.42%
>> 464.h264ref   +1.96%
>> 471.omnetpp  +1.05%
>> 481.wrf+0.99%
>> 
>> Regressed:
>> 447.dealII-1.34%
>> 450.soplex  -1.54%
>> 456.hmmer  -3.79%
>> 482.sphinx3 -2.95%
>> 
>> The remaining benchmarks didn't have much difference. However there
>> was some noise
>> in the above run, and I suppose the numbers are not precise.
>> I will give another run with benchmarking. There was no significant
>> difference in code-size with and without patch for SPEC2k6.
>> 
>> * Validation
>> 
>> The patch passes bootstrap+test on x86_64-unknown-linux-gnu,
>> arm-linux-gnueabihf,
>> aarch64-linux-gnu and ppc64le-linux-gnu. On arm-linux-gnueabihf, and
>> aarch64-linux-gnu, there is following fallout:
>> 
>> 1] gcc.dg/attr-alloc_size-11.c missing range info for signed char
>> (test for warnings, line 50)
>> The issue seems to be that we call reset_flow_sensitive_info(def) if
>> def is promoted and that invalidates value range which is probably
>> causing the regression.
>> 
>> 2] gcc.dg/fold-narrowbopcst-1.c scan-tree-dump optimized " = _.* + 156"
>> This requires adjusting the scan-dump.
>> 
>> On ppc64le-linux-gnu, I am observing several regressions in the
>> testsuite. Most of these seem to be because type-promotion is
>> interfering with other optimizations, especially widening_mul and
>> bswap pass. Also observed 

Re: [RFC] type promotion pass

2017-09-15 Thread Michael Clark

> On 16 Sep 2017, at 8:59 AM, Segher Boessenkool  
> wrote:
> 
> Hi!
> 
> On Sat, Sep 16, 2017 at 08:47:03AM +1200, Michael Clark wrote:
>> RISC-V defines promote_mode on RV64 to promote SImode to signed DImode 
>> subregisters.  I did an experiment on RISC-V to not promote SImode to DImode 
>> and it improved codegen for many of my regression test cases, but 
>> unfortunately it breaks the RISC-V ABI.
> 
> It sounds like you should implement TARGET_PROMOTE_FUNCTION_MODE as well?

riscv currently has default_promote_function_mode_always_promote.

gcc/config/riscv/riscv.c:#define TARGET_PROMOTE_FUNCTION_MODE default_promote_function_mode_always_promote

I see that default_promote_function_mode_always_promote just calls promote_mode

Is TARGET_PROMOTE_FUNCTION_MODE used to perform canonicalisation before calls 
and returns? i.e. would it be possible to have promote_mode as a no-op (leave 
SImode values as SImode in the RTL), but have TARGET_PROMOTE_FUNCTION_MODE 
perform promotions similar to our current PROMOTE_MODE definition i.e. is 
TARGET_PROMOTE_FUNCTION_MODE the hook that is used to canonicalise values 
*before* calls and *before* returns?

I’ll do some reading…

I was also curious about benchmarking the alternate ABI choice that leaves the 
upper bits undefined and does narrowing in the caller. It would be an ABI 
break, so it is a no-go, but I was curious about the performance of this 
option. Any 32-bit op causes narrowing on demand; it’s only the logical ops 
that would need to be explicitly narrowed in this alternate ABI. In any case I 
don’t think the RISC-V maintainers would accept an ABI break, however I’m 
curious as to the advantages and disadvantages of ABIs that do and don’t 
define the upper bits for narrower modes when passing values to and from 
functions.

Thanks,
Michael.

Re: Redundant sign-extension instructions on RISC-V

2017-09-18 Thread Michael Clark
Hi,

I’ve spun an alternative version of RISC-V GCC which preserves RV64 ABI sign 
extended canonicalisation for passed values but keeps values in SImode until 
return.

https://github.com/michaeljclark/riscv-gcc/commits/riscv-gcc-7-mc

These are the changes:

- #undef PROMOTE_MODE
- #define TARGET_PROMOTE_FUNCTION_MODE
- Implement riscv_function_promote to promote arguments and return values
  (a rough sketch follows this list)
- Add SImode logical op patterns to riscv.md
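
The rough sketch of riscv_function_promote below is an illustration only, not
the actual patch (the real implementation is in the branch linked above). It
assumes the function is installed as the TARGET_PROMOTE_FUNCTION_MODE hook, so
arguments and return values are still widened and sign-extended for the ABI
while PROMOTE_MODE stays undefined:

  /* Sketch: promote narrow integer arguments and return values to word_mode
     for the ABI, keeping SImode values sign-extended on RV64.  */
  static machine_mode
  riscv_function_promote (const_tree /* type */, machine_mode mode,
                          int *punsignedp, const_tree /* funtype */,
                          int /* for_return */)
  {
    if (GET_MODE_CLASS (mode) == MODE_INT
        && GET_MODE_SIZE (mode) < UNITS_PER_WORD)
      {
        if (mode == SImode)
          *punsignedp = 0;  /* 32-bit values are passed sign-extended.  */
        return word_mode;
      }
    return mode;
  }

  #undef  TARGET_PROMOTE_FUNCTION_MODE
  #define TARGET_PROMOTE_FUNCTION_MODE riscv_function_promote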

The RTL for the sign extension regression test cases now contains all SImode 
ops; however, REE is still not able to remove sign_extend after and/ior/xor/not 
with SImode inputs that are in canonical form. I’m still seeing shifts not 
being coalesced, as this likely happens in combine, and the redundant sign 
extensions are not eliminated until much later, in REE, with this particular 
configuration.

Observations:

- Aarch64 leaves upper bits undefined instead of canonicalising them and
  PROMOTE_MODE promotes QImode and HImode to SImode.
- RISC-V defines explicit sign_extend + op pattern matches in its metadata to
  collapse redundant sign extends.
- Explicit _extend metadata patterns appear to be required for REE to
  eliminate redundant sign_extend.
  - I removed these patterns and then all functions returning int had:
    sext.w a0, a0; ret
- Explicit _extend metadata patterns span only two ops
  - e.g. (sign_extend:DI (plus:SI)), (sign_extend:DI (minus:SI)),
    (sign_extend:DI (any_shift:SI)), etc.
- REE may not be able to cross enough transitive dependencies to handle
  sign_extend and/ior/xor/not ops with canonical inputs
  - e.g. (sign_extend:DI (logical_op:SI (plus_minus_shift:SI)
    (plus_minus_shift:SI)))
  - e.g. (sign_extend:DI (logical_op:SI (logical_op:SI
    (logical_op:SI (plus_minus_shift:SI) (plus_minus_shift:SI)
  - the metadata pattern matching approach is untenable where there is a
    series of logical ops on canonical inputs, e.g. the bswap test case
- Shift coalescing is done before REE in combine so this may be why we are
  seeing missed optimisations.
  - there may be more optimisations that are missed due to the sign_extend
    promotions changing patterns
- Keeping values in SImode until return doesn’t appear to help much however
  it may change how the problem is solved.

Actions:

- Backport Prathamesh’s type promotion patch to riscv-gcc. It may suit the
  eager definition of PROMOTE_MODE.
- Benchmark the alternative ABI that leaves upper bits undefined, for
  curiosity’s sake…

I’ve been reasoning about the alternate ABI. I can see it has the cost of 
widening narrower return values when they are promoted to wider types, but the 
general case of adding integers to pointers would not be affected when the code 
is inlined. Deferring canonicalisation when there are several routines that 
return ints and take ints as inputs would seem to result in fewer operations. 
I’d really like to see SPECint run on the alternate ABI. However, the current 
ABI could be improved if the compiler were better able to reason about cases 
where the sign_extend operations don’t need to be emitted, so it would not be 
a fair test against the current version of the port.

Michael.

> On 30 Aug 2017, at 12:36 PM, Michael Clark  wrote:
> 
> Dear GCC folk,
> 
> 
> # Issue Background
> 
> We’re investigating an issue with redundant sign-extension instructions being 
> emitted with the riscv backend. Firstly I would like to state that riscv is 
> possibly a unique backend with respect to its canonical sign-extended 
> register form due to the following:
> 
> - POINTERS_EXTEND_UNSIGNED is false, and the canonical form after 32-bit 
> operations on RV64 is sign-extended not zero-extended.
> 
> - PROMOTE_MODE is defined on RV64 such that SI mode values are promoted to 
> signed DI mode values holding SI mode subregs
> 
> - RISC-V does not have register aliases for these different modes, rather 
> different instructions are selected to operate on different register widths.
> 
> - The 32-bit instructions sign-extend results. e.g. all shifts, add, sub, etc.
> 
> - Some instructions such as logical ops only have full word width variants 
> (AND, OR, XOR) but these instructions maintain canonical form as there is no 
> horizontal movement of bits.
> 
> 
> # Issue Details
> 
> During porting from GCC 6.1 to GCC 7.1, the RISC-V port was changed to define 
> TRULY_NOOP_TRUNCATION to 1 and PROMOTE_MODE was set up to promote values to 
> wider modes. This was done to ensure ABI correctness so that registers 
> holding narrower modes always contain canonical sign extended values.
> 
> The result of this is that sign_extend operations are inserted but later 
> eliminated in ree/simplyfy_rtx. In testcase 3 the sign_ex

Re: Register Allocation Graph Coloring algorithm and Others

2017-12-18 Thread Michael Clark
Hi Leslie,

I suggest adding these 3 papers to your reading list.

Register allocation for programs in SSA-form
Sebastian Hack, Daniel Grund, and Gerhard Goos
http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf

Simple and Efficient Construction of Static Single Assignment Form
Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa , 
Christoph Mallon , and Andreas Zwinkau
https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

Optimal register allocation for SSA-form programs in polynomial time
Sebastian Hack, Gerhard Goos
http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf

A lot of the earlier literature regarding the register allocation problem 
describes the general graph colouring problem as NP-complete; however, previous 
research in register allocation has been in the context of programs that were 
not in SSA form, i.e. the Chaitin-Briggs paper states that register allocation 
is NP-complete and proposes an iterative algorithm.

If one reads Hack and Goos, there is the discovery that programs in SSA form 
have chordal interference graphs, and colouring a chordal graph can be 
completed in polynomial time. After the cost of building the interference 
graph, it is possible to perform register allocation in a single pass. The key 
is in not modifying the graph.

If one has a frequency for each basic block, then one can sort basic blocks by 
frequency, allocating the highest-frequency blocks first and, where possible, 
assigning the same physical register to all virtual registers in each phi node 
(to avoid copies). At program points where the live set is larger than k, the 
number of physical registers, one spills the register with the largest distance 
to its next use. At least that is how I am thinking about this problem. I am 
also a novice with regards to register allocation.
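
As a toy illustration of the chordal-colouring idea above (not code from rv8 or
GCC; the graph, the elimination order and k are made up), greedy colouring in a
perfect elimination order, which is optimal for chordal interference graphs:

  #include <cstdio>
  #include <set>
  #include <vector>

  // Greedy colouring of an interference graph visited in a perfect elimination
  // order (e.g. derived from SSA dominance).  For a chordal graph this uses the
  // minimum number of colours; a vreg that cannot be coloured within k colours
  // becomes a spill candidate (-1).
  static std::vector<int> colour(const std::vector<std::vector<int>> &adj,
                                 const std::vector<int> &peo, int k)
  {
      std::vector<int> col(adj.size(), -1);
      for (int v : peo) {
          std::set<int> used;
          for (int n : adj[v])
              if (col[n] >= 0) used.insert(col[n]);
          int c = 0;
          while (used.count(c)) c++;   // lowest colour unused by any neighbour
          col[v] = c < k ? c : -1;     // -1 marks a spill candidate
      }
      return col;
  }

  int main()
  {
      // triangle: three vregs simultaneously live, k = 2 physical registers
      std::vector<std::vector<int>> adj = { {1, 2}, {0, 2}, {0, 1} };
      std::vector<int> peo = { 0, 1, 2 };
      for (int c : colour(adj, peo, 2)) printf("%d\n", c);  // prints 0 1 -1
  }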

I intend to develop a register allocator for use in this JIT engine:

rv8: a high performance RISC-V to x86 binary translator
Michael Clark, Bruce Hoult
https://carrv.github.io/papers/clark-rv8-carrv2017.pdf

Our JIT already performs almost twice as fast as QEMU while using static 
register allocation, and QEMU, I believe, has a register allocator. We are 
mapping a 31-register RISC architecture to a 16-register CISC architecture. I 
have statistics on the current slow-down, and it appears to mainly be L1 
accesses to spilled registers. I believe after we have implemented a register 
allocator in our JIT we will be able to run RISC-V binaries at near-native 
performance on x86-64. In fact, given we are performing runtime profile-guided 
optimisation, it may even be possible for some benchmarks to run faster than 
register allocations that are based on static estimates of loop frequencies.

https://anarch128.org/~mclark/rv8-slides.pdf
https://rv8.io/bench

We’ve still got a long way to go to get to ~1:1 performance for RISC-V on 
x86-64, but I think it is possible, at least for some codes…

Regards,
Michael.

> On 15/12/2017, at 4:18 PM, Leslie Zhai  wrote:
> 
> Hi GCC and LLVM developers,
> 
> I am learning Register Allocation algorithms and I am clear that:
> 
> * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg (hard)
> 
> * Memory (20 - 100 cycles) is expensive than Register (1 cycle), but it has 
> to spill code when PhysReg is unavailable
> 
> * Folding spill code into instructions, handling register coallescing, 
> splitting live ranges, doing rematerialization, doing shrink wrapping are 
> harder than RegAlloc
> 
> * LRA and IRA is default Passes in RA for GCC:
> 
> $ /opt/gcc-git/bin/gcc hello.c
> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
> 
> * Greedy is default Pass for LLVM
> 
> But I have some questions, please give me some hint, thanks a lot!
> 
> * IRA is regional register allocator performing graph coloring on a top-down 
> traversal of nested regions, is it Global? compares with Local LRA
> 
> * The papers by Briggs and Chaiten contradict[2] themselves when examine the 
> text of the paper vs. the pseudocode provided?
> 
> * Why  interference graph is expensive to build[3]?
> 
> And I am practicing[4] to use HEA, developed by Dr. Rhydian Lewis, for LLVM 
> firstly.
> 
> 
> [1] https://reviews.llvm.org/D39712
> 
> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
> 
> [3] https://github.com/joaotavio/llvm-register-allocator
> 
> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
> 
> -- 
> Regards,
> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
> 
> 
> 



Re: RISC-V ELF multilibs

2018-06-01 Thread Michael Clark



> On 1/06/2018, at 6:16 PM, Sebastian Huber 
>  wrote:
> 
> On 29/05/18 20:02, Jim Wilson wrote:
>>> Most variants include the C extension. Would it be possible to add 
>>> -march=rv32g and -march=rv64g variants?
>> 
>> The expectation is that most implementations will include the C extension.  
>> It reduces code size, improves performance, and I think I read somewhere 
>> that it takes only 400 gates to implement.
>> 
>> It isn't practical to try to support every possible combination of 
>> architecture and ABI here, as there are too many possible combinations. But 
>> if there is a major RISC-V target that is rv32g or rv64g then we should 
>> consider it.  You can of course define your own set of multilibs.
> 
> I am not a hardware developer, but I heard that the C extension is not easy 
> to implement in out of order machines.

Easy is relative.

The RISC-V ISA encoding has been designed so that wide fetch is relatively 
easy, possibly an order of magnitude easier than an ISA with a complex variable 
length encoding like x86. 

RV64GC instruction lengths can be derived from the 2 least significant bits in 
the first half-word packet of each instruction for mixed 16/32 bit 
instructions. It does not require an attempt to parse multiple prefixes with 
instructions ranging from 1 byte up to 15 bytes, before arriving at an 
instruction length (x86). From that perspective RV64GC is decidedly simple. 
That said, RV64GC is a little more complex than regular 32-bit encodings like 
RV64G, PowerPC, AArch64, or SPARC.
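
To illustrate the fetch rule just described (a sketch; the helper name is made
up):

  #include <cstdint>

  // For the 16/32-bit encodings present in RV64GC: if the two low bits of the
  // first 16-bit parcel are not 0b11 the instruction is a 2-byte compressed
  // encoding, otherwise it is a standard 4-byte encoding (longer encodings are
  // reserved and not used by RV64GC).
  static inline int rv64gc_inst_length(uint16_t parcel)
  {
      return (parcel & 0x3) != 0x3 ? 2 : 4;
  }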

I don’t think the paper you have linked to draws the conclusion you are 
arriving at. I spoke to Chris Celio about RVC and he said he just didn’t get 
around to implementing RVC in BOOM; its absence is not necessarily a statement 
of its difficulty but rather of the academic objectives of BOOM, one of a very 
small number of open-source OoO processors. Sure, if it were “trivial” then 
BOOM would include it, so it’s certainly non-trivial.

For BOOM, RVC requires an instruction buffer that handles unaligned fetches, 
and the way the PC is derived further down the pipeline needs to be modified, 
as the same micro-op may have a different width depending on whether it was 
expanded from an RVC encoding. It may or may not need an extra pre-decoder pipe 
stage. I have a fair understanding of what’s required. The misaligned fetch is 
probably the most complex part, but BOOM already needs to handle complex cases 
like the fetch buffer spanning cache lines and page boundaries. Unaligned 
instruction fetches add to this.

> For example:
> 
> https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html
> 
> -- 
> Sebastian Huber, embedded brains GmbH
> 
> Address : Dornierstr. 4, D-82178 Puchheim, Germany
> Phone   : +49 89 189 47 41-16
> Fax : +49 89 189 47 41-09
> E-Mail  : sebastian.hu...@embedded-brains.de
> PGP : Public key available on request.
> 
> Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.
> 


Re: [RFC] Adding Python as a possible language and it's usage

2018-07-19 Thread Michael Clark



On 20/07/2018, at 4:12 AM, Richard Earnshaw (lists)  wrote:

> On 19/07/18 12:30, Florian Weimer wrote:
>> * Segher Boessenkool:
>> 
>>> What would the advantage of using Python be?  I haven't heard any yet.
>>> Awk may be a bit clunky but at least it is easily readable for anyone.
>> 
>> I'm not an experienced awk programmer, but I don't think plain awk
>> supports arrays of arrays, so there's really no good way to emulate
>> user-defined data types and structure the code.
>> 
> 
> You can do multi-dimentional arrays in awk.  They're flattened a bit
> like tcl ones are, but there are ways of iterating over a dimention.
> See, for example, config/arm/parsecpu.awk which gets up to tricks like that.

Has it occurred to anyone to write a lean and fast tool in the host C++ subset 
that is allowable for use in the host tools? I guess this is currently 
C++98/GNU++98. This adds no additional dependencies. Sure, it is a slightly 
higher level of effort than writing an awk script, but with modern C++ this is 
less the case than it has ever been in the past. I personally use C++11/14 as a 
substitute for Python-type programs that would normally be considered 
script-language material, mostly due to fluency and the fact that modern C++ 
has grown more tractable as a replacement for “fast to code in” languages, 
given it is much faster to code in than C.

LLVM discussed changing the host compiler language feature dependencies to 
C++11/C++14. There are obvious but not insurmountable bootstrap requirements. 
i.e. for very old systems it will require an intermediate C++11/C++14 compiler 
to bootstrap LLVM 6.0. Here is LLVM's new compiler baseline and it seems to 
require CentOS 7.

- Clang 3.1
- GCC 4.8
- Visual Studio 2015 (Update 3)

[1] https://llvm.org/docs/GettingStarted.html#getting-a-modern-host-c-toolchain

I find I can be very productive and nearly as concise in C++11 as I can in many 
script languages due to modern versions of the standard library headers 
(containers, strings, smart pointers and friends), auto, lambdas, etc. It’s 
relatively easy to write memory-clean code from the get-go using 
std::unique_ptr<> and sparing amounts of std::shared_ptr<>, and the new C++11 
range-based for loops, initializer lists and various other enhancements can 
make coding in “modern C++” relatively friendly and productive. In the words of 
Bjarne Stroustrup: it feels like a new language. I can point to examples of 
small text processing utilities that I’ve written that could have been written 
in Python with a relatively similar amount of effort. It comes down to fluency 
with the STL and the use of lean idioms: the STL plus plain structs (data 
hiding is only a header tweak away from being broken in a language like C++, 
and the use of plain structs is similar to a language like Python, which 
resorts to underscores or “idiomatic enforcement”). i.e. there are lightweight, 
fast and productive modern C++ idioms that work well with vectors, sets, maps 
and unique_ptr or shared_ptr automatic memory management. I find that with 
modern idioms these programs are almost always valgrind-clean.
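
For example, a small sketch in that style (a made-up utility, not an existing
GCC script): count identifier occurrences in a file using the containers and
RAII mentioned above.

  #include <cctype>
  #include <cstdio>
  #include <map>
  #include <memory>
  #include <string>

  int main(int argc, char **argv)
  {
      if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
      std::unique_ptr<FILE, int (*)(FILE *)> f(fopen(argv[1], "r"), fclose);
      if (!f) { perror(argv[1]); return 1; }
      std::map<std::string, size_t> counts;
      std::string word;
      int c;
      while ((c = fgetc(f.get())) != EOF) {
          if (isalnum(c) || c == '_') word.push_back((char)c);
          else if (!word.empty()) { counts[word]++; word.clear(); }
      }
      if (!word.empty()) counts[word]++;
      for (const auto &kv : counts)
          printf("%zu\t%s\n", kv.second, kv.first.c_str());
      return 0;
  }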

Would modern C++ (for generated files) be considered for GCC 9? The new idioms 
may make parts of the code considerably more concise and could allow use of 
some of the new automatic memory management features. The reason I’m suggesting 
this is that for anything other than a trivial command line invocation of sed 
or awk, I would tend to write a modern C++ program to do text processing rather 
than use a script language like Python. Firstly it is faster; secondly I am 
proficient enough, and the set and map functionality combined with the new 
automatic memory management is sufficient that complex nested data structures 
and text processing can be handled with relative ease. Note: I do tend to avoid 
iostream and instead use the C FILE * API with fopen/fread/fwrite/fclose, or 
POSIX open/read/write/close if I want to avoid buffering. I find iostream 
performance is not that great.

How conservative are we? Is C++11 going to be available for use in GCC before 
C++2x in 202x? Indeed, the newer standard library facilities would improve some 
of the Windows/UNIX interoperability. I’ve found that writing C++11/14 allows 
me to write in an idiomatic C/C++ subset that is quite stable across platforms. 
We now even have much of it on Windows. There has been quite a bit of 
convergence.

Having the constraint that modern C++11/14 can only be used for generated files 
lessens the burden as the distribution build can maintain the same base 
compiler dependencies.

Michael.

Re: That light at the end of the tunnel?

2018-07-30 Thread Michael Clark


> On 31/07/2018, at 12:47 AM, Eric S. Raymond  wrote:
> 
> Richard Biener :
>> Ok, so let me ask whether you can currently convert trunk and 
>> gcc-{6,7,8}-branch
>> successfully, ignoring "merges" into them (shouldn't have happened).  All 
>> other
>> branches can in theory be converted later if required, right?
> 
> Trunk conversionm is currently broken.  That's what makes this serious 
> problem;
> it was working before.

Didn’t you mention the trunk conversion issue was due to a bug introduced into 
reposurgeon at some date recently as a fix for some other conversion?

Can you not roll back to a version of reposurgeon used for the last successful 
conversion? Assuming you know the date of the last successful conversion, then 
you can use the reposurgeon version from that date.

What were the problems in the last “successful” conversion that led to the 
subsequent changes in reposurgeon?

Were the conversion problems in the last successful conversion major enough 
that the resultant tree was not usable for trunk and the major development 
branches: gcc6, gcc7 and gcc8 branches? Or were the issues on older branches?

If you are bisecting reposurgeon then would you not search for the date of the 
message when you had reported successful conversion here on this list (with 
whatever remaining issues that version had), and use the version of reposurgeon 
from that date for your next trial conversion?

Finding the last known good version of reposurgeon is the problem, right? (with 
whatever outstanding issues that version had with the conversion).

[RFC] zip_vector: in-memory block compression of integer arrays

2022-08-16 Thread Michael Clark via Gcc

Hi Folks,

This is an edited version of a message posted on the LLVM Discourse.

I want to share what I have been working on as I feel it may be of 
interest to the GCC compiler developers, specifically concerning alias 
analysis and optimizations for iteration of sparse block-based 
multi-arrays. I also have questions about optimization related to this 
implementation, specifically the observability of alias analysis 
pessimization and memory to register optimizations.


I have been working on _zip_vector_. _zip_vector_ is a compressed 
variable length array that uses vectorized block codecs to compress and 
decompress integers using dense variable bit width deltas as well as 
compressing constant values and sequences. _zip_vector_ employs integer 
block codecs optimized for vector instruction sets using the Google 
Highway C++ library for portable SIMD/vector intrinsics.


The high-level class supports 32-bit and 64-bit compressed integer arrays:

 - `zip_vector<i32>`
   - { 8, 16, 24 } bit signed and unsigned fixed-width values.
   - { 8, 16, 24 } bit signed deltas and per block IV.
   - constants and sequences using per block IV and delta.
 - `zip_vector<i64>`
   - { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
   - { 8, 16, 24, 32, 48 } bit signed deltas with per block IV.
   - constants and sequences using per block IV and delta.

Here is a link to the implementation:

- https://github.com/metaparadigm/zvec/

The README has a background on the delta encoding scheme. If you read 
the source, "zvec_codecs.h" contains the low-level vectorized block 
codecs while "zvec_block.h" contains a high-level interface to the block 
codecs using cpuid-based dynamic dispatch. The high-level sparse integer 
vector class leveraging the block codecs is in "zip_vector.h". It has 
been tested with GCC and LLVM on x86-64 using SSE3, AVX, and AVX-512.


The principle of operation is to employ simplified block codecs 
dedicated to compressing only fixed-width integers, which are thus 
extremely fast, unlike typical compression algorithms. They run _on the 
order of 30-150 GiB/sec_ on a single core when running within the L1 
cache on Skylake AVX-512. From this perspective, zip_vector achieves its 
performance by reducing global memory bandwidth because it fetches and 
stores compressed data to and from RAM and then uses extremely fast 
vector codecs to pack and unpack compressed blocks within the L1 cache. 
From this perspective, it is similar to texture compression codecs, but 
the specific use case is closer to storage for index arrays because the 
block codecs are lossless integer codecs. The performance is striking in 
that it can be faster for in-order read-only traversal than a regular 
array, while the primary goal is footprint reduction.


The design use case is an offsets array that might contain 64-bit values 
but usually contains smaller values. From this perspective, we wanted 
the convenience of simply using `zip_vector<i32>` or `zip_vector<i64>` 
while benefiting from the space advantages of storing data using 8, 16, 
24, 32, and 48-bit deltas.
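
To make the intended use concrete, a minimal usage sketch (assuming the
push_back / iterator interface from the repository linked above; exact API
names are not restated here and may differ):

  #include <cstdio>
  #include "zip_vector.h"

  int main()
  {
      zip_vector<i64> offsets;
      for (i64 i = 0; i < 1000000; i++)
          offsets.push_back(i * 7);      // small deltas pack into narrow codes

      i64 sum = 0;
      for (auto x : offsets) sum += x;   // 1D in-order traversal
      printf("%lld\n", (long long)sum);
      return 0;
  }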


Q. Why is it specifically of interest to GCC developers?

I think the best way to answer this is with questions. How can we model 
a block-based iterator for a sparse array that is amenable to vectorization?


There are aspects to the zip_vector iterator design that are *not done 
yet* concerning its current implementation. The iteration has two 
phases. There is an inter-block phase at the boundary of each block (the 
logic inside of `switch_page`) that scans and compresses the previously 
active block, updates the page index, and decompresses the next block. 
Then there is a _broad-phase_ for intra-block accesses, which is 
amenable to vectorization due to the use of fixed-size blocks.


*Making 1D iteration as fast as 2D iteration*

Firstly I have to say that there is a lot of analysis for the 
optimization of the iterator that I would like to discuss. There is the 
issue of hoisting the inter-block boundary test from the fast path so 
that during block boundary traversal, subsequent block endings are 
calculated in advance so that the broad phase only requires a pointer 
increment and comparison with the addresses held in registers.


The challenge is getting past compiler alias analysis. Alias analysis 
seems to prevent caching of the sum of the slab base address and active 
area offset in a register versus being demoted to memory accesses. These 
member variables hold the location of the slab and the offset to the 
uncompressed page which are both on the critical path. When these values 
are in memory, _it adds 4 or more cycles of latency for base address 
calculation on every access_. There is also the possibility to hoist and 
fold the active page check as we know we can make constructive proofs 
concerning changes to that value.


Benchmarks compare the performance of 1D and 2D style iterators. At 
certain times the compiler would hoist the base and offset point

Re: [RFC] zip_vector: in-memory block compression of integer arrays

2022-08-17 Thread Michael Clark via Gcc

On 17/08/22 7:10 pm, Richard Biener wrote:


Q. Why is it specifically of interest to GCC developers?

I think the best way to answer this is with questions. How can we model
a block-based iterator for a sparse array that is amenable to vectorization?

There are aspects of the zip_vector iterator design that are *not done
yet* in its current implementation. The iteration has two
phases. There is an inter-block phase at the boundary of each block (the
logic inside of `switch_page`) that scans and compresses the previously
active block, updates the page index, and decompresses the next block.
Then there is a _broad-phase_ for intra-block accesses, which is
amenable to vectorization due to the use of fixed-size blocks.
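
To make the two phases concrete, here is a hedged, self-contained model of
the access pattern described above. `block_vec` is an illustrative
stand-in, not the real zip_vector API, and the codec calls are elided:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct block_vec {
      static const size_t page_interval = 512;    // fixed block size
      std::vector<int64_t> page;                  // decompressed active block
      size_t active_page = (size_t)-1;

      void switch_page(size_t y) {
          // inter-block phase: compress the previously active block, update
          // the page index, decompress block y (codec calls elided)
          active_page = y;
          page.assign(page_interval, 0);
      }
      int64_t &operator[](size_t idx) {
          size_t y = idx / page_interval;
          if (y != active_page) switch_page(y);   // boundary (narrow) phase
          return page[idx % page_interval];       // intra-block (broad) phase
      }
  };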

*Making 1D iteration as fast as 2D iteration*

Firstly, there is a lot of analysis around optimizing the iterator that I
would like to discuss. One issue is hoisting the inter-block boundary test
out of the fast path, so that during block boundary traversal subsequent
block endings are calculated in advance and the broad phase only requires a
pointer increment and a comparison against addresses held in registers.

The challenge is getting past compiler alias analysis, which seems to
prevent the sum of the slab base address and the active area offset from
being cached in a register, demoting it to memory accesses instead. These
member variables hold the location of the slab and the offset to the
uncompressed page, both of which are on the critical path. When these values
are in memory, _it adds 4 or more cycles of latency to the base address
calculation on every access_. There is also the possibility of hoisting and
folding the active page check, since we can make constructive proofs about
when that value changes.

Benchmarks compare the performance of 1D and 2D style iterators. At
certain times the compiler would hoist the base and offset pointers from
member variable accesses into registers in the 1D version, making a
noticeable difference in performance. From the perspective of
single-threaded code, the only way the pointer to the active region can
change is inside `switch_page(size_t y)`.

The potential payoff is huge because one may be able to access data at
~0.9x-3.5x the speed of simply accessing integers in RAM when combining
the reduction in global memory bandwidth with auto-vectorization, but the
challenge is generating safe code for the simpler 1D iteration case that
is as efficient as explicit 2D iteration.

1D iteration:

  for (auto x : vec) x2 += x;

2D iteration:

  for (size_t i = 0; i < n; i += decltype(vec)::page_interval) {
      i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
      while (cur != end) x2 += *cur++;
  }

Note: in this example I avoid having a loop tail of a different size, but
that is also a consideration; a sketch handling the tail follows below.
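
For completeness, a hedged sketch of the 2D loop with a shorter tail
handled via the 1D path (same assumptions as the snippets above):

  size_t i = 0, whole = n - n % decltype(vec)::page_interval;
  for (; i < whole; i += decltype(vec)::page_interval) {
      i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
      while (cur != end) x2 += *cur++;
  }
  for (; i < n; i++) x2 += vec[i];    // remainder smaller than one block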

I trialled several techniques using a simplified version of the
`zip_vector` class where `switch_page` was substituted with simple logic, so
that it was possible to get the compiler to coalesce the slab base pointer
and active area offset into a single calculation upon page crossings. There
is also hoisting of the active_page check (_y-parameter_) so that it only
occurs on block crossings. I found that when the `switch_page`
implementation became more complex, most likely due to an extern call to
`malloc`, the compiler would fall back to conservatively fetching the base
pointer and offset calculation through a pointer to a member variable. See
here:

https://github.com/metaparadigm/zvec/blob/756e583472028fcc36e94c0519926978094dbb00/src/zip_vector.h#L491-L496

So I got to the point where I thought it would help to get input from
compiler developers to figure out how to observe which internal constraints
are violated by "switch_page", preventing the base pointer and offset
address calculation from being cached in registers. "slab_data" and
"active_area" are neither volatile nor atomic, so other threads cannot
expect their updates to be atomic or to go through memory, and the compiler
should be free to keep them in registers.
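
The following hedged, self-contained illustration shows the shape of the
problem. `view` models the two members named above plus an opaque
`switch_page` that may call `malloc`; in `sum_1d` the possible call inside
the element loop makes the compiler recompute the member-derived base
address from memory, while `sum_2d` hoists it once per block, which is what
one would like the compiler to prove safe for the 1D form (the block size
and names are illustrative; n is assumed to be a multiple of the block size):

  #include <cstddef>
  #include <cstdint>

  struct view {
      char  *slab_data;                 // member names from the text above
      size_t active_area;
      void   switch_page(size_t y);     // defined elsewhere; may call malloc
      static const size_t page_interval = 512;
  };

  int64_t sum_1d(view &v, size_t n) {
      int64_t s = 0;
      for (size_t i = 0; i < n; i++) {
          if (i % view::page_interval == 0)
              v.switch_page(i / view::page_interval);  // opaque call in loop
          // slab_data and active_area are re-read here on every iteration
          // because the call above may have changed them
          s += ((int64_t *)(v.slab_data + v.active_area))[i % view::page_interval];
      }
      return s;
  }

  int64_t sum_2d(view &v, size_t n) {
      int64_t s = 0;
      for (size_t i = 0; i < n; i += view::page_interval) {
          v.switch_page(i / view::page_interval);
          int64_t *base = (int64_t *)(v.slab_data + v.active_area);  // hoisted
          for (size_t j = 0; j < view::page_interval; j++) s += base[j];
      }
      return s;
  }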

I tried a large number of small experiments, e.g. collapsing "slab_data"
and "active_area" into one pointer at the end of "switch_page" so that only
one pointer needs to be accessed. Also, the "active_page" test doesn't
necessarily need to be in the broad phase. I attempted to manually hoist
these variables by modifying the iterators, but found it was necessary to
keep them where they were to avoid introducing stateful invariants in the
iterators that could be invalidated by read accesses.

Stack/register-based coroutines could help due to the two distinct
states in the iterator.

I want to say that it is not as simple as one might think on the surface. I
tried several approaches to coalesce address calculations and move them into
the page switch logic, all leading to performance fall-off, almost like the
compiler was carrying some pessimization that forced touched member
variables to be accessed via memory.

Re: stack arenas using alloca

2024-08-14 Thread Michael Clark via Gcc

Hi Folks,

*sending again with Thunderbird because Apple Mail munged the message*.

I wanted to share a seed of an idea I have been ruminating on for a 
while, and that is being able to return alloca memory from a function.


I think it’s trivially possible by hacking the epilogue to unlink the frame 
pointer but not pop the stack. Stack traces should work fine, and it just 
looks like the parent function called an alloca which covers the fixed child 
frame plus the alloca that it returned.


I tend to use alloca quite a lot in my programming style because I like to 
avoid the heap. I have a style where I run a loop once as a size pass with 
writes suppressed, then call alloca, then run the loop again with the writes 
enabled (a sketch follows below). I would like to be able to return, 
initially, one alloca to a parent function, but more than one should also be 
possible.
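
For example, a hedged sketch of this style, using snprintf's size-query
mode as the suppressed-write pass (the function and its names are
illustrative, not from an existing code base):

  #include <alloca.h>
  #include <stdio.h>

  void print_joined(const int *v, size_t n)
  {
      size_t len = 1;                                /* trailing NUL */
      for (size_t i = 0; i < n; i++)                 /* size pass */
          len += (size_t)snprintf(NULL, 0, "%d,", v[i]);

      char *buf = (char *)alloca(len);               /* one stack allocation */
      buf[0] = 0;
      size_t off = 0;
      for (size_t i = 0; i < n; i++)                 /* write pass */
          off += (size_t)snprintf(buf + off, len - off, "%d,", v[i]);
      puts(buf);
  }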


I haven't thought a lot about the mechanics of how one might signal to 
the compiler to suppress popping the stack, but I imagined one could 
return a pointer or structure with _Stack qualifiers to signal that 
stack memory from the child frame is being returned.


This is just a quick note, as there is a deeper conversation about 
compacting stack arenas that would be possible if one had a runtime *and* 
compile-time reflection API for frame metadata, where the frame is expressed 
as an anonymous C structure, perhaps accessible via a function on a 
reference to a function. If you had frame metadata, you could emit one or 
many memmove calls. Trivially, one just needs the fixed frame length to 
recover the fixed frame stack memory, and with full frame reflection it 
would be possible to defer to a runtime routine in the case where two or 
more stack pointers are returned, assuming the compiler tracks the sizes of 
alloca calls. Ideally, the reflection API would also be constexpr-able, 
meaning the runtime compaction could be inlined. This won't work for types 
that contain pointers unless one has a full single-threaded compacting GC 
that uses type metadata, but I tend to use indices. And unlike heap 
compaction, it doesn't have to worry about races.


The benefits of stack arenas, and of compacting stack arenas as a subtype, 
are many. Firstly, the memory has no thread-aliasing issues, unlike heap 
memory, so analysis is much simpler and there are none of the pause issues 
that GC has. Secondly, it would be quite a neat way to constexpr 
variable-length arrays or a _List type because, unlike malloc, it’s much 
easier to reason about.


I have a relatively complete C reflection API that could be used as a 
starting point. It currently works as an LLVM plugin; I am not sure how I 
would do this in GCC. A Python parser came to mind, but it would be better 
if it somehow ran inside the compiler. It would also be great to have access 
to the frame of a function as an anonymous structure. It’s runtime-only at 
present, and it could benefit from a for-comprehension because looping over 
variable-length lists is a little clumsy. If I could return alloca memory in 
a structure, that would largely solve this problem.


- https://github.com/michaeljclark/crefl

Regards,
Michael


Re: stack arenas using alloca

2024-08-22 Thread Michael Clark via Gcc

On 8/15/24 06:24, Michael Clark wrote:

Hi Folks,

*sending again with Thunderbird because Apple Mail munged the message*.

I wanted to share a seed of an idea I have been ruminating on for a 
while, and that is being able to return alloca memory from a function.


I think it’s trivially possible by hacking the epilogue to unlink the frame
pointer but not pop the stack. Stack traces should work fine, and it just
looks like the parent function called an alloca which covers the fixed child
frame plus the alloca that it returned.


I tend to use alloca quite a lot in my programming style because I like to
avoid the heap. I have a style where I run a loop once as a size pass with
writes suppressed, then call alloca, then run the loop again with the writes
enabled. I would like to be able to return, initially, one alloca to a
parent function, but more than one should also be possible.


I haven't thought a lot about the mechanics of how one might signal to 
the compiler to suppress popping the stack, but I imagined one could 
return a pointer or structure with _Stack qualifiers to signal that 
stack memory from the child frame is being returned.


This is just a quick note, as there is a deeper conversation about
compacting stack arenas that would be possible if one had a runtime *and*
compile-time reflection API for frame metadata, where the frame is expressed
as an anonymous C structure, perhaps accessible via a function on a
reference to a function. If you had frame metadata, you could emit one or
many memmove calls. Trivially, one just needs the fixed frame length to
recover the fixed frame stack memory, and with full frame reflection it
would be possible to defer to a runtime routine in the case where two or
more stack pointers are returned, assuming the compiler tracks the sizes of
alloca calls. Ideally, the reflection API would also be constexpr-able,
meaning the runtime compaction could be inlined. This won't work for types
that contain pointers unless one has a full single-threaded compacting GC
that uses type metadata, but I tend to use indices. And unlike heap
compaction, it doesn't have to worry about races.


The benefits of stack arenas, and of compacting stack arenas as a subtype,
are many. Firstly, the memory has no thread-aliasing issues, unlike heap
memory, so analysis is much simpler and there are none of the pause issues
that GC has. Secondly, it would be quite a neat way to constexpr
variable-length arrays or a _List type because, unlike malloc, it’s much
easier to reason about.


I have a relatively complete C reflection API that could be used as a
starting point. It currently works as an LLVM plugin; I am not sure how I
would do this in GCC. A Python parser came to mind, but it would be better
if it somehow ran inside the compiler. It would also be great to have access
to the frame of a function as an anonymous structure. It’s runtime-only at
present, and it could benefit from a for-comprehension because looping over
variable-length lists is a little clumsy. If I could return alloca memory in
a structure, that would largely solve this problem.


- https://github.com/michaeljclark/crefl


It seems to work. I am hacking on LLVM; here are some notes.

- the calling convention could return with the new stack pointer covering 
the dynamic area, or it could return with the dynamic delta in a spare 
register for the caller to issue an alloca. I prefer that the stack pointer 
cover the dynamic region on return because it makes the boundary clear for 
functions returning alloca memory: the stack pointer always covers the 
extent of stack-allocated memory.


- the epilogue code that subtracts the dynamic stack offset needs to not do 
that, which slightly changes the epilogue register-restore code if it uses 
pop instructions, or it needs the delta for frame-pointer-relative moves. in 
my experiment I stashed the stack pointer in %rdi, let the x86 epilogue pop 
the callee-save restores, then fetched the return address, restored the 
stack pointer, and pushed the return address.


- as mentioned, the epilogue needs to push the return address to the end of 
the stack so that the return instruction works, or it could move it into a 
register and make an indirect branch, which is similar. this is unless folks 
went with an alternative convention that returns the delta and issues an 
alloca in the parent frame after the call. I don't like that.


- either way, the call needs metadata making it look pretty much like a 
dynamic alloca instruction, with the same boundary and bookkeeping that 
surround a dynamic alloca, so that stack pointer calculations after the 
call work properly.


- emitting a write of the stack pointer in alloca instructions is an 
interesting consideration. nothing would depend on it, so it wouldn't add 
much latency, just a store slot and a tiny bit of store bandwidth, and the 
value is already right there. this is for the debugger.

Re: stack arenas using alloca

2024-08-22 Thread Michael Clark via Gcc

On 8/23/24 15:24, Michael Clark wrote:

On 8/15/24 06:24, Michael Clark wrote:

Hi Folks,

like I said, this is crazy talk, as alloca isn't even in the C standard. 
but VLAs are, and the current implementation of VLAs depends on alloca.


one more thing. it doesn't require PT_GNU_STACK or writable stacks like GCC 
nested functions. :-) so I think it is safer, but it does have safety 
issues, mostly related to stack overflows, and it's going to need some 
careful analysis with respect to ROP. returning deduced VLA types with 
bounds references in frame metadata and a chain of frame pointers for 
dynamic alloca might help. also, my experiment in LLVM needs clobbers in the 
parent frame because, while I was able to modify the epilogue based on an 
attribute, I so far haven't figured out how to make it look to the 
translator like an alloca in the parent frame, so that subsequent references 
to earlier locals save their stack pointer in a register.


this was my first test function:

char* f(int n) __attribute__((allocareturn))
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char *str = __builtin_alloca(len+1);
    snprintf(str, len+1, fmt, n);
    return str;
}

but this also works:

char* f(int n) __attribute__((allocareturn))
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char str[len+1];
    snprintf(str, len+1, fmt, n);
    return str;
}

and this is even better:

f(int n) -> auto
{
    static const char* fmt = "num-%d";
    int len = snprintf(0, 0, fmt, n);
    char str[len+1];
    snprintf(str, len+1, fmt, n);
    return str;
}

like a reference in an implicit copy constructor for a structure.

struct anon123456 { int &n; int arr[n]; };

that would be positioned at the end of the fixed frame so that the 
parent function knows where the bound is. like I said. crazy talk.
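
for context, a hypothetical caller, assuming the allocareturn attribute
sketched above existed (and <stdio.h> for puts): the returned buffer stays
live in the caller's extended frame until the caller itself returns, so
there is no free() and no heap traffic.

  void g(void)
  {
      char *s = f(123);
      puts(s);          /* prints "num-123" */
  }                     /* g's epilogue releases f's dynamic area as well */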


Michael.


Re: stack arenas using alloca

2024-08-22 Thread Michael Clark via Gcc

On 8/23/24 15:46, Michael Clark wrote:
one more thing. it doesn't require PT_GNU_STACK or writable stacks like 
GCC nested functions. 🙂 so I think it is safer but it does have safety 
issues, mostly related to stack overflows but it's going to need some 
careful analysis with respect to ROP.


brain isn't working. of course it needs writable stacks. it doesn't need 
executable stacks. starting to think about [[allocareturn]] ROP...


Re: stack arenas using alloca

2024-08-22 Thread Michael Clark via Gcc

On 8/23/24 15:57, Michael Clark wrote:

On 8/23/24 15:46, Michael Clark wrote:
one more thing. it doesn't require PT_GNU_STACK or writable stacks 
like GCC nested functions. 🙂 so I think it is safer but it does have 
safety issues, mostly related to stack overflows but it's going to need 
some careful analysis with respect to ROP.


brain isn't working. of course it needs writable stacks. it doesn't need 
executable stacks. starting to think about [[allocareturn]] ROP...


if, in the epilogue, you loaded the return address from the fixed frame 
into a temporary during the restore of callee-save registers (which usually 
unwinds the dynamic offset), you could translate the return instruction into 
an indirect branch. that gives ROP protection similar to functions without 
dynamic alloca returns.


then the biggest problem is stack overflows. we can use alloca anyway, we 
just have to call the function twice with some funky convention that 
suppresses array writes for null pointers, because we have to call it in the 
parent frame. this just makes it somewhat more useful for functions 
returning lists where you don't want to allocate on the heap.


if you don't have any side effects in loops besides array writes, you 
could have some funky feature where you run the loop once for a size. I 
tend to write a lot of code like that by hand and hate it. just look at 
code that uses the Vulkan API. it is full of client code that does that.
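
for reference, the familiar Vulkan two-call pattern alluded to above,
written with a stack allocation rather than the heap (the Vulkan calls are
real; the stack choice is just to tie it back to the alloca discussion):

  #include <vulkan/vulkan.h>
  #include <alloca.h>

  void list_devices(VkInstance instance)
  {
      uint32_t count = 0;
      vkEnumeratePhysicalDevices(instance, &count, NULL);    /* size pass */
      VkPhysicalDevice *devs =
          (VkPhysicalDevice *)alloca(count * sizeof(VkPhysicalDevice));
      vkEnumeratePhysicalDevices(instance, &count, devs);    /* write pass */
      /* ... use devs[0..count) ... */
  }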


that's sort of where this line of thinking emerged. it makes it harder if 
you want to write to either stack or heap/data though, because you need a 
couple of different translations of the function. I would like implicitly 
deduced memory region qualifiers like _StackRegion, _DataRegion, 
_HeapRegion, and _ThreadRegion for TLS. and I'm making the assumption that 
the stack is data-race free, which it is 99.9% of the time. then you can 
implicitly typedef _StackRegion MyType MyType in metadata for a translation 
that has an input type with _StackRegion.


I'd like an OS where stack can only escape to fibres and fibres can only 
run on one OS thread. that way we could allocate page table roots or 
overlay page tables for threads to make stack fault when accessed by 
another thread. grsecurity already does this for Linux kernel threads.


Michael.


Re: #pragma once behavior

2024-09-08 Thread Michael Clark via Gcc

I like the sound of resolved path identity from search, including the 
constructed full path and the index within the include paths. if I were 
writing a compiler from scratch, I'd probably do something like this:

  SHA2(include-path-search-offset-integer-of-found-header-to-support-include-next)
  + SHA2(container-relative-constructed-full-path-to-selected-header-during-include-search)
  + SHA2(selected-header-content)

and make it timeless. if you evaluate the search yourself, you have the 
constructed full path identity while you are searching, and you don't need 
tricks to get a full container-root-relative path because you already have 
it during the search. it's a forward-search-path-constructed relative root. 
what are the problems? it can include the same file twice if it has two path 
identities (hardlink or symlink), yet that seems the safer option to me 
because constant search-configuration-relative identity sounds better, as 
the two could be defined to be different identities wrt #pragma once. it's 
the semantically safer option. and the canonical example that fails (same 
content, different path, same mtime) would succeed with this choice. pick 
holes in this. I might be overlooking something.
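
as a hedged sketch of that identity (sha256_hex is a hypothetical helper
returning a hex digest of its input; nothing here is existing GCC code):

  #include <string>

  std::string sha256_hex(const std::string &data);   // hypothetical helper

  std::string once_identity(int search_index,             // include-path slot
                            const std::string &full_path, // constructed path
                            const std::string &content)   // header bytes
  {
      return sha256_hex(std::to_string(search_index))
           + sha256_hex(full_path)
           + sha256_hex(content);
  }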


Re: Passing a hidden argument in a dedicated register

2024-12-16 Thread Michael Clark via Gcc


> On 17 Dec 2024, at 5:44 AM, Alexander Monakov  wrote:
> 
> 
>> On Mon, 16 Dec 2024, Florian Weimer via Gcc wrote:
>> 
>> I would like to provide a facility to create wrapper functions without
>> lots of argument shuffling.  To achieve that, the wrapping function and
>> the wrapped function should have the same prototype.  There will be a
>> trampoline that puts additional data somewhere (possibly including the
>> address of the wrapped function, but that interpretation is up to the
>> wrapping function) and then transfers control to the wrapper function
>> with an indirect jump (tail call).
>> 
>> For signal safety, I think the hidden argument needs to be in a register
>> (instead of, say, thread-local storage).  Most System V ABI variants
>> seem to reserve a register for use by the dynamic linker, or for the
>> static chain pointer of nested functions.
>> 
>> Is there a way to reuse either register for this purpose and assign it
>> to a local variable reliably at the start of the wrapper function
>> implementation?
> 
> Not in a way that will work with LLVM, I'm afraid, and with GCC
> you'll have to shield wrappers from LTO:
> 
> register void *r10 asm("r10");
> void f(int, int);
> void f_wrap(int a, int b)
> {
>r10 = f;
>f(a, b);
> }
> 
> LLVM refuses to translate this. With GCC you must compile with -ffixed-r10,
> otherwise r10 is not reserved, and GCC will warn:
> 
> warning: call-clobbered register used for global register variable
>1 | register void *r10 asm("r10");
>  |^~~:
> 
> 
> This is the only approach I'm aware of, apart of generating wrappers
> in asm (speaking of, is there some reason that wouldn't work for you?).

I have a similar use case for replacing a scalar version of a function with 
a vector version. the vector version has some constant reference data that I 
want to hoist into the caller because it is called in a loop. I don't know 
how to do that without breaking interface encapsulation and using ifdef to 
manually hoist the uniform reference data into the caller. also, I want it 
to be controlled, not at the mercy of random optimizations.

in some trusted environment, if I had a transient field in a context 
structure passed by pointer that the function loaded into a vector register, 
then, provided the caller checked an identity invariant (the context pointer 
identity does not change in the loop), it could hoist some vector loads into 
the caller’s loop header. but that wouldn’t work for extern functions. you 
would need some vector extern ABI for functions called in loops, with a 
setup call that needs to be called in the loop header.

it’s due to loads of vector reference data combined with the requirement of 
interface encapsulation for scalar code, which tends to make you want to 
preserve a simple calling convention for the function's user. in Vulkan you 
would bind a uniform into the function’s environment. but we don't want the 
caller to need to know in advance what might be hoisted.

so I want a function designed for loops to export an implicit loop-header 
setup stub contingent on the identity of a context pointer. the calling 
convention says the caller needs to call the setup stub if the context 
identity changes. you could do that with attributes on the context identity 
and an outline function containing the hoist. I could then have a loop 
header setup call that is consistent with both scalar and vector code, and 
the scalar version is probably empty. a sketch follows below.
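
a hedged sketch of that convention (all names are illustrative; nothing
here is an existing ABI or attribute):

  #include <stddef.h>

  struct ctx;                            /* opaque context, scalar ABI unchanged */
  void  fn_setup(const struct ctx *c);   /* loop-header stub: hoists uniform
                                            reference data keyed on c; the
                                            scalar implementation is empty */
  float fn(const struct ctx *c, float x);

  void caller(const struct ctx *c, float *v, size_t n)
  {
      fn_setup(c);                       /* re-run only if c's identity changes */
      for (size_t i = 0; i < n; i++)
          v[i] = fn(c, v[i]);
  }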

reference data loads for shuffles are huge.

a super regular RISC that encodes constants in immediate blocks.

2025-03-08 Thread Michael Clark via Gcc

Hi Folks,

here is an idea expressed as a simple proof-of-concept simulator.

- https://github.com/michaeljclark/glyph/

I have been working on a proof-of-concept simulator for a RISC 
architecture with an immediate base register next to the program counter 
to split the front-end stream into independent instruction and constant 
streams. I named it glyph. it features a super-regular encoding scheme 
designed for vectorized decoding, and it uses an _i32x2_ vector of 
relative displacements in the link register to branch both instructions 
and constants at the same time. this evolved from thinking about a 
"virt" machine that was good for software to decode but could 
potentially be hardware.


I am still writing tests for it, but it is possible to test out the 
relative link register _(pc,ib)_ vector concept using subroutine branch 
displacements and constant block displacements in constant memory, which 
is read-only like instruction text. it has some interesting consequences: 
the stack does not leak spilled absolute addresses, so ASLR may be harder 
to bypass.


it uses flags for compare. register 0 can currently contain data. near 
branches are +/-512 bytes _(9-bit is the largest embedded immediate in 
the 16-bit format)_. the proof-of-concept has 8 registers, and there is 
a Python script for generating vectorized switch decoders for the 
variable-length instruction scheme. however, the proof-of-concept 
simulator currently only supports the 16-bit compressed opcode space.


it doesn't have a compiler or assembler yet. but considerable thought 
has gone into the design of the instruction format and the split 
instruction and constant streams. the 16-bit opcode space can access 
64-bit constants, but the 32-bit opcodes will support all typical 
constant sizes, and up to 64 read-write registers and more constants.


linker relocations are quite neat with this architecture. return from 
procedure needs a reloc due to the use of relative displacements in the 
link register. I am keen for feedback. there are a lot of details in the 
repo, including a Python script to explore combinatoric expansion for 
vectorized decoders.


it has a simpler length coding scheme than RISC-V at the expense of one bit 
of coding space in the 16-bit packet. as a result, it uses fewer wires and 
less logic for length decoding of 64-bit instructions. it borrows an idea 
from LEB128, i.e. it is _super regular_. but we won't know how it will do 
on density until we have the 32-bit opcodes.


Michael


a super regular RISC that encodes constants in immediate blocks

2025-05-06 Thread Michael Clark via Gcc

Hi GCC folks,

introducing a new RISC instruction set with a variable length
instruction packet using a super regular extension scheme with
compression techniques to separate the instruction stream into
two streams, one for instructions and one for constants:

- latest: https://metaparadigm.com/~mclark/glyph.pdf
- current: https://metaparadigm.com/~mclark/glyph-20250507.pdf

I am currently working on an assembler for it. I intend to come
up with something that lets one manually emit constants into a
.const section plus directives to tie together instructions and
constants for local and global symbols, but I may also come up
with a meta assembler mode that lets one use inline constants
whereby the assembler splits things into streams and allocates
address space for constants. I am writing a prototype assembler
in Python. it hopefully should be ready in a few months.

I would like to start with a Python assembler instead of using
binutils because it will make for faster compression tuning when
I get up to layout of the 32-bit opcodes which I intend to have
contain 128-bit packed SIMD, and the 64-bit opcodes I intend to
map to the 5-operand EVEX opcodes in AMD and Intel's AVX-512.

this is a preview release. note: the 16-bit packet is designed as
a standalone ISA, but it needs read-modify-write for bytes. we
have focused on utility for function prologue and epilogue. other
types and atomics will come in the 32-bit packet. it is intended
to be a virtual machine target and I wish to recode AVX-512 to it.
AMD and Intel can bite me if they don't want me to do this.

there are interpreters in C, Python, and Go. and I also have a
related project that I am working on for translation to x86.

- https://github.com/michaeljclark/glyph/
- https://github.com/michaeljclark/x86/

this work has been mentioned on the RISC-V mailing list but this
is the first time here. I think it is ready for early feedback.

Michael.