Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Matthias Kretz
Hi Martin,

I agree, we need more information from the compiler. Esp. whether the user 
specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's 
covered.

Related: PR83875 - because while we're adding things in that area, it'd be 
nice if they worked with target clones as well.

Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you 
can easily patch your (installed) libstdc++ using 
https://github.com/VcDevel/std-simd.
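
For a taste of the interface, here is a minimal sketch (assuming the 
std-simd headers are installed; native_simd<float>::size() reports the 
lane count the implementation picked for the target):

#include <experimental/simd>
#include <cstddef>
namespace stdx = std::experimental;

// Width (in elements) of the native ABI vector for float on this target,
// and the equivalent length in bytes:
constexpr std::size_t lanes  = stdx::native_simd<float>::size();
constexpr std::size_t veclen = lanes * sizeof(float);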

Cheers,
  Matthias

On Mittwoch, 22. Mai 2019 08:39:25 CEST Martin Reinecke wrote:
> [Disclaimer: I sent this to gcc-help two weeks ago, but didn't get an
> answer. Maybe the topic is more suited for the main gcc list ... I
> really think the feature in question would be extremely useful to have,
> and easy to add!]
> 
> Hi,
> 
> I'm currently writing an FFT library which tries to make use of SIMD
> instructions and uses a lot of variables with
>  __attribute__ ((vector_size (xyz)))
> 
> The resulting source is nicely portable and architecture-independent -
> except for the one place where I need to determine the maximum
> hardware-supported vector length on the target CPU.
> 
> This currently looks like
> 
> #if defined(__AVX__)
> constexpr int veclen=32;
> #elif defined(__SSE2__)
> constexpr int veclen=16;
> [...]
> 
> This approach requires me to add an #ifdef for many architectures, most
> of which I cannot really test on ... and new architectures will be
> unsupported by default.
> 
> Hence my question: is there a way in gcc to determine the hardware
> vector length for the architecture the compiler is currently targeting?
> Some predefined macro like
> 
> HARDWARE_VECTOR_LENGTH_IN_BYTES
> 
> which is 32 for AVX, 16 for SSE2, and has proper values for Neon, VSX
> etc. etc.
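> 
> For illustration, the cascade above would then collapse to something
> like this (the macro name and the fallback are of course hypothetical):
> 
> #if defined(HARDWARE_VECTOR_LENGTH_IN_BYTES)  /* hypothetical macro */
> constexpr int veclen = HARDWARE_VECTOR_LENGTH_IN_BYTES;
> #else
> constexpr int veclen = 16; /* conservative fallback for unknown targets */
> #endif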
> 
> If this is not provided at the moment, would it be possible to add this
> in the future? This could massively simplify writing and maintaining
> multi-platform SIMD code.
> 
> Thanks,
>   Martin

-- 
──
 Dr. Matthias Kretz                            https://kretzfamily.de
 GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
 SIMD easy and portable https://github.com/VcDevel/Vc
──

Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Martin Reinecke
Hi Matthias!

> I agree, we need more information from the compiler. Esp. whether the user 
> specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
> OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's 
> covered.

Almost ... except that I'd need a platform-agnostic definition. The
point is that the code does not care about the underlying hardware at
all, only for the vector length supported by it.

> Related: PR83875 - because while we're adding things in that area, it'd be 
> nice if they worked with target clones as well.

Yes, this is a problem I've come across as well in the past.
(https://gcc.gnu.org/ml/gcc-help/2018-10/msg00118.html)

> Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you 
> can easily patch your (installed) libstdc++ using 
> https://github.com/VcDevel/std-simd.

This looks extremely interesting! I have to look at it in more detail,
but this might be the way to go in the future.
However, the code I'm working on may be incorporated into numpy/scipy at
some point, and the minimum required compilers for these packages are
pretty old. I can't expect more than vanilla C++11 support there.

Cheers,
  Martin



Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Richard Biener
On Wed, May 22, 2019 at 10:36 AM Martin Reinecke
 wrote:
>
> Hi Matthias!
>
> > I agree, we need more information from the compiler. Esp. whether the user
> > specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
> > OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's
> > covered.
>
> Almost ... except that I'd need a platform-agnostic definition. The
> point is that the code does not care about the underlying hardware at
> all, only for the vector length supported by it.

And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal) length
depends on the component type...

I wonder if we'd want to have an 'auto' length instead ;)

I suppose exposing a __BIGGEST_VECTOR__ might be possible (not for SVE
though?).

> > Related: PR83875 - because while we're adding things in that area, it'd be
> > nice if they worked with target clones as well.
>
> Yes, this is a problem I've come across as well in the past.
> (https://gcc.gnu.org/ml/gcc-help/2018-10/msg00118.html)
>
> > Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you
> > can easily patch your (installed) libstdc++ using
> > https://github.com/VcDevel/std-simd.
>
> This looks extremely interesting! I have to look at it in more detail,
> but this might be the way to go in the future.
> However, the code I'm working on may be incorporated into numpy/scipy at
> some point, and the minimum required compilers for these packages are
> pretty old. I can't expect more than vanilla C++11 support there.
>
> Cheers,
>   Martin
>


Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Martin Reinecke



On 5/22/19 11:17 AM, Richard Biener wrote:

>> Almost ... except that I'd need a platform-agnostic definition. The
>> point is that the code does not care about the underlying hardware at
>> all, only for the vector length supported by it.
> 
> And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal)
> length depends on the component type...

You mean different vector lengths for float, double, int etc?
I would be fine to have different macros for those, if necessary.

> I wonder if we'd want to have an 'auto' length instead ;)

I know it's weird, but for my code (which is definitely not a synthetic
test case) that would work perfectly :)
If you'd like to have a look:
https://gitlab.mpcdf.mpg.de/mtr/pocketfft/tree/cpp
(the only platform-dependent part is on lines 82-95 of the header file).

Still, I would need a way to determine how long the vectors actually
are. But it would probably be enough to measure this at runtime then.

Cheers,
  Martin


Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Matthias Kretz
On Mittwoch, 22. Mai 2019 11:17:57 CEST Richard Biener wrote:
> On Wed, May 22, 2019 at 10:36 AM Martin Reinecke
>  wrote:
> > Hi Matthias!
> > 
> > > I agree, we need more information from the compiler. Esp. whether the
> > > user
> > > specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
> > > OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So
> > > that's
> > > covered.
> > 
> > Almost ... except that I'd need a platform-agnostic definition. The
> > point is that the code does not care about the underlying hardware at
> > all, only for the vector length supported by it.
> 
> And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal)
> length depends on the component type...
> 
> I wonder if we'd want to have an 'auto' length instead ;)
> 
> I suppose exposing a __BIGGEST_VECTOR__ might be possible (not for SVE
> though?).

AVX vs. AVX2 is a valid question. std::experimental::simd solves it by having 
the size depend on the element type. So I agree, `vector_size(auto)` seems 
like a possible solution. One could then take the sizeof if the number is 
important to know.
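
A small sketch of that idea (only the explicit width is valid GCC today; 
the 'auto' form is hypothetical):

// Valid today: explicit width, lane count recovered via sizeof.
typedef float vfloat __attribute__((vector_size(32)));
constexpr int lanes = sizeof(vfloat) / sizeof(float);  // 8 for this typedef

// Proposed: let the target choose the width (hypothetical syntax).
// typedef float vfloat __attribute__((vector_size(auto)));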

-- 
──
 Dr. Matthias Kretz                            https://kretzfamily.de
 GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
 SIMD easy and portable https://github.com/VcDevel/Vc
──

Re: Determining maximum vector length supported by the CPU?

2019-05-22 Thread Matthias Kretz
On Mittwoch, 22. Mai 2019 11:27:25 CEST Martin Reinecke wrote:
> Still, I would need a way to determine how long the vectors actually
> are. But it would probably be enough to measure this at runtime then.

FWIW, something that took me way too long to figure out: You can use vector 
builtins very conveniently in C++ with the traits I define at 
https://github.com/VcDevel/std-simd/blob/59e6348a9d34b4ef4f5ef1fc4f423dd75e1987f3/experimental/bits/simd.h#L925
(ignore _SimdWrapper: I'm working on phasing it out after I discovered the 
_VectorTraits solution). You'll need __vector_type, __is_vector_type and 
_VectorTraits and then you can write:

template <class T, class VT = _VectorTraits<T>>
void f(T x) {
  using element_type = typename VT::value_type;
  constexpr auto N = VT::_S_width;
  ...
}

f(x) only participates in overload resolution if x is a vector builtin type 
because otherwise _VectorTraits leads to a substitution failure (SFINAE).
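
For example (a sketch; any GCC vector builtin type works):

typedef float V __attribute__((vector_size(16)));  // 4-lane float builtin

V x = {1, 2, 3, 4};
f(x);        // OK: _VectorTraits deduces value_type = float, _S_width = 4
// f(1.0f);  // no match: the overload is removed by SFINAE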

-- 
──
 Dr. Matthias Kretz                            https://kretzfamily.de
 GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
 SIMD easy and portable https://github.com/VcDevel/Vc
──

Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Richard Earnshaw (lists)
On 21/05/2019 21:18, Bill Schmidt wrote:
> On 5/21/19 11:47 AM, Martin Sebor wrote:
>> The GCC coding style says to use "floating-point" as an adjective
>> rather than "floating point."  After enhancing the -Wformat-diag
>> checker to detect this I found a bunch of uses of the latter, such
>> as in:
>>
>>   gcc/c/c-decl.c:10944
>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>   gcc/convert.c:418, 422
>>   gcc/cp/call.c:5070
>>   gcc/cp/cvt.c:886
>>
>> Before I fix them all and adjust the tests, I want to make sure
>> we really want to follow this rule.  The C standard uses both
>> interchangeably.  With just one exception, the C++ standard uses
>> the hyphenated form.
> The hyphenated form is correct English, so I certainly prefer it. :-)
> 

It's not quite as simple as that.  Hyphens should be used to make it
clear what is the adjective and what is the noun:

   A floating-point number (hyphenated) is a number with a
   floating point (no hyphen).

In the first case 'floating-point' is the adjective and qualifies
number.  In the second case 'floating' is the adjective and qualifies
'point'.

But this is English, so there are probably some exceptions even then -
but not in this case, I think.  :-)

R.



Re: mfentry and Darwin.

2019-05-22 Thread Iain Sandoe
Hi Uros,

> On 21 May 2019, at 19:36, Uros Bizjak  wrote:
> 
> On Tue, May 21, 2019 at 6:15 PM Iain Sandoe  wrote:
>> 

>> It seems to me that (even if it was working “properly”, which it isn't)  
>> ‘-mfentry’ would break ABI on Darwin for both 32 and 64b - which require 
>> 16-byte stack alignment at call sites. 
>> 
>> For Darwin, the dynamic loader enforces the requirement when it can and will 
>> abort a program that tries to make a DSO linkage with the stack in an 
>> incorrect alignment.  We previously had a bug against profiling caused by 
>> exactly this issue (but when the mcount call was in the post-prologue 
>> position).
>> 
>> Actually, I’m not sure why it’s not an issue for other 64b platforms that 
>> use the psABI (AFAIR,  it’s only the 32b case that’s Darwin-specific).
> 
> The __fentry__ in glibc is written as a wrapper around the call to
> __mcount_internal, and is written in such a way that it compensates
> stack misalignment in a call to __mcount_internal. __fentry__ survives
> stack misalignment, since no xmm regs are saved to the stack in the
> function.

Well, we can’t change Darwin’s libc to do something similar (and anyway the 
dynamic loader would also need to know that this was special as well, to 
avoid aborting the exe).

... however we could do a dodge where some shim code was inserted into any TU 
that used -mfentry to redirect the call to an ABI-compliant launchpad… etc. etc.

It seems we can’t instrument “exactly at the entry” .. only “pretty close to 
it”.

>> Anyway, my current plan is to disable mfentry (for Darwin) - the alternative 
>> might be some kind of “almost at the start of the function, but needing some 
>> stack alignment change”,
>> 
>> I’m interested in if you know of any compelling use-cases that would make it 
>> worth finding some work-around instead of disabling.
> 
> Unfortunately, not from the top of my head…

Well, I can’t either, so for now I’m going to make a patch to disable it (since 
it’s not fully working in any case) .. if anyone screams that there’s a major 
reduction in functionality - we can investigate some scheme as mentioned above.

thanks
Iain



Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Bill Schmidt
On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
> On 21/05/2019 21:18, Bill Schmidt wrote:
>> On 5/21/19 11:47 AM, Martin Sebor wrote:
>>> The GCC coding style says to use "floating-point" as an adjective
>>> rather than "floating point."  After enhancing the -Wformat-diag
>>> checker to detect this I found a bunch of uses of the latter, such
>>> as in:
>>>
>>>   gcc/c/c-decl.c:10944
>>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>>   gcc/convert.c:418, 422
>>>   gcc/cp/call.c:5070
>>>   gcc/cp/cvt.c:886
>>>
>>> Before I fix them all and adjust the tests, I want to make sure
>>> we really want to follow this rule.  The C standard uses both
>>> interchangeably.  With just one exception, the C++ standard uses
>>> the hyphenated form.
>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>
> It's not quite as simple as that.  Hyphens should be used to make it
> clear what is the adjective and what is the noun:
>
>A floating-point number (hyphenated) is a number with a
>floating point (no hyphen).
>
> In the first case 'floating-point' is the adjective and qualifies
> number.  In the second case 'floating' is the adjective and qualifies
> 'point'.
>
> But this is English, so there are probably some exceptions even then -
> but not in this case, I think.  :-)

English is always fun, agreed -- Martin cited the requirement to use
"floating-point" when it's used as an adjective, which is certainly correct.

There's a more interesting question around cavalier usage such as,
"We should use floating point."  I would argue that there is an implied
noun "arithmetic" modified here, so this should also be hyphenated,
but I daresay there would be people on both sides of this one...

This is why grammar police usually die from friendly fire. :-)

Bill
>
> R.
>



Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Richard Earnshaw (lists)
On 22/05/2019 13:17, Bill Schmidt wrote:
> On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
>> On 21/05/2019 21:18, Bill Schmidt wrote:
>>> On 5/21/19 11:47 AM, Martin Sebor wrote:
 The GCC coding style says to use "floating-point" as an adjective
 rather than "floating point."  After enhancing the -Wformat-diag
 checker to detect this I found a bunch of uses of the latter, such
 as in:

   gcc/c/c-decl.c:10944
   gcc/c/c-parser.c:9423, 9446, 9450, etc.
   gcc/convert.c:418, 422
   gcc/cp/call.c:5070
   gcc/cp/cvt.c:886

 Before I fix them all and adjust the tests, I want to make sure
 we really want to follow this rule.  The C standard uses both
 interchangeably.  With just one exception, the C++ standard uses
 the hyphenated form.
>>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>>
>> It's not quite as simple as that.  Hyphens should be used to make it
>> clear what is the adjective and what is the noun:
>>
>>A floating-point number (hyphenated) is a number with a
>>floating point (no hyphen).
>>
>> In the first case 'floating-point' is the adjective and qualifies
>> number.  In the second case 'floating' is the adjective and qualifies
>> 'point'.
>>
>> But this is English, so there are probably some exceptions even then -
>> but not in this case, I think.  :-)
> 
> English is always fun, agreed -- Martin cited the requirement to use
> "floating-point" when it's used as an adjective, which is certainly correct.
> 
> There's a more interesting question around cavalier usage such as,
> "We should use floating point."  I would argue that there is an implied
> noun "arithmetic" modified here, so this should also be hyphenated,
> but I daresay there would be people on both sides of this one...

I would argue that leaving out "arithmetic" is the error. :-)

> 
> This is why grammar police usually die from friendly fire. :-)
> 

Sticking your head above the parapet is always fraught with danger :)


R.


Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Richard Earnshaw (lists)
On 22/05/2019 13:17, Bill Schmidt wrote:

> "We should use floating point."  

In fact, since we're compiler folk, we just say it's undefined behaviour
and optimize it to

"."

:-P

R.


[AArch64 ELF ABI] Vector calls and lazy binding on AArch64

2019-05-22 Thread Szabolcs Nagy
The lazy binding code of aarch64 currently only preserves q0-q7 of the
fp registers, but for an SVE call [AAPCS64+SVE] it should preserve p0-p3
and z0-z23, and for an AdvSIMD vector call [VABI64] it should preserve
q0-q23. (Vector calls are extensions of the base PCS [AAPCS64].)

A possible fix is to save and restore the additional register state in
the lazy binding entry code, this was discussed in

  https://sourceware.org/ml/libc-alpha/2018-08/msg00017.html

the main objections were

(1) Linux may optimize the kernel entry code for processes that don't
use SVE, so lazy binding should avoid accessing SVE registers.

(2) If this is fixed in the dynamic linker, vector calls will not be
backward compatible with old glibc.

(3) The saved SVE register state can be large (> 8K), so binaries that
work today may run out of stack space on an SVE system during lazy
binding (which can e.g. happen in a signal handler on a tiny stack).

and the proposed solution was to force bind now semantics for vector
functions e.g. by not calling them via PLT. This turned out to be harder
than I expected. I no longer think (1) and (2) are critically important,
but (3) is a correctness issue which is hard to argue away (would
require larger stack allocations to accommodate the worst case stack
size increase, but the stack allocation is not always under the control
of glibc, so it cannot provide strict guarantees).

Some approaches to make symbols "bind now" were discussed at

  https://groups.google.com/forum/#!topic/generic-abi/Bfb2CwX-u4M

The ABI change draft is below the notes, it requires marking symbols
in the ELF symbol table that follow the vector PCS (or other variant
PCS conventions). This is most relevant to dynamic linkers with lazy
binding support and to ELF linkers targeting AArch64, but assemblers
will need to be updated too.

Note 1: the dynamic linker may have to run user code during lazy binding
because of ifunc resolvers, so it cannot avoid clobbering fp regs.

Note 2: the tlsdesc entry is also affected by (3), so either the
initial DTV setup should avoid clobbering fp regs or the SVE register
state should not be callee-preserved by the tlsdesc call ABI (the latter
was chosen, which is backward compatible with old dynamic linkers, but
tls access from SVE code is as expensive as an extern call now: the
caller has to spill).

Note 3: signal frame and SVE register spills in code using SVE can also
lead to variable stack usage (AT_MINSIGSTKSZ was introduced to address
the former issue on linux) so it is a valid approach to just increase
min stack size limits on aarch64 compared to other targets (this is less
invasive, but does not fix old binaries).

Note 4: the proposal requires marking symbols in asm and elf objects, so
it is not compatible with existing tooling (old as or ld cannot create
valid vector function symbol references or definitions) and it is only
effective with a new dynamic linker.

Note 5: -fno-plt style code generation for vector function calls might
have worked too, but on aarch64 it requires compiler and linker changes
to avoid PLT in position dependent code when that is emitted for the
sake of pointer equality. It also requires tightening the ABI to ensure
the static linker does not introduce PLT when processing certain static
relocations. This approach would generate suboptimal static linked code 
(the no-plt code is hard to relax into direct calls on aarch64), be fragile 
(easy to accidentally introduce a PLT), and be hard to diagnose.

Note 6: the proposed solution applies to both SVE calls and AdvSIMD
vector calls, even though some issues only apply to SVE.

Note 7: a separate dynamic linker entry point for variant PCS calls
may be introduced (requires further ELF changes for a PLT0 like stub)
or the dynamic linker may decide to always preserve all registers or
decide to always bind symbols at load time.


AAELF64: in the Symbol Table section add

 st_other Values
 The  st_other  member  of  a symbol table entry specifies the symbol's
 visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
 generic  ELF ABI [SCO-ELF], and while there are no values reserved for
 processor-specific semantics, many other architectures have used these
 bits.

 The  defined  processor-specific  st_other  flag  values are listed in
 Table 4-5-1.

 Table 4-5-1, Processor specific st_other flags
 +------------------------+------+---------------------+
 |Name                    | Mask | Comment             |
 +------------------------+------+---------------------+
 |STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
 |                        |      | associated with the |
 |                        |      | symbol may follow a |
 |                        |      | variant procedure   |
 |                        |      | call standard with  |
 |                        |      | different register  |
 |                        |      | usage convention.   |
 +------------------------+------+---------------------+

 A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS
 flag set in its st_other field may be associated with a function that
 follows a variant procedure call standard with different register
 usage convention from the one defined in the base procedure call
 standard for the list of argument, caller-saved and callee-saved
 registers [AAPCS64].  The rules in the Call and Jump relocations
 section still apply to such functions, and if a subroutine is called
 via a symbol reference that is marked with STO_AARCH64_VARIANT_PCS
 then code that runs between the calling routine and called subroutine
 must preserve the contents of all registers except IP0, IP1 and the
 condition code flags [AAPCS64].

Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Martin Sebor

On 5/22/19 6:27 AM, Richard Earnshaw (lists) wrote:

On 22/05/2019 13:17, Bill Schmidt wrote:

On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:

On 21/05/2019 21:18, Bill Schmidt wrote:

On 5/21/19 11:47 AM, Martin Sebor wrote:

The GCC coding style says to use "floating-point" as an adjective
rather than "floating point."  After enhancing the -Wformat-diag
checker to detect this I found a bunch of uses of the latter, such
as in:

   gcc/c/c-decl.c:10944
   gcc/c/c-parser.c:9423, 9446, 9450, etc.
   gcc/convert.c:418, 422
   gcc/cp/call.c:5070
   gcc/cp/cvt.c:886

Before I fix them all and adjust the tests, I want to make sure
we really want to follow this rule.  The C standard uses both
interchangeably.  With just one exception, the C++ standard uses
the hyphenated form.

The hyphenated form is correct English, so I certainly prefer it. :-)


It's not quite as simple as that.  Hyphens should be used to make it
clear what is the adjective and what is the noun:

A floating-point number (hyphenated) is a number with a
floating point (no hyphen).

In the first case 'floating-point' is the adjective and qualifies
number.  In the second case 'floating' is the adjective and qualifies
'point'.

But this is English, so there are probably some exceptions even then -
but not in this case, I think.  :-)


English is always fun, agreed -- Martin cited the requirement to use
"floating-point" when it's used as an adjective, which is certainly correct.

There's a more interesting question around cavalier usage such as,
"We should use floating point."  I would argue that there is an implied
noun "arithmetic" modified here, so this should also be hyphenated,
but I daresay there would be people on both sides of this one...


I would argue that leaving out "arithmetic" is the error. :-)


I agree.  Unfortunately, there are a few cases like that among
the diagnostics that my script has already fixed:

  decimal floating point not supported
  comparing floating point with %<==%> or %<!=%> is unsafe
  ISO C does not support decimal floating point

They probably should read

  decimal floating point types not supported
  comparing floating-point values with %<==%> or %<!=%> is unsafe
  ISO C does not support decimal floating point types

I think they can be adjusted later if we think it's important,
after the checker is finalized and committed.  I don't want to
complicate it too much by having to differentiate between
adjectives and nouns.  The vast majority of the "floating point"
instances it has found are adjectives.

Martin



This is why grammar police usually die from friendly fire. :-)



Sticking your head above the parapet is always fraught with danger :)


R.





Re: -Wformat-diag: floating-point or floating point?

2019-05-22 Thread Bill Schmidt
On 5/22/19 9:58 AM, Martin Sebor wrote:
> On 5/22/19 6:27 AM, Richard Earnshaw (lists) wrote:
>> On 22/05/2019 13:17, Bill Schmidt wrote:
>>> On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
 On 21/05/2019 21:18, Bill Schmidt wrote:
> On 5/21/19 11:47 AM, Martin Sebor wrote:
>> The GCC coding style says to use "floating-point" as an adjective
>> rather than "floating point."  After enhancing the -Wformat-diag
>> checker to detect this I found a bunch of uses of the latter, such
>> as in:
>>
>>    gcc/c/c-decl.c:10944
>>    gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>    gcc/convert.c:418, 422
>>    gcc/cp/call.c:5070
>>    gcc/cp/cvt.c:886
>>
>> Before I fix them all and adjust the tests, I want to make sure
>> we really want to follow this rule.  The C standard uses both
>> interchangeably.  With just one exception, the C++ standard uses
>> the hyphenated form.
> The hyphenated form is correct English, so I certainly prefer it. :-)
>
 It's not quite as simple as that.  Hyphens should be used to make it
 clear what is the adjective and what is the noun:

     A floating-point number (hyphenated) is a number with a
     floating point (no hyphen).

 In the first case 'floating-point' is the adjective and qualifies
 number.  In the second case 'floating' is the adjective and qualifies
 'point'.

 But this is English, so there are probably some exceptions even then -
 but not in this case, I think.  :-)
>>>
>>> English is always fun, agreed -- Martin cited the requirement to use
>>> "floating-point" when it's used as an adjective, which is certainly
>>> correct.
>>>
>>> There's a more interesting question around cavalier usage such as,
>>> "We should use floating point."  I would argue that there is an implied
>>> noun "arithmetic" modified here, so this should also be hyphenated,
>>> but I daresay there would be people on both sides of this one...
>>
>> I would argue that leaving out "arithmetic" is the error. :-)
>
> I agree.  Unfortunately, there are a few cases like that among
> the diagnostics that my script has already fixed:
>
>   decimal floating point not supported
>   comparing floating point with %<==%> or %<!=%> is unsafe
>   ISO C does not support decimal floating point
>
> They probably should read
>
>   decimal floating point types not supported
>   comparing floating-point values with %<==%> or %<!=%> is unsafe
>   ISO C does not support decimal floating point types

"decimal floating point types" does use "floating-point" to modify
"types", so if you change those they should probably remain hyphenated. 
Technically the whole phrase "decimal floating point" modifies types
together, and should be hyphenated together, but that just looks fussy
and isn't common practice.  None of those details are going to solve
world hunger. :-)  Thanks for fixing the general problem!  For the edge
cases, Richard's optimization looks better and better. :-P

Bill
>
> I think they can be adjusted later if we think it's important,
> after the checker is finalized and committed.  I don't want to
> complicate it too much by having to differentiate between
> adjectives and nouns.  The vast majority of the "floating point"
> instances it has found are adjectives.
>
> Martin
>
>
>>> This is why grammar police usually die from friendly fire. :-)
>>>
>>
>> Sticking your head above the parapet is always fraught with danger :)
>>
>>
>> R.
>>
>



Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

2019-05-22 Thread Florian Weimer
* Szabolcs Nagy:

> AAELF64: in the Symbol Table section add
>
>  st_other Values
>  The  st_other  member  of  a symbol table entry specifies the symbol's
>  visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>  generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>  processor-specific semantics, many other architectures have used these
>  bits.
>
>  The  defined  processor-specific  st_other  flag  values are listed in
>  Table 4-5-1.
>
>  Table 4-5-1, Processor specific st_other flags
>  +------------------------+------+---------------------+
>  |Name                    | Mask | Comment             |
>  +------------------------+------+---------------------+
>  |STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
>  |                        |      | associated with the |
>  |                        |      | symbol may follow a |
>  |                        |      | variant procedure   |
>  |                        |      | call standard with  |
>  |                        |      | different register  |
>  |                        |      | usage convention.   |
>  +------------------------+------+---------------------+
>
>  A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>  flag set in its st_other field may be associated with a function  that
>  follows  a  variant  procedure  call  standard with different register
>  usage convention from the one  defined  in  the  base  procedure  call
>  standard  for  the  list  of  argument,  caller-saved and callee-saved
>  registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>  section  still  apply to such functions, and if a subroutine is called
>  via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>  then  code that runs between the calling routine and called subroutine
>  must preserve the contents of all registers except IP0,  IP1  and  the
>  condition code flags [AAPCS64].

Can you clarify if there has to be a valid stack at this point which can
be used during the call transfer?  What about the stack alignment
requirement?

Thanks,
Florian


Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

2019-05-22 Thread Szabolcs Nagy
On 22/05/2019 16:06, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
>> AAELF64: in the Symbol Table section add
>>
>>  st_other Values
>>  The  st_other  member  of  a symbol table entry specifies the symbol's
>>  visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>>  generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>>  processor-specific semantics, many other architectures have used these
>>  bits.
>>
>>  The  defined  processor-specific  st_other  flag  values are listed in
>>  Table 4-5-1.
>>
>>  Table 4-5-1, Processor specific st_other flags
>>  +------------------------+------+---------------------+
>>  |Name                    | Mask | Comment             |
>>  +------------------------+------+---------------------+
>>  |STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
>>  |                        |      | associated with the |
>>  |                        |      | symbol may follow a |
>>  |                        |      | variant procedure   |
>>  |                        |      | call standard with  |
>>  |                        |      | different register  |
>>  |                        |      | usage convention.   |
>>  +------------------------+------+---------------------+
>>
>>  A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>  flag set in its st_other field may be associated with a function  that
>>  follows  a  variant  procedure  call  standard with different register
>>  usage convention from the one  defined  in  the  base  procedure  call
>>  standard  for  the  list  of  argument,  caller-saved and callee-saved
>>  registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>>  section  still  apply to such functions, and if a subroutine is called
>>  via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>>  then  code that runs between the calling routine and called subroutine
>>  must preserve the contents of all registers except IP0,  IP1  and  the
>>  condition code flags [AAPCS64].
> 
> Can you clarify if there has to be a valid stack at this point which can
> be used during the call transfer?  What about the stack alignment
> requirement?

the intention is to only allow 'register usage convention' to be
relaxed compared to the base PCS (which has rules for stack etc),
and even the register usage convention has to be compatible with
the 'Call and Jump relocations section' which essentially says that
veneers inserted by the linker between calls can clobber IP0, IP1
and the condition flags.

i.e. a variant pcs function follows the same rules as base pcs, but
it may use different caller-/callee-saved/argument registers.

when SVE pcs is merged into the current AAPCS document, then i hope
the 'variant pcs' term used here will be properly specified so the
ELF ABI will just refer back to that.



Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

2019-05-22 Thread Florian Weimer
* Szabolcs Nagy:

> On 22/05/2019 16:06, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>>> AAELF64: in the Symbol Table section add
>>>
>>>  st_other Values
>>>  The  st_other  member  of  a symbol table entry specifies the symbol's
>>>  visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>>>  generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>>>  processor-specific semantics, many other architectures have used these
>>>  bits.
>>>
>>>  The  defined  processor-specific  st_other  flag  values are listed in
>>>  Table 4-5-1.
>>>
>>>  Table 4-5-1, Processor specific st_other flags
>>>  +------------------------+------+---------------------+
>>>  |Name                    | Mask | Comment             |
>>>  +------------------------+------+---------------------+
>>>  |STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
>>>  |                        |      | associated with the |
>>>  |                        |      | symbol may follow a |
>>>  |                        |      | variant procedure   |
>>>  |                        |      | call standard with  |
>>>  |                        |      | different register  |
>>>  |                        |      | usage convention.   |
>>>  +------------------------+------+---------------------+
>>>
>>>  A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>>  flag set in its st_other field may be associated with a function  that
>>>  follows  a  variant  procedure  call  standard with different register
>>>  usage convention from the one  defined  in  the  base  procedure  call
>>>  standard  for  the  list  of  argument,  caller-saved and callee-saved
>>>  registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>>>  section  still  apply to such functions, and if a subroutine is called
>>>  via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>>>  then  code that runs between the calling routine and called subroutine
>>>  must preserve the contents of all registers except IP0,  IP1  and  the
>>>  condition code flags [AAPCS64].
>> 
>> Can you clarify if there has to be a valid stack at this point which can
>> be used during the call transfer?  What about the stack alignment
>> requirement?
>
> the intention is to only allow 'register usage convention' to be
> relaxed compared to the base PCS (which has rules for stack etc),
> and even the register usage convention has to be compatible with
> the 'Call and Jump relocations section' which essentially says that
> veneers inserted by the linker between calls can clobber IP0, IP1
> and the condition flags.
>
> i.e. a variant pcs function follows the same rules as base pcs, but
> it may use different caller-/callee-saved/argument registers.
>
> when SVE pcs is merged into the current AAPCS document, then i hope
> the 'variant pcs' term used here will be properly specified so the
> ELF ABI will just refer back to that.

My concern is that with the current language, it's not clear whether
it's possible to use the stack as a scratch area during the call
transition, or rely on a valid TCB.  I think this is rather
underspecified.

Thanks,
Florian


Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

2019-05-22 Thread Szabolcs Nagy
On 22/05/2019 16:34, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
>> On 22/05/2019 16:06, Florian Weimer wrote:
>>> * Szabolcs Nagy:
>>>
 AAELF64: in the Symbol Table section add

  st_other Values
  The  st_other  member  of  a symbol table entry specifies the symbol's
  visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
  generic  ELF ABI [SCO-ELF], and while there are no values reserved for
  processor-specific semantics, many other architectures have used these
  bits.

  The  defined  processor-specific  st_other  flag  values are listed in
  Table 4-5-1.

  Table 4-5-1, Processor specific st_other flags
  +------------------------+------+---------------------+
  |Name                    | Mask | Comment             |
  +------------------------+------+---------------------+
  |STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
  |                        |      | associated with the |
  |                        |      | symbol may follow a |
  |                        |      | variant procedure   |
  |                        |      | call standard with  |
  |                        |      | different register  |
  |                        |      | usage convention.   |
  +------------------------+------+---------------------+

  A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
  flag set in its st_other field may be associated with a function  that
  follows  a  variant  procedure  call  standard with different register
  usage convention from the one  defined  in  the  base  procedure  call
  standard  for  the  list  of  argument,  caller-saved and callee-saved
  registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
  section  still  apply to such functions, and if a subroutine is called
  via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
  then  code that runs between the calling routine and called subroutine
  must preserve the contents of all registers except IP0,  IP1  and  the
  condition code flags [AAPCS64].
>>>
>>> Can you clarify if there has to be a valid stack at this point which can
>>> be used during the call transfer?  What about the stack alignment
>>> requirement?
>>
>> the intention is to only allow 'register usage convention' to be
>> relaxed compared to the base PCS (which has rules for stack etc),
>> and even the register usage convention has to be compatible with
>> the 'Call and Jump relocations section' which essentially says that
>> veneers inserted by the linker between calls can clobber IP0, IP1
>> and the condition flags.
>>
>> i.e. a variant pcs function follows the same rules as base pcs, but
>> it may use different caller-/callee-saved/argument regiseters.
>>
>> when SVE pcs is merged into the current AAPCS document, then i hope
>> the 'variant pcs' term used here will be properly specified so the
>> ELF ABI will just refer back to that.
> 
> My concern is that with the current language, it's not clear whether
> it's possible to use the stack as a scratch area during the call
> transition, or rely on a valid TCB.  I think this is rather
> underspecified.

i think that's underspecified in general for normal calls too,
currently the glibc dynamic linker assumes it can use some stack
space and do various async signal safe operations (some of which
may even fail), variant pcs does not change any of this.

it only provides a per symbol escape hatch for functions with a
bit special call convention, and i plan to use the symbol marking
in glibc as 'force bind now for these symbols', because other
behaviour may not be forward compatible if the architecture
changes again (if lazy binding turns out to be very important
for these symbols i'd prefer introducing a second entry point
for them instead of checking the elf flags from the entry asm).

i'll try to post patches implementing this abi soon.
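
for illustration, the check itself is cheap once the flag exists; a
minimal sketch (the #define guards against an elf.h that predates the
proposal, using the proposed value 0x80):

#include <elf.h>

#ifndef STO_AARCH64_VARIANT_PCS
# define STO_AARCH64_VARIANT_PCS 0x80  /* proposed st_other flag */
#endif

/* a dynamic linker can resolve marked symbols at load time (bind now)
   instead of teaching the lazy binding entry about variant register
   sets.  */
static bool
symbol_needs_bind_now (const Elf64_Sym *sym)
{
  return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
}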


RTL Layer Paralleling

2019-05-22 Thread nick
Greetings All,

After giving it some thought, I've yet to see any real focus on making the RTL
layers or the architectural backends parallel in nature. Is there a reason
for this? My assumption, after reading parts of the x86 and rs6000 backend
code, is that it's not needed.

This seems to be the case as most of the profiling only really shows
bottlenecking, at least in my research, at the gimple layer. Is this because
little time has been put into researching the RTL passes, or is it just
assumed to offer no benefit relative to gimple and perhaps post-GENERIC/
pre-GIMPLE passes?

Just curious, as I'm not an RTL or backend expert,

Nick


On-Demand range technology [2/5] - Major Components : How it works

2019-05-22 Thread Andrew MacLeod
This note will talk about the 4 major components of the prototype and 
explain how they work together.  I will be fairly light on detail, just 
to give an overview; we can delve into whatever details are needed.

- Range-ops : Range operations at the statement level
- GORI - Generates Outgoing Range Info : Ranges generated by basic blocks
- GORI Cache - combining and caching basic-block range info
- Ranger - API and query control

1 * Range-ops

The first major component is the centralized range operation database. 
This is where range operations for tree-codes are implemented.  The 
compiler currently has code which can fold ranges, but the new mechanism 
is a general purpose solver which can solve for other operands. If there 
are N input/output operands, and we have ranges for N-1, It is often 
possible to derive the missing range.   ie

    lhs = op1 + op2
The existing code in the compiler can solve for LHS given ranges for OP1 
and OP2. This has been extended so that we can also sometimes solve for 
either op1 or op2, e.g.

    [20,40] = op1 + [5, 10]
...can calculate that op1 has the range [10, 35]

This ability to solve for the other operands provides the ability to 
calculate ranges in the reverse order we are accustomed to, and is key 
to enabling the on-demand range approach.

    a_2 = b_1 - 20
    if (a_2 < 40)
 A conditional jump has an implicit boolean LHS depending on which edge 
is taken. To evaluate ranges on the TRUE edge of the branch, the LHS is 
[1,1]. To calculate the range for a_2 we simply solve the equation:

    [1,1] = a_2 < 40
which provides the answer as [0,39].  Furthermore, substituting this 
range for a_2 as the LHS of its definition statement:

    [0,39] = b_1 - 20
The same evaluation  mechanism can calculate a result for b_1 on the 
true edge as [20,59]. This is the key feature which allows the rest of 
the algorithm to work in a general way.
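
To make the solve step concrete, a toy version of it (hypothetical types; 
the branch's real range-op classes are richer):

struct irange { unsigned lo, hi; };  // toy stand-in for the irange class

// Solve "[1,1] = (op1 < k)" for op1 on the TRUE edge, in unsigned math:
irange solve_less (unsigned k, bool edge_taken)
{
  if (edge_taken)
    return irange{0, k - 1};  // op1 < k     =>  [0, k-1], e.g. [0, 39]
  return irange{k, ~0u};      // !(op1 < k)  =>  [k, MAX]
}

// Substituting [0,39] as the LHS of "a_2 = b_1 - 20" and solving for b_1
// gives b_1 = a_2 + 20 = [20, 59], matching the text above.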


All operations are driven from this component, and the only thing 
required to add support for another tree code is to implement one or 
more methods for it here, and the rest of the range mechanism will 
simply work with it.



2 * GORI

The second component is the “Generates Outgoing Range Info” engine.  
This is a basic-block oriented component which determines what ssa-names 
have ranges created on outgoing edges, and how to calculate those ranges.


The last statement in the block is examined to see if it is a branch 
which generates range info, and then determines which ssa-names in the 
block can have ranges calculated.  It quickly walks the use-def chain 
from the branch to find other definitions within the block that can 
impact the branch and could also have their ranges calculated. This 
summary is then cached for future use.


The GORI component also uses this summary info to perform the 
calculations necessary to determine the outgoing range for any ssa_name 
which can be determined.  For example:

    c_3 = foo ()
    a_2 = b_1 - 20
    if (a_2 < 40)
The summary information would indicate that b_1 and a_2 can have their 
outgoing ranges calculated for this block, and uses the cached 
information to quickly calculate them when required.

The API consists of 2 basic methods, query and calculate:
 - bool has_edge_range_p (edge, name);
 - range outgoing_edge_range (edge, name);
If a query is made for any other ssa-name, it simply returns false and 
says this block does not generate a range for that name.  This is a key 
rationale for the summary so we only spend time processing names for 
blocks that actually have ranges generated.
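
In use that looks roughly like the following sketch (gori_compute, edge 
and tree stand in for whatever the branch actually calls these types):

void
maybe_fold_on_edge (gori_compute &gori, edge e, tree name)
{
  // Only spend time on names this block generates a range for.
  if (gori.has_edge_range_p (e, name))
    {
      irange r = gori.outgoing_edge_range (e, name);
      // ... use r to simplify uses of 'name' reached via edge e ...
    }
}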



3 * GORI cache
-
The third component builds on the basic block GORI component, and adds 
the ability to traverse the CFG and combine the various ranges provided 
by each basic block into a cache.


It provides both a global-range cache and a range-on-entry cache 
summarizing the range of an ssa_name on entry to the block.


The range-on-entry cache is self filling, and iteratively evaluates back 
edges. There is a single API entry point which asks for the range on 
entry, and if the data has not been calculated/cached already, it spawns 
the following process to calculate the data. :
    * Walk the CFG from the current block to the definition block, 
and/or any block which has range-on-entry information already 
calculated. These end points are the source blocks.

    * Any non-source blocks visited are added to a worklist to be updated.
    * Starting with the source blocks, push the known outgoing range 
from each block to each successor and update their live-on-entry values 
when they are visited.  If the incoming range changes, mark this block’s 
successors as requiring updating.

    * Continue iterating until no more updates are required.
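
A condensed, self-contained model of that propagation loop (hypothetical 
names, simplified to single-subrange unions; the branch's real code differs):

#include <algorithm>
#include <deque>
#include <map>
#include <vector>

struct Range { long lo, hi; };                 // stand-in for irange
using Block = int;

// Push outgoing ranges to successors until the range-on-entry cache
// reaches a fixed point (a classic worklist iteration).
void
fill_entry_cache (std::map<Block, std::vector<Block>> &succs,
                  std::map<Block, Range> &on_entry,
                  std::deque<Block> worklist,        // the source blocks
                  Range (*edge_range) (Block, Block))
{
  while (!worklist.empty ())
    {
      Block bb = worklist.front ();
      worklist.pop_front ();
      for (Block dest : succs[bb])
        {
          Range r = edge_range (bb, dest);
          auto [it, fresh] = on_entry.try_emplace (dest, r);
          Range u { std::min (it->second.lo, r.lo),
                    std::max (it->second.hi, r.hi) };
          bool grew = fresh || u.lo != it->second.lo
                      || u.hi != it->second.hi;
          it->second = u;
          if (grew)
            worklist.push_back (dest);  // entry range changed: revisit
        }
    }
}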

The ordering is planned by the ranger such that if there are no back 
edges, it is typically a straightforward single pass.  When back edges 
are involved, the amount of iterating is typically ver

On-Demand range technology [1/5] - Executive Summary

2019-05-22 Thread Andrew MacLeod
Now that stage 1 has reopened, I’d like to reopen a discussion about the 
technology and experiences we have from the Ranger project I brought up 
last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html .  (The 
original wiki pages are now out of date, and I will work on updating 
them soon.)


The Ranger is designed to evaluate ranges on-demand rather than through 
a top-down approach. This means you can ask for a range from anywhere, 
and it walks back thru the IL satisfying any preconditions and doing the 
required calculations.  It utilizes a cache to avoid re-doing work. If 
ranges are processed in a forward dominator order, it’s not much 
different than what we do today. Due to its nature, the order you 
process things in has minimal impact on the overall time… You can do it 
in reverse dominator order and get similar times.


It requires no outside preconditions (such as dominators) to work, and 
has a very simple API… Simply query the range of an ssa_name at any 
point in the IL and all the details are taken care of.


We have spent much of the past 6 months refining the prototype (branch 
“ssa-range”) and adjusting it to share as much code with VRP as 
possible. They are currently using a common code base for extracting 
ranges from statements, as well as simplifying statements.


The Ranger deals with just  ranges. The other aspects of  VRP are 
intended to be follow on work that integrates tightly with it, but are 
also independent and would be available for other passes to use.  These 
include:

- Equivalency tracking
- Relational processing
- Bitmask tracking

We have implemented a VRP pass that duplicates the functionality of EVRP 
(other than the bits mentioned above), as well as converted a few other 
passes to use the interface.. I do not anticipate those missing bits 
having a significant impact on the results.


The prototype branch is quite stable and can successfully build and test 
an entire Fedora distribution (9174 packages). There is an issue with 
switches I will discuss later whereby the constant range of a switch 
edge is not readily available and is exponentially expensive to 
calculate. We have a design to address that problem, and in the common 
case we are about 20% faster than EVRP is.


When utilized in passes which only require ranges for a small number of 
ssa-names we see significant improvements.  The sprintf warning pass for 
instance allows us to remove the calculations of dominators and the 
resulting forced walk order. We see a 95% speedup (yes, 1/20th of the 
overall time!).  This is primarily due to no additional overhead and 
only calculating the few things that are actually needed.  The walloca 
and wrestrict passes are a similar model, but as they have not been 
converted to use EVRP ranges yet, we don’t see similar speedups there.


That is the executive summary.  I will go into more details of each 
major thing mentioned in follow on notes so that comments and 
discussions can focus on one thing at a time.


We think this approach is very solid and has many significant benefits 
to GCC. We’d like to address any concerns you may have, and work towards 
finding a way to integrate this model with the code base during this 
stage 1.


Comments and feedback always welcome!
Thanks
Andrew


On-Demand range technology [3/5] - The Prototype

2019-05-22 Thread Andrew MacLeod
There is a functioning prototype in branch “ssa-range” which is a proof 
of concept that the approach is functional as well as quick, and can be 
used to answer questions which come up regarding what it can and can’t 
do.  Our last merge was on April 13th, so it's fairly up to date.


We have implemented a flexible range class (irange) which allows for 
multiple subranges to represent a range, and which can be extended in 
the future to types other than integral. We use this throughout, but it 
could be replaced in the ranger with any similar API. Conversion 
routines are also provided to convert from irange to value_range and 
value_range to irange.


A full set of tree_code range-op routines are implemented.  We have 
commoned as much code as possible with the existing VRP range extraction 
code.  Also, we have added additional code for calculating the other 
operands from a known result in numerous cases.


The code base in VRP has been modified (via a flag) to
    - Work completely with the native value_range like it does today.
    - Use irange and the range-ops component under the covers to 
extract ranges. Requests in VRP are then converted from value_ranges to 
irange, called into the range-op routines, and then converted back to 
value_range for VRP/EVRP’s use.
    - Do operations both ways and compare the results to make sure both 
agree on the range, and trap if they do not.


The branch defaults to the compare and trap mode to ensure everything is 
working correctly. This has been our correctness model for statement 
range extraction and was active during the Fedora package builds. The 
only time we disabled it was to do performance runs vs RVRP, and were 
looking at both branch and trunk times for EVRP and VRP.


Of note, we drop all symbolics in ranges to VARYING on everything except 
PLUS and MINUS, which we leave as native calculations if there are 
symbolics present.  More on symbolics later.


A VRP-like pass called RVRP has been implemented.
    - The vr-values statement simplification code has been factored out 
to be range agnostic, meaning that these routines can operate on either 
value_range or irange. Thus, we are using a common code base to perform 
statement simplification as well.
    - For complete compatibility with EVRP, the RVRP pass builds 
dominators and instantiates the SCEV loop information so we have loop 
range info available. RVRP does not need this info to run, but would 
miss some of the test cases which depend on loop ranges.
    - RVRP is set up to demonstrate it can process the IL in multiple 
directions and bootstraps/passes all tests in all directions.

        * Dominator order
        * Post-dominator order
        * BB1 thru BBn
        * BBn thru BB1
        * branch-only mode where only branches at the end of each BB 
are examined for folding opportunities


4 additional passes have been converted to use the ranger model:
    - sprintf - removed the dominator building/walking
    - warn alloca - replaced calls to get global ranges with calls that 
now return context sensitive ranges.

    - warn restrict - just replaced EVRP range calls with ranger calls.
    - backwards threader - enhanced to use contextual range information 
to make additional threading decisions.



Symbolic Ranges

One big concern last year expressed was my decision to abolish symbolic 
ranges.


I continue to maintain that there is no need to track the range of x_2 
as [y_3 + 5, MAX] for x_2 = y_3 + 5.   All you need to do is look at the 
definition of x_2, and the same information is exposed right there in 
the IL.  If one requires the symbolic information, the same on-demand 
process could lookup that information as well. This in turn, makes the 
code for ranges much simpler, easier to maintain, and less likely to 
introduce bugs.


We have found through our prototype that symbolics in ranges are not 
nearly as prevalent as one would think.   During the work to share a 
common code base with VRP, we found that symbolic ranges are irrelevant 
for everything except PLUS_EXPR and MINUS_EXPR. The shared code in our 
branch drops symbolics to varying immediately for everything else, and 
it has no impact on EVRP, or VRP or any tests we can find anywhere.  
Furthermore, we never trapped while comparing ranges generated by VRP 
versus generating them with range-ops which drops symbolics to varying.


We tried modifying VRP such that we don’t even create symbolic 
endpoints, but rather revert to VARYING always.  We can find no test 
case that fails because a range is not calculated properly due to 
resolving these endpoints.


There are a few that fail due to the symbolic being used to help track 
relationals, i.e.


 x_2 = y_3 + 5
    if (x_2 > y_3) // can be folded since we know x_2 must be > y_3

VRP generates a range for x of  [ y_3+5, MAX ] and at various points 
uses that to infer a relational or equivalence.  I.e., it becomes easy to 
tell that the condition must always be true

On-Demand range technology [4/5] - Performance results

2019-05-22 Thread Andrew MacLeod
We have done extensive performance analysis to help address concerns 
about the nature of an on-demand model. LLVM made an attempt at 
something similar,  but suffered from significant performance issues 
they could not solve with their approach. This approach is not the same, 
and we have seen no sign of pathological cases which are problematic.


I have trawled bugzilla looking for large problematic test cases, and 
have tried a couple of test cases from LLVM which dberlin pointed out 
they found to be problematic.  To date, I have found nothing that shows 
any kind of performance problem.  I am more than happy to entertain 
whatever cases y’all might want to throw my way.


For a test of robustness, we have built a complete Fedora distribution 
consisting of 9174 packages with the ranger branch. All packages except 
3 build successfully and pass the relevant regression tests. It appears 
2 of them are related to RVRP and are still under analysis.


Our primary performance testbase to date has been the compiler itself. 
We compile 242 .ii files from a stage1 compiler build, and compare times 
to EVRP. VRP is quite a bit slower, and although we do have an iterative 
update infrastructure, comparisons aren’t quite fair since we don’t do 
all equivalence and bitmask operations it does yet.


Before going into the numbers, I would like to visit a minor issue we 
have with switches. RVRP works from the bottom up, so in order to 
evaluate a range, it begins by getting the constant range for the LHS of 
the branch from the edge. For a conditional it is trivially [0,0] or 
[1,1] depending on TRUE or FALSE edge.


For a switch, it turns out GCC gimple representation has no simple way 
to figure this out. As a result, we need to loop over every edge in the 
switch and then union together all the cases which share that edge, or 
in the default case, intersect out all the other cases. This turns out 
to be *very* time consuming in test cases with very large switches, 
somewhere in the vicinity of O(n^3). Ugh.  So the ranger incurs a fair 
amount of overhead trying to evaluate, and re-evaluate these constant 
edges.
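
A sketch of the per-edge computation just described (stand-in types; real 
gimple switches store case labels, not per-edge ranges):

#include <utility>
#include <vector>

// Each case maps [lo, hi] to an outgoing edge index.
struct Case { long lo, hi; int edge; };
using Range = std::vector<std::pair<long, long>>;  // union of subranges

// Union every case that targets the queried (non-default) edge.  The
// default edge instead needs VARYING with all cases intersected out,
// and redoing this per query is what hurts on very large switches.
Range
switch_edge_range (const std::vector<Case> &cases, int edge)
{
  Range r;
  for (const Case &c : cases)
    if (c.edge == edge)
      r.emplace_back (c.lo, c.hi);  // subranges left unnormalized here
  return r;
}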


There are ways to address this… but for now we will present performance 
numbers with different switch configurations, each of the 5 
configurations listed here:

    1 - Calculating ranges outright from the stock branch.
    2 - Timers adjusted to exclude switch edge calculation code (i.e. 
pretend the range is available on the edge like it is with TRUE and FALSE).
    3 - Do not process switches.  We spend extra time on switches 
because we always attempt to calculate ranges very precisely as if we 
had infinite precision. There is room for trimming outcomes here, but we 
have made no attempt yet.
    4 - Just like EVRP,  RVRP currently includes building the 
dominators and integrating calls into SCEV at the beginning of each 
block to see if there are any loop range refinements.  The ranger has no 
need for this to operate, and many passes will not care.  So we produce 
a 4th number for RVRP where we don’t build any of the infrastructure it 
doesn’t need.
    5 - RVRP can also run in conditional-only mode. Rather than walking 
the entire IL trying to resolve every range, it can simply look at the 
last statement of every block asking if the branch can be folded.  This 
gets a lot of what vrp gets that affects the CFG and could be utilized 
in either lower optimization levels, or as VRP if we can push all the 
other activities it does into appropriate optimizations (such as making 
CCP range aware).  **NOTE**: This *still* calculates switch ranges, 
so includes that slow down.


All times are with a release configured compiler.  Out of the 242 files, 
pretty much across the board in all 4 sets of figures, RVRP was faster 
in about 90% of the cases, and slower in the other 10%, resulting in the 
following cumulative totals.


                                                  Overall (242 files)
    1 - Raw RVRP                                  22% slower
    2 - No edge calculation (1)                   4.5% slower
    3 - No switch processing (2)                  9.5% faster
    4 - No dominators (3)                         16% faster
    5 - Conditionals (including switches) only    4% faster

These numbers indicate that large switches are the primary cause when 
RVRP is slower.  We have various approaches we could use to address 
this.   Removing the time spent building unnecessary dominators shows a 
significant improvement as well.


We also have the results for the time spent in the passes we converted:

                        Overall (242 files)
    1 - -wrestrict      19% faster
    2 - -wprintf        95% faster
    3 - -walloca        1% slower

-wrestrict has the dominator walk removed since it is no longer needed, 
and simply calls into a ranger to get the range.


-wprintf has had the dominator build removed, as well as the EVRP range 
walk. It really benefits from very few ranges being requested… so the 
on-demand approach is a big win here 

On-Demand range technology [5/5] - Looking to the future.

2019-05-22 Thread Andrew MacLeod
A primary goal of this approach is to try to pull the various aspects of 
VRP apart and make them individually viable so they can be used at 
appropriate places as needed.  The various components of VRP were 
identified as:

    - Ranges
    - Relational queries
    - Equivalencies
    - Bitmask tracking
    - Symbolic range endpoint resolution

This prototype implementation tackles only the range component. It makes 
ranges easily accessible from anywhere in the compiler.


I have plans for addressing these components within the same model, but 
maintaining their independence.  This should make maintenance easier, 
less error prone, and ultimately be far more flexible as other passes 
can utilize whichever aspects they need.



Symbolic range endpoint resolution
--

I touched on this in the prototype section, and maintain that the only 
requirement for symbols falls under the equivalence and relational tracking.

    x_2 = y_3 + 5
    if (x_2 > y_3)
Ranges themselves in VRP are eventually resolved to a constant for 
export to the global range table.  At this point, whatever range is 
known for the symbolic is substituted into the value_range, and any 
expression resolved to come up with the final non-symbolic range.

    X_2 = [y_3 + 5, MAX]
If y_3 evaluates to [20, 30], then x_2 is resolved as [25, MAX].

The ranger does this by default on the fly due to its nature. When the 
range of x_2 is requested the first time, it evaluates y_3, comes up 
with the same [20, 30] range for y_3, and evaluates it to [25, MAX] 
immediately.


The facility is there to reevaluate the range if the range of y_3 
changes, but to this point it has not been needed. Typically it involves 
y_3 derived in some way from a back edge, and also being derived by yet 
another ssa-name from a different back edge. So, not super common.    
However, I do plan to get to this eventually to enable those 
re-calculations. For the prototype, it has not been deemed critical since 
EVRP doesn't even do back edges.


Equivalencies and other Relationals
--

The relationship between ssa names are the primary use of symbols in 
ranges today, but the actual property of relations and equivalencies has 
little to do with ranges.


I propose that we utilize the exact same model as the range operation 
database to track relationships. Equivalencies and relationals can be 
handled together, since “==” (and “!=”) is merely another relation.  Each 
tree code has a query to ask for the relationship between any of its 
operands, e.g.:

    y_2 = x_3
    j_4 = y_2 + 6
    if (j_4 > x_3)
Knowing the ranges of j_4 and x_3 doesn’t really help resolve the 
condition.  If x_3 is varying, or even a non-constant, we know nothing 
at all, at least from a range perspective.


Applying the same calculation model the ranger uses from a relational 
point of view, range-ops can be given a relational interface in which 
each tree code can evaluate the relation between its operands.   A copy 
would return “==” for the relation between the LHS and op1, so we’d have 
the relation y_2 == x_3


Operator plus would look at its operands, and be able to indicate j_4 > 
y_2 because operand 2 is a positive constant.


The branch is the one we care about, and a query would be made for the 
relation between j_4 and x_3.  By combining the relations that feed it, 
we’d get j_4 > (y_2 == x_3), and the relational result would be j_4 
> x_3.  When applied to (j_4 > x_3) the result is true.
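
A toy sketch of that composition step (hypothetical enum; the proposed 
interface is a per-tree-code query, this only shows how relations chain):

enum relation { REL_LT, REL_EQ, REL_GT, REL_UNKNOWN };

// Compose "a r1 b" with "b r2 c" into a relation between a and c.
relation
compose (relation r1, relation r2)
{
  if (r1 == REL_EQ) return r2;  // a == b: b's relation carries over
  if (r2 == REL_EQ) return r1;  // b == c: a's relation carries over
  if (r1 == r2)     return r1;  // < with <, or > with >, stays strict
  return REL_UNKNOWN;
}

// From the text: compose (REL_GT /* j_4 > y_2 */, REL_EQ /* y_2 == x_3 */)
// yields REL_GT, so the branch (j_4 > x_3) folds to true.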


So the relational query would be able to answer the question without 
ever looking at a range, although if a range is available, it may help 
refine the answer.  The lookup process is identical to the way ranges 
are currently handled, which means the same query infrastructure can be 
leveraged and used independently or in concert with ranges.


This would also benefit from not carrying information around unless it 
is requested/required. Currently all equivalences must be stored in case 
we need to know if there is an equivalency. Just like with ranges, this 
model would have no need to even look at an equivalency unless there is 
an actual need to know.


Clearly there is work to do, but this has a lot of potential as a follow 
up to the range work since it uses the same infrastructure. Any pass 
could cheaply ask about the equivalence/relation between any 2 ssa_names.



Bitmask tracking

Not to sound like a broken record, but the exact same process can also 
be applied to bitmasks.

    x_2 = y_1 | 0x01
    if (x_2 == 2)    // can never be true

Bitwise operators, as well as other operators like *, /, shifts, etc. can 
calculate bitmasks in exactly the same way ranges are calculated. I 
also considered adding them as an element of the range class, but that 
would complicate the range class, and I maintain that keeping this all 
independent is better from both a maintainability and correctness 
perspective.
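
A toy sketch of that known-bits calculation (hypothetical representation, 
not a proposed implementation):

// Where 'mask' has a 1, the bit's value is the corresponding bit of
// 'value'; all other bits are unknown.
struct bits { unsigned mask, value; };

// x = y | c forces every 1-bit of c to be a known 1 in x.
bits
fold_or_constant (bits y, unsigned c)
{
  return bits{ y.mask | c, y.value | c };
}

// For x_2 = y_1 | 0x01 with y_1 wholly unknown ({0, 0}), x_2 becomes
// {0x01, 0x01}: bit 0 is known to be 1, so (x_2 == 2) can never be true.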