Re: Determining maximum vector length supported by the CPU?
Hi Martin,

I agree, we need more information from the compiler. Esp. whether the user
specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's
covered.

Related: PR83875 - because while we're adding things in that area, it'd be
nice if they worked with target clones as well.

Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you
can easily patch your (installed) libstdc++ using
https://github.com/VcDevel/std-simd

Cheers,
  Matthias

On Mittwoch, 22. Mai 2019 08:39:25 CEST Martin Reinecke wrote:
> [Disclaimer: I sent this to gcc-help two weeks ago, but didn't get an
> answer. Maybe the topic is more suited for the main gcc list ... I
> really think the feature in question would be extremely useful to have,
> and easy to add!]
>
> Hi,
>
> I'm currently writing an FFT library which tries to make use of SIMD
> instructions and uses a lot of variables with
>
>     __attribute__ ((vector_size (xyz)))
>
> The resulting source is nicely portable and architecture-independent -
> except for the one place where I need to determine the maximum
> hardware-supported vector length on the target CPU.
>
> This currently looks like
>
>     #if defined(__AVX__)
>     constexpr int veclen=32;
>     #elif defined(__SSE2__)
>     constexpr int veclen=16;
>     [...]
>
> This approach requires me to add an #ifdef for many architectures, most
> of which I cannot really test on ... and new architectures will be
> unsupported by default.
>
> Hence my question: is there a way in gcc to determine the hardware
> vector length for the architecture the compiler is currently targeting?
> Some predefined macro like
>
>     HARDWARE_VECTOR_LENGTH_IN_BYTES
>
> which is 32 for AVX, 16 for SSE2, and has proper values for Neon, VSX
> etc. etc.
>
> If this is not provided at the moment, would it be possible to add this
> in the future? This could massively simplify writing and maintaining
> multi-platform SIMD code.
>
> Thanks,
>   Martin

--
Dr. Matthias Kretz                            https://kretzfamily.de
GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
SIMD easy and portable                        https://github.com/VcDevel/Vc
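For illustration, the detection ladder Martin describes can be extended
with further predefined macros; a sketch (the AVX-512, NEON and AltiVec
branches and the fallback value are assumptions added here, not from the
original mail, and would need checking per target):

    #if defined(__AVX512F__)
    constexpr int veclen = 64;
    #elif defined(__AVX__)
    constexpr int veclen = 32;
    #elif defined(__SSE2__) || defined(__ARM_NEON) || defined(__ALTIVEC__)
    constexpr int veclen = 16;
    #else
    constexpr int veclen = 8;  // conservative fallback for untested targets
    #endif

This is exactly the maintenance burden the proposed macro would remove:
every new ISA needs yet another branch.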
Re: Determining maximum vector length supported by the CPU?
Hi Matthias!

> I agree, we need more information from the compiler. Esp. whether the user
> specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
> OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's
> covered.

Almost ... except that I'd need a platform-agnostic definition. The point
is that the code does not care about the underlying hardware at all, only
about the vector length supported by it.

> Related: PR83875 - because while we're adding things in that area, it'd be
> nice if they worked with target clones as well.

Yes, this is a problem I've come across as well in the past.
(https://gcc.gnu.org/ml/gcc-help/2018-10/msg00118.html)

> Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you
> can easily patch your (installed) libstdc++ using
> https://github.com/VcDevel/std-simd

This looks extremely interesting! I have to look at it in more detail, but
this might be the way to go in the future.
However, the code I'm working on may be incorporated into numpy/scipy at
some point, and the minimum required compilers for these packages are
pretty old. I can't expect more than vanilla C++11 support there.

Cheers,
  Martin
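For reference, a sketch of how std::experimental::simd answers the
question, assuming the <experimental/simd> header (or the std-simd patch)
is available; note it needs C++17, which is exactly the constraint Martin
mentions:

    #include <cstddef>
    #include <experimental/simd>
    namespace stdx = std::experimental;

    // The "native" ABI tag picks the widest vector the target handles
    // well, so the hardware vector length falls out per element type:
    constexpr std::size_t lanes  = stdx::native_simd<float>::size();
    constexpr std::size_t veclen = lanes * sizeof(float); // e.g. 32 w/ AVX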
Re: Determining maximum vector length supported by the CPU?
On Wed, May 22, 2019 at 10:36 AM Martin Reinecke wrote:
>
> Hi Matthias!
>
> > I agree, we need more information from the compiler. Esp. whether the user
> > specified `-mprefer-avx128` or `-mprefer-vector-width=none/128/256/512`.
> > OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So that's
> > covered.
>
> Almost ... except that I'd need a platform-agnostic definition. The
> point is that the code does not care about the underlying hardware at
> all, only about the vector length supported by it.

And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal)
length depends on the component type...

I wonder if we'd want to have an 'auto' length instead ;)

I suppose exposing a __BIGGEST_VECTOR__ might be possible (not for SVE
though?).

> > Related: PR83875 - because while we're adding things in that area, it'd be
> > nice if they worked with target clones as well.
>
> Yes, this is a problem I've come across as well in the past.
> (https://gcc.gnu.org/ml/gcc-help/2018-10/msg00118.html)
>
> > Are you aware of std::experimental::simd? It didn't make GCC 9.1, but you
> > can easily patch your (installed) libstdc++ using
> > https://github.com/VcDevel/std-simd
>
> This looks extremely interesting! I have to look at it in more detail,
> but this might be the way to go in the future.
> However, the code I'm working on may be incorporated into numpy/scipy at
> some point, and the minimum required compilers for these packages are
> pretty old. I can't expect more than vanilla C++11 support there.
>
> Cheers,
>   Martin
Re: Determining maximum vector length supported by the CPU?
On 5/22/19 11:17 AM, Richard Biener wrote:
>> Almost ... except that I'd need a platform-agnostic definition. The
>> point is that the code does not care about the underlying hardware at
>> all, only about the vector length supported by it.
>
> And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal)
> length depends on the component type...

You mean different vector lengths for float, double, int etc.? I would be
fine with having different macros for those, if necessary.

> I wonder if we'd want to have an 'auto' length instead ;)

I know it's weird, but for my code (which is definitely not a synthetic
test case) that would work perfectly :)

If you'd like to have a look:
https://gitlab.mpcdf.mpg.de/mtr/pocketfft/tree/cpp
(the only platform-dependent part is on lines 82-95 of the header file).

Still, I would need a way to determine how long the vectors actually are.
But it would probably be enough to measure this at runtime then.

Cheers,
  Martin
Re: Determining maximum vector length supported by the CPU?
On Mittwoch, 22. Mai 2019 11:17:57 CEST Richard Biener wrote:
> On Wed, May 22, 2019 at 10:36 AM Martin Reinecke wrote:
> > Hi Matthias!
> >
> > > I agree, we need more information from the compiler. Esp. whether the
> > > user specified `-mprefer-avx128` or
> > > `-mprefer-vector-width=none/128/256/512`.
> > > OTOH `-msve-vector-bits=N` is reported as __ARM_FEATURE_SVE_BITS. So
> > > that's covered.
> >
> > Almost ... except that I'd need a platform-agnostic definition. The
> > point is that the code does not care about the underlying hardware at
> > all, only about the vector length supported by it.
>
> And then you run into AVX + SSE vs. AVX2 + SSE cases where the (optimal)
> length depends on the component type...
>
> I wonder if we'd want to have an 'auto' length instead ;)
>
> I suppose exposing a __BIGGEST_VECTOR__ might be possible (not for SVE
> though?).

AVX vs. AVX2 is a valid question. std::experimental::simd solves it by
having the size depend on the element type. So I agree, `vector_size(auto)`
seems like a possible solution. One could then take the sizeof if the
number is important to know.

--
Dr. Matthias Kretz                            https://kretzfamily.de
GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
SIMD easy and portable                        https://github.com/VcDevel/Vc
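A sketch of the sizeof idea with today's explicit vector_size attribute
(the `auto` variant discussed above is hypothetical and does not exist in
GCC):

    typedef float vf __attribute__ ((vector_size (16)));

    constexpr int bytes = sizeof (vf);                   // 16
    constexpr int lanes = sizeof (vf) / sizeof (float);  // 4

With a hypothetical vector_size(auto), the same two sizeof expressions
would recover the target's preferred width and lane count.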
Re: Determining maximum vector length supported by the CPU?
On Mittwoch, 22. Mai 2019 11:27:25 CEST Martin Reinecke wrote:
> Still, I would need a way to determine how long the vectors actually
> are. But it would probably be enough to measure this at runtime then.

FWIW, something that took me way too long to figure out: You can use
vector builtins very conveniently in C++ with the traits I define at
https://github.com/VcDevel/std-simd/blob/59e6348a9d34b4ef4f5ef1fc4f423dd75e1987f3/experimental/bits/simd.h#L925
(ignore _SimdWrapper: I'm working on phasing it out after I discovered the
_VectorTraits solution). You'll need __vector_type, __is_vector_type and
_VectorTraits and then you can write:

    template <class T, class VT = _VectorTraits<T>>
    void f(T x)
    {
      using element_type = typename VT::value_type;
      constexpr auto N = VT::_S_width;
      ...
    }

f(x) only participates in overload resolution if x is a vector builtin
type, because otherwise _VectorTraits leads to a substitution failure
(SFINAE).

--
Dr. Matthias Kretz                            https://kretzfamily.de
GSI Helmholtzzentrum für Schwerionenforschung https://gsi.de
SIMD easy and portable                        https://github.com/VcDevel/Vc
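Usage would look roughly like this, assuming the _VectorTraits machinery
above is in scope (the driver code is invented for illustration):

    typedef float vf __attribute__ ((vector_size (16)));

    vf x = {};
    f(x);        // OK: _VectorTraits<vf> is valid, so f participates
    // f(1.0f);  // no matching function: _VectorTraits<float> fails to
    //           // substitute, so f drops out of overload resolution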
Re: -Wformat-diag: floating-point or floating point?
On 21/05/2019 21:18, Bill Schmidt wrote:
> On 5/21/19 11:47 AM, Martin Sebor wrote:
>> The GCC coding style says to use "floating-point" as an adjective
>> rather than "floating point."  After enhancing the -Wformat-diag
>> checker to detect this I found a bunch of uses of the latter, such
>> as in:
>>
>>   gcc/c/c-decl.c:10944
>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>   gcc/convert.c:418, 422
>>   gcc/cp/call.c:5070
>>   gcc/cp/cvt.c:886
>>
>> Before I fix them all and adjust the tests, I want to make sure
>> we really want to follow this rule.  The C standard uses both
>> interchangeably.  With just one exception, the C++ standard uses
>> the hyphenated form.
>
> The hyphenated form is correct English, so I certainly prefer it. :-)

It's not quite as simple as that.  Hyphens should be used to make it
clear what is the adjective and what is the noun:

   A floating-point number (hyphenated) is a number with a
   floating point (no hyphen).

In the first case 'floating-point' is the adjective and qualifies
'number'.  In the second case 'floating' is the adjective and qualifies
'point'.

But this is English, so there are probably some exceptions even then -
but not in this case, I think. :-)

R.
Re: mfentry and Darwin.
Hi Uros,

> On 21 May 2019, at 19:36, Uros Bizjak wrote:
>
> On Tue, May 21, 2019 at 6:15 PM Iain Sandoe wrote:
>>
>> It seems to me that (even if it was working "properly", which it isn't)
>> '-mfentry' would break ABI on Darwin for both 32 and 64b - which
>> require 16-byte stack alignment at call sites.
>>
>> For Darwin, the dynamic loader enforces the requirement when it can and
>> will abort a program that tries to make a DSO linkage with the stack in
>> an incorrect alignment. We previously had a bug against profiling
>> caused by exactly this issue (but when the mcount call was in the
>> post-prologue position).
>>
>> Actually, I'm not sure why it's not an issue for other 64b platforms
>> that use the psABI (AFAIR, it's only the 32b case that's
>> Darwin-specific).
>
> The __fentry__ in glibc is written as a wrapper around the call to
> __mcount_internal, and is written in such a way that it compensates
> stack misalignment in a call to __mcount_internal. __fentry__ survives
> stack misalignment, since no xmm regs are saved to the stack in the
> function.

Well, we can't change Darwin's libc to do something similar (and anyway
the dynamic loader would also need to know that this was special, to
avoid aborting the exe). ... however we could do a dodge where some shim
code was inserted into any TU that used mfentry to redirect the call to
an ABI-compliant launchpad... etc. etc.

It seems we can't instrument "exactly at the entry" .. only "pretty close
to it".

>> Anyway, my current plan is to disable mfentry (for Darwin) - the
>> alternative might be some kind of "almost at the start of the function,
>> but needing some stack alignment change",
>>
>> I'm interested in if you know of any compelling use-cases that would
>> make it worth finding some work-around instead of disabling.
>
> Unfortunately, not off the top of my head...

Well, I can't either, so for now I'm going to make a patch to disable it
(since it's not fully working in any case) .. if anyone screams that
there's a major reduction in functionality, we can investigate some
scheme as mentioned above.

thanks
Iain
Re: -Wformat-diag: floating-point or floating point?
On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
> On 21/05/2019 21:18, Bill Schmidt wrote:
>> On 5/21/19 11:47 AM, Martin Sebor wrote:
>>> The GCC coding style says to use "floating-point" as an adjective
>>> rather than "floating point."  After enhancing the -Wformat-diag
>>> checker to detect this I found a bunch of uses of the latter, such
>>> as in:
>>>
>>>   gcc/c/c-decl.c:10944
>>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>>   gcc/convert.c:418, 422
>>>   gcc/cp/call.c:5070
>>>   gcc/cp/cvt.c:886
>>>
>>> Before I fix them all and adjust the tests, I want to make sure
>>> we really want to follow this rule.  The C standard uses both
>>> interchangeably.  With just one exception, the C++ standard uses
>>> the hyphenated form.
>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>
> It's not quite as simple as that.  Hyphens should be used to make it
> clear what is the adjective and what is the noun:
>
>    A floating-point number (hyphenated) is a number with a
>    floating point (no hyphen).
>
> In the first case 'floating-point' is the adjective and qualifies
> 'number'.  In the second case 'floating' is the adjective and qualifies
> 'point'.
>
> But this is English, so there are probably some exceptions even then -
> but not in this case, I think. :-)

English is always fun, agreed -- Martin cited the requirement to use
"floating-point" when it's used as an adjective, which is certainly
correct.

There's a more interesting question around cavalier usage such as,
"We should use floating point."  I would argue that there is an implied
noun "arithmetic" modified here, so this should also be hyphenated,
but I daresay there would be people on both sides of this one...

This is why grammar police usually die from friendly fire. :-)

Bill

> R.
Re: -Wformat-diag: floating-point or floating point?
On 22/05/2019 13:17, Bill Schmidt wrote:
> On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
>> On 21/05/2019 21:18, Bill Schmidt wrote:
>>> On 5/21/19 11:47 AM, Martin Sebor wrote:
>>>> The GCC coding style says to use "floating-point" as an adjective
>>>> rather than "floating point."  After enhancing the -Wformat-diag
>>>> checker to detect this I found a bunch of uses of the latter, such
>>>> as in:
>>>>
>>>>   gcc/c/c-decl.c:10944
>>>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>>>   gcc/convert.c:418, 422
>>>>   gcc/cp/call.c:5070
>>>>   gcc/cp/cvt.c:886
>>>>
>>>> Before I fix them all and adjust the tests, I want to make sure
>>>> we really want to follow this rule.  The C standard uses both
>>>> interchangeably.  With just one exception, the C++ standard uses
>>>> the hyphenated form.
>>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>>
>> It's not quite as simple as that.  Hyphens should be used to make it
>> clear what is the adjective and what is the noun:
>>
>>    A floating-point number (hyphenated) is a number with a
>>    floating point (no hyphen).
>>
>> In the first case 'floating-point' is the adjective and qualifies
>> 'number'.  In the second case 'floating' is the adjective and qualifies
>> 'point'.
>>
>> But this is English, so there are probably some exceptions even then -
>> but not in this case, I think. :-)
>
> English is always fun, agreed -- Martin cited the requirement to use
> "floating-point" when it's used as an adjective, which is certainly
> correct.
>
> There's a more interesting question around cavalier usage such as,
> "We should use floating point."  I would argue that there is an implied
> noun "arithmetic" modified here, so this should also be hyphenated,
> but I daresay there would be people on both sides of this one...

I would argue that leaving out "arithmetic" is the error. :-)

> This is why grammar police usually die from friendly fire. :-)

Sticking your head above the parapet is always fraught with danger :)

R.
Re: -Wformat-diag: floating-point or floating point?
On 22/05/2019 13:17, Bill Schmidt wrote:
> "We should use floating point."

In fact, since we're compiler folk, we just say it's undefined behaviour
and optimize it to "."  :-P

R.
[AArch64 ELF ABI] Vector calls and lazy binding on AArch64
The lazy binding code of aarch64 currently only preserves q0-q7 of the fp
registers, but for an SVE call [AAPCS64+SVE] it should preserve p0-p3 and
z0-z23, and for an AdvSIMD vector call [VABI64] it should preserve q0-q23.
(Vector calls are extensions of the base PCS [AAPCS64].)

A possible fix is to save and restore the additional register state in the
lazy binding entry code. This was discussed in
https://sourceware.org/ml/libc-alpha/2018-08/msg00017.html
The main objections were:

(1) Linux may optimize the kernel entry code for processes that don't use
    SVE, so lazy binding should avoid accessing SVE registers.

(2) If this is fixed in the dynamic linker, vector calls will not be
    backward compatible with old glibc.

(3) The saved SVE register state can be large (> 8K), so binaries that
    work today may run out of stack space on an SVE system during lazy
    binding (which can e.g. happen in a signal handler on a tiny stack).

and the proposed solution was to force bind-now semantics for vector
functions, e.g. by not calling them via PLT. This turned out to be harder
than I expected.

I no longer think (1) and (2) are critically important, but (3) is a
correctness issue which is hard to argue away (it would require larger
stack allocations to accommodate the worst case stack size increase, but
the stack allocation is not always under the control of glibc, so it
cannot provide strict guarantees).

Some approaches to make symbols "bind now" were discussed at
https://groups.google.com/forum/#!topic/generic-abi/Bfb2CwX-u4M

The ABI change draft is below the notes. It requires marking symbols in
the ELF symbol table that follow the vector PCS (or other variant PCS
conventions). This is most relevant to dynamic linkers with lazy binding
support and to ELF linkers targeting AArch64, but assemblers will need to
be updated too.

Note 1: the dynamic linker may have to run user code during lazy binding
because of ifunc resolvers, so it cannot avoid clobbering fp regs.

Note 2: the tlsdesc entry is also affected by (3), so either the initial
DTV setup should avoid clobbering fp regs or the SVE register state should
not be callee-preserved by the tlsdesc call ABI (the latter was chosen,
which is backward compatible with old dynamic linkers, but tls access from
SVE code is as expensive as an extern call now: the caller has to spill).

Note 3: signal frame and SVE register spills in code using SVE can also
lead to variable stack usage (AT_MINSIGSTKSZ was introduced to address the
former issue on linux), so it is a valid approach to just increase min
stack size limits on aarch64 compared to other targets (this is less
invasive, but does not fix old binaries).

Note 4: the proposal requires marking symbols in asm and elf objects, so
it is not compatible with existing tooling (old as or ld cannot create
valid vector function symbol references or definitions) and it is only
effective with a new dynamic linker.

Note 5: -fno-plt style code generation for vector function calls might
have worked too, but on aarch64 it requires compiler and linker changes to
avoid PLT in position dependent code when that is emitted for the sake of
pointer equality. It also requires tightening the ABI to ensure the static
linker does not introduce PLT when processing certain static relocations.
This approach would generate suboptimal static linked code (the no-plt
code is hard to relax into direct calls on aarch64), be fragile (easy to
accidentally introduce a PLT) and be hard to diagnose.
Note 6: the proposed solution applies to both SVE calls and AdvSIMD vector
calls, even though some issues only apply to SVE.

Note 7: a separate dynamic linker entry point for variant PCS calls may be
introduced (requires further ELF changes for a PLT0-like stub) or the
dynamic linker may decide to always preserve all registers or decide to
always bind symbols at load time.

AAELF64: in the Symbol Table section add

st_other Values

The st_other member of a symbol table entry specifies the symbol's
visibility in the lowest 2 bits. The top 6 bits are unused in the generic
ELF ABI [SCO-ELF], and while there are no values reserved for
processor-specific semantics, many other architectures have used these
bits.

The defined processor-specific st_other flag values are listed in
Table 4-5-1.

Table 4-5-1, Processor specific st_other flags

  +-------------------------+------+----------------------+
  | Name                    | Mask | Comment              |
  +-------------------------+------+----------------------+
  | STO_AARCH64_VARIANT_PCS | 0x80 | The function         |
  |                         |      | associated with the  |
  |                         |      | symbol may follow a  |
  |                         |      | variant procedure    |
  |                         |      | call standard with   |
  |                         |      | different register   |
  |                         |      | usage convention.    |
  +-------------------------+------+----------------------+

A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS flag
set in its st_other field may be associated with a function that follows a
variant procedure call standard with different register usage convention
from the one defined in the base procedure call standard for the list of
argument, caller-saved and callee-saved registers [AAPCS64]. The rules in
the Call and Jump relocations section still apply to such functions, and
if a subroutine is called via a symbol reference that is marked with
STO_AARCH64_VARIANT_PCS then code that runs between the calling routine
and called subroutine must preserve the contents of all registers except
IP0, IP1 and the condition code flags [AAPCS64].
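For illustration, a sketch of how a dynamic linker could consume the
proposed flag when deciding to force bind-now for a symbol (the helper is
invented here; it is not glibc's actual implementation):

    #include <elf.h>

    #ifndef STO_AARCH64_VARIANT_PCS
    # define STO_AARCH64_VARIANT_PCS 0x80  /* proposed st_other flag */
    #endif

    /* Variant-PCS functions may clobber registers that the lazy-binding
       trampoline does not save, so resolve such symbols at load time.  */
    static inline int
    force_bind_now (const Elf64_Sym *sym)
    {
      return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
    }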
Re: -Wformat-diag: floating-point or floating point?
On 5/22/19 6:27 AM, Richard Earnshaw (lists) wrote:
> On 22/05/2019 13:17, Bill Schmidt wrote:
>> On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
>>> On 21/05/2019 21:18, Bill Schmidt wrote:
>>>> On 5/21/19 11:47 AM, Martin Sebor wrote:
>>>>> The GCC coding style says to use "floating-point" as an adjective
>>>>> rather than "floating point."  After enhancing the -Wformat-diag
>>>>> checker to detect this I found a bunch of uses of the latter, such
>>>>> as in:
>>>>>
>>>>>   gcc/c/c-decl.c:10944
>>>>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>>>>   gcc/convert.c:418, 422
>>>>>   gcc/cp/call.c:5070
>>>>>   gcc/cp/cvt.c:886
>>>>>
>>>>> Before I fix them all and adjust the tests, I want to make sure
>>>>> we really want to follow this rule.  The C standard uses both
>>>>> interchangeably.  With just one exception, the C++ standard uses
>>>>> the hyphenated form.
>>>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>> It's not quite as simple as that.  Hyphens should be used to make it
>>> clear what is the adjective and what is the noun:
>>>
>>>    A floating-point number (hyphenated) is a number with a
>>>    floating point (no hyphen).
>>>
>>> In the first case 'floating-point' is the adjective and qualifies
>>> 'number'.  In the second case 'floating' is the adjective and
>>> qualifies 'point'.
>>>
>>> But this is English, so there are probably some exceptions even then -
>>> but not in this case, I think. :-)
>>
>> English is always fun, agreed -- Martin cited the requirement to use
>> "floating-point" when it's used as an adjective, which is certainly
>> correct.
>>
>> There's a more interesting question around cavalier usage such as,
>> "We should use floating point."  I would argue that there is an implied
>> noun "arithmetic" modified here, so this should also be hyphenated,
>> but I daresay there would be people on both sides of this one...
>
> I would argue that leaving out "arithmetic" is the error. :-)

I agree.  Unfortunately, there are a few cases like that among
the diagnostics that my script has already fixed:

  decimal floating point not supported
  comparing floating point with %<==%> or %<!=%> is unsafe
  ISO C does not support decimal floating point

They probably should read

  decimal floating point types not supported
  comparing floating-point values with %<==%> or %<!=%> is unsafe
  ISO C does not support decimal floating point types

I think they can be adjusted later if we think it's important,
after the checker is finalized and committed.  I don't want to
complicate it too much by having to differentiate between
adjectives and nouns.  The vast majority of the "floating point"
instances it has found are adjectives.

Martin

>> This is why grammar police usually die from friendly fire. :-)
>
> Sticking your head above the parapet is always fraught with danger :)
>
> R.
Re: -Wformat-diag: floating-point or floating point?
On 5/22/19 9:58 AM, Martin Sebor wrote:
> On 5/22/19 6:27 AM, Richard Earnshaw (lists) wrote:
>> On 22/05/2019 13:17, Bill Schmidt wrote:
>>> On 5/22/19 5:19 AM, Richard Earnshaw (lists) wrote:
>>>> On 21/05/2019 21:18, Bill Schmidt wrote:
>>>>> On 5/21/19 11:47 AM, Martin Sebor wrote:
>>>>>> The GCC coding style says to use "floating-point" as an adjective
>>>>>> rather than "floating point."  After enhancing the -Wformat-diag
>>>>>> checker to detect this I found a bunch of uses of the latter, such
>>>>>> as in:
>>>>>>
>>>>>>   gcc/c/c-decl.c:10944
>>>>>>   gcc/c/c-parser.c:9423, 9446, 9450, etc.
>>>>>>   gcc/convert.c:418, 422
>>>>>>   gcc/cp/call.c:5070
>>>>>>   gcc/cp/cvt.c:886
>>>>>>
>>>>>> Before I fix them all and adjust the tests, I want to make sure
>>>>>> we really want to follow this rule.  The C standard uses both
>>>>>> interchangeably.  With just one exception, the C++ standard uses
>>>>>> the hyphenated form.
>>>>> The hyphenated form is correct English, so I certainly prefer it. :-)
>>>> It's not quite as simple as that.  Hyphens should be used to make it
>>>> clear what is the adjective and what is the noun:
>>>>
>>>>    A floating-point number (hyphenated) is a number with a
>>>>    floating point (no hyphen).
>>>>
>>>> In the first case 'floating-point' is the adjective and qualifies
>>>> 'number'.  In the second case 'floating' is the adjective and
>>>> qualifies 'point'.
>>>>
>>>> But this is English, so there are probably some exceptions even
>>>> then - but not in this case, I think. :-)
>>>
>>> English is always fun, agreed -- Martin cited the requirement to use
>>> "floating-point" when it's used as an adjective, which is certainly
>>> correct.
>>>
>>> There's a more interesting question around cavalier usage such as,
>>> "We should use floating point."  I would argue that there is an
>>> implied noun "arithmetic" modified here, so this should also be
>>> hyphenated, but I daresay there would be people on both sides of this
>>> one...
>>
>> I would argue that leaving out "arithmetic" is the error. :-)
>
> I agree.  Unfortunately, there are a few cases like that among
> the diagnostics that my script has already fixed:
>
>   decimal floating point not supported
>   comparing floating point with %<==%> or %<!=%> is unsafe
>   ISO C does not support decimal floating point
>
> They probably should read
>
>   decimal floating point types not supported
>   comparing floating-point values with %<==%> or %<!=%> is unsafe
>   ISO C does not support decimal floating point types

"decimal floating point types" does use "floating-point" to modify
"types", so if you change those they should probably remain hyphenated.
Technically the whole phrase "decimal floating point" modifies "types"
together, and should be hyphenated together, but that just looks fussy
and isn't common practice.

None of those details are going to solve world hunger. :-)  Thanks for
fixing the general problem!

For the edge cases, Richard's optimization looks better and better. :-P

Bill

> I think they can be adjusted later if we think it's important,
> after the checker is finalized and committed.  I don't want to
> complicate it too much by having to differentiate between
> adjectives and nouns.  The vast majority of the "floating point"
> instances it has found are adjectives.
>
> Martin
>
>>> This is why grammar police usually die from friendly fire. :-)
>>
>> Sticking your head above the parapet is always fraught with danger :)
>>
>> R.
Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64
* Szabolcs Nagy:

> AAELF64: in the Symbol Table section add
>
> st_other Values
>
> The st_other member of a symbol table entry specifies the symbol's
> visibility in the lowest 2 bits. The top 6 bits are unused in the
> generic ELF ABI [SCO-ELF], and while there are no values reserved for
> processor-specific semantics, many other architectures have used these
> bits.
>
> The defined processor-specific st_other flag values are listed in
> Table 4-5-1.
>
> Table 4-5-1, Processor specific st_other flags
>
>   +-------------------------+------+----------------------+
>   | Name                    | Mask | Comment              |
>   +-------------------------+------+----------------------+
>   | STO_AARCH64_VARIANT_PCS | 0x80 | The function         |
>   |                         |      | associated with the  |
>   |                         |      | symbol may follow a  |
>   |                         |      | variant procedure    |
>   |                         |      | call standard with   |
>   |                         |      | different register   |
>   |                         |      | usage convention.    |
>   +-------------------------+------+----------------------+
>
> A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS
> flag set in its st_other field may be associated with a function that
> follows a variant procedure call standard with different register
> usage convention from the one defined in the base procedure call
> standard for the list of argument, caller-saved and callee-saved
> registers [AAPCS64]. The rules in the Call and Jump relocations
> section still apply to such functions, and if a subroutine is called
> via a symbol reference that is marked with STO_AARCH64_VARIANT_PCS
> then code that runs between the calling routine and called subroutine
> must preserve the contents of all registers except IP0, IP1 and the
> condition code flags [AAPCS64].

Can you clarify if there has to be a valid stack at this point which can
be used during the call transfer?  What about the stack alignment
requirement?

Thanks,
Florian
Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64
On 22/05/2019 16:06, Florian Weimer wrote:
> * Szabolcs Nagy:
>
>> AAELF64: in the Symbol Table section add
>>
>> st_other Values
>>
>> The st_other member of a symbol table entry specifies the symbol's
>> visibility in the lowest 2 bits. The top 6 bits are unused in the
>> generic ELF ABI [SCO-ELF], and while there are no values reserved for
>> processor-specific semantics, many other architectures have used these
>> bits.
>>
>> The defined processor-specific st_other flag values are listed in
>> Table 4-5-1.
>>
>> Table 4-5-1, Processor specific st_other flags
>>
>>   +-------------------------+------+----------------------+
>>   | Name                    | Mask | Comment              |
>>   +-------------------------+------+----------------------+
>>   | STO_AARCH64_VARIANT_PCS | 0x80 | The function         |
>>   |                         |      | associated with the  |
>>   |                         |      | symbol may follow a  |
>>   |                         |      | variant procedure    |
>>   |                         |      | call standard with   |
>>   |                         |      | different register   |
>>   |                         |      | usage convention.    |
>>   +-------------------------+------+----------------------+
>>
>> A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS
>> flag set in its st_other field may be associated with a function that
>> follows a variant procedure call standard with different register
>> usage convention from the one defined in the base procedure call
>> standard for the list of argument, caller-saved and callee-saved
>> registers [AAPCS64]. The rules in the Call and Jump relocations
>> section still apply to such functions, and if a subroutine is called
>> via a symbol reference that is marked with STO_AARCH64_VARIANT_PCS
>> then code that runs between the calling routine and called subroutine
>> must preserve the contents of all registers except IP0, IP1 and the
>> condition code flags [AAPCS64].
>
> Can you clarify if there has to be a valid stack at this point which can
> be used during the call transfer?  What about the stack alignment
> requirement?

the intention is to only allow 'register usage convention' to be relaxed
compared to the base PCS (which has rules for stack etc), and even the
register usage convention has to be compatible with the 'Call and Jump
relocations' section, which essentially says that veneers inserted by the
linker between calls can clobber IP0, IP1 and the condition flags.

i.e. a variant pcs function follows the same rules as base pcs, but it
may use different caller-/callee-saved/argument registers.

when SVE pcs is merged into the current AAPCS document, then i hope the
'variant pcs' term used here will be properly specified so the ELF ABI
will just refer back to that.
Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64
* Szabolcs Nagy:

> On 22/05/2019 16:06, Florian Weimer wrote:
>> * Szabolcs Nagy:
>>
>>> AAELF64: in the Symbol Table section add
>>>
>>> st_other Values
>>>
>>> The st_other member of a symbol table entry specifies the symbol's
>>> visibility in the lowest 2 bits. The top 6 bits are unused in the
>>> generic ELF ABI [SCO-ELF], and while there are no values reserved for
>>> processor-specific semantics, many other architectures have used
>>> these bits.
>>>
>>> The defined processor-specific st_other flag values are listed in
>>> Table 4-5-1.
>>>
>>> Table 4-5-1, Processor specific st_other flags
>>>
>>>   +-------------------------+------+----------------------+
>>>   | Name                    | Mask | Comment              |
>>>   +-------------------------+------+----------------------+
>>>   | STO_AARCH64_VARIANT_PCS | 0x80 | The function         |
>>>   |                         |      | associated with the  |
>>>   |                         |      | symbol may follow a  |
>>>   |                         |      | variant procedure    |
>>>   |                         |      | call standard with   |
>>>   |                         |      | different register   |
>>>   |                         |      | usage convention.    |
>>>   +-------------------------+------+----------------------+
>>>
>>> A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>> flag set in its st_other field may be associated with a function that
>>> follows a variant procedure call standard with different register
>>> usage convention from the one defined in the base procedure call
>>> standard for the list of argument, caller-saved and callee-saved
>>> registers [AAPCS64]. The rules in the Call and Jump relocations
>>> section still apply to such functions, and if a subroutine is called
>>> via a symbol reference that is marked with STO_AARCH64_VARIANT_PCS
>>> then code that runs between the calling routine and called subroutine
>>> must preserve the contents of all registers except IP0, IP1 and the
>>> condition code flags [AAPCS64].
>>
>> Can you clarify if there has to be a valid stack at this point which
>> can be used during the call transfer?  What about the stack alignment
>> requirement?
>
> the intention is to only allow 'register usage convention' to be
> relaxed compared to the base PCS (which has rules for stack etc),
> and even the register usage convention has to be compatible with
> the 'Call and Jump relocations' section, which essentially says that
> veneers inserted by the linker between calls can clobber IP0, IP1
> and the condition flags.
>
> i.e. a variant pcs function follows the same rules as base pcs, but
> it may use different caller-/callee-saved/argument registers.
>
> when SVE pcs is merged into the current AAPCS document, then i hope
> the 'variant pcs' term used here will be properly specified so the
> ELF ABI will just refer back to that.

My concern is that with the current language, it's not clear whether
it's possible to use the stack as a scratch area during the call
transition, or rely on a valid TCB.  I think this is rather
underspecified.

Thanks,
Florian
Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64
On 22/05/2019 16:34, Florian Weimer wrote:
> * Szabolcs Nagy:
>
>> On 22/05/2019 16:06, Florian Weimer wrote:
>>> * Szabolcs Nagy:
>>>
>>>> AAELF64: in the Symbol Table section add
>>>>
>>>> st_other Values
>>>>
>>>> The st_other member of a symbol table entry specifies the symbol's
>>>> visibility in the lowest 2 bits. The top 6 bits are unused in the
>>>> generic ELF ABI [SCO-ELF], and while there are no values reserved
>>>> for processor-specific semantics, many other architectures have used
>>>> these bits.
>>>>
>>>> The defined processor-specific st_other flag values are listed in
>>>> Table 4-5-1.
>>>>
>>>> Table 4-5-1, Processor specific st_other flags
>>>>
>>>>   +-------------------------+------+----------------------+
>>>>   | Name                    | Mask | Comment              |
>>>>   +-------------------------+------+----------------------+
>>>>   | STO_AARCH64_VARIANT_PCS | 0x80 | The function         |
>>>>   |                         |      | associated with the  |
>>>>   |                         |      | symbol may follow a  |
>>>>   |                         |      | variant procedure    |
>>>>   |                         |      | call standard with   |
>>>>   |                         |      | different register   |
>>>>   |                         |      | usage convention.    |
>>>>   +-------------------------+------+----------------------+
>>>>
>>>> A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>>> flag set in its st_other field may be associated with a function
>>>> that follows a variant procedure call standard with different
>>>> register usage convention from the one defined in the base procedure
>>>> call standard for the list of argument, caller-saved and
>>>> callee-saved registers [AAPCS64]. The rules in the Call and Jump
>>>> relocations section still apply to such functions, and if a
>>>> subroutine is called via a symbol reference that is marked with
>>>> STO_AARCH64_VARIANT_PCS then code that runs between the calling
>>>> routine and called subroutine must preserve the contents of all
>>>> registers except IP0, IP1 and the condition code flags [AAPCS64].
>>>
>>> Can you clarify if there has to be a valid stack at this point which
>>> can be used during the call transfer?  What about the stack alignment
>>> requirement?
>>
>> the intention is to only allow 'register usage convention' to be
>> relaxed compared to the base PCS (which has rules for stack etc),
>> and even the register usage convention has to be compatible with
>> the 'Call and Jump relocations' section, which essentially says that
>> veneers inserted by the linker between calls can clobber IP0, IP1
>> and the condition flags.
>>
>> i.e. a variant pcs function follows the same rules as base pcs, but
>> it may use different caller-/callee-saved/argument registers.
>>
>> when SVE pcs is merged into the current AAPCS document, then i hope
>> the 'variant pcs' term used here will be properly specified so the
>> ELF ABI will just refer back to that.
>
> My concern is that with the current language, it's not clear whether
> it's possible to use the stack as a scratch area during the call
> transition, or rely on a valid TCB.  I think this is rather
> underspecified.

i think that's underspecified in general for normal calls too; currently
the glibc dynamic linker assumes it can use some stack space and do
various async signal safe operations (some of which may even fail).
variant pcs does not change any of this. it only provides a per symbol
escape hatch for functions with a bit special call convention, and i plan
to use the symbol marking in glibc as 'force bind now for these symbols',
because other behaviour may not be forward compatible if the architecture
changes again (if lazy binding turns out to be very important for these
symbols i'd prefer introducing a second entry point for them instead of
checking the elf flags from the entry asm).

i'll try to post patches implementing this abi soon.
RTL Layer Paralleling
Greetings All,

After giving it some thought, I've yet to see any real focus on making the
RTL layers, or just the architectural backends, parallel in nature. Is
there a reason for this? My assumption, after reading parts of the x86 and
rs6000 backend code, is that it's not needed. This seems to be the case as
most of the profiling, at least in my research, only really shows
bottlenecking at the gimple layer. Is this because little time has been
put into researching the RTL passes, or is it just assumed that they have
no requirements relative to gimple (and perhaps the post-GENERIC/
pre-GIMPLE passes)?

Just curious, as I'm not an RTL or backend expert,

Nick
On-Demand range technology [2/5] - Major Components : How it works
This note will talk about the 4 major components of the prototype and
explain how they work together. I will be fairly light on detail just to
give an overview; we can delve into whatever details are needed.

  - Range-ops : range operations at the statement level
  - GORI - Generates Outgoing Range Info : ranges generated by basic blocks
  - GORI cache : combining and caching basic-block range info
  - Ranger : API and query control

1 * Range-ops

The first major component is the centralized range operation database.
This is where range operations for tree-codes are implemented. The
compiler currently has code which can fold ranges, but the new mechanism
is a general purpose solver which can solve for other operands. If there
are N input/output operands, and we have ranges for N-1, it is often
possible to derive the missing range, e.g.

    lhs = op1 + op2

The existing code in the compiler can solve for LHS given ranges for OP1
and OP2. This has been extended so that we can also sometimes solve for
either op1 or op2, e.g.

    [20,40] = op1 + [5,10]

...can calculate that op1 has the range [10,35] (op1 = lhs - op2,
i.e. [20-10, 40-5]).

This ability to solve for the other operands provides the ability to
calculate ranges in the reverse order we are accustomed to, and is key to
enabling the on-demand range approach.

    a_2 = b_1 - 20
    if (a_2 < 40)

A conditional jump has an implicit boolean LHS depending on which edge is
taken. To evaluate ranges on the TRUE edge of the branch, the LHS is
[1,1]. To calculate the range for a_2 we simply solve the equation:

    [1,1] = a_2 < 40

which provides the answer as [0,39]. Furthermore, substituting this range
for a_2 as the LHS of its definition statement:

    [0,39] = b_1 - 20

the same evaluation mechanism can calculate a result for b_1 on the true
edge as [20,59]. This is the key feature which allows the rest of the
algorithm to work in a general way.

All operations are driven from this component, and the only thing required
to add support for another tree code is to implement one or more methods
for it here, and the rest of the range mechanism will simply work with it.

2 * GORI

The second component is the "Generates Outgoing Range Info" engine. This
is a basic-block oriented component which determines what ssa-names have
ranges created on outgoing edges, and how to calculate those ranges.

The last statement in the block is examined to see if it is a branch which
generates range info, and then determines which ssa-names in the block can
have ranges calculated. It quickly walks the use-def chain from the branch
to find other definitions within the block that can impact the branch and
could also have their ranges calculated. This summary is then cached for
future use.

The GORI component also uses this summary info to perform the calculations
necessary to determine the outgoing range for any ssa_name which can be
determined. For example:

    c_3 = foo ()
    a_2 = b_1 - 20
    if (a_2 < 40)

The summary information would indicate that b_1 and a_2 can have their
outgoing ranges calculated for this block, and uses the cached information
to quickly calculate them when required.

The API consists of 2 basic methods, query and calculate:

  - bool has_edge_range_p (edge, name);
  - range outgoing_edge_range (edge, name);

If a query is made for any other ssa-name, it simply returns false and
says this block does not generate a range for that name. This is a key
rationale for the summary, so we only spend time processing names for
blocks that actually have ranges generated.
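The two solving steps above, written out as a minimal stand-alone sketch
(illustration only; this is not the branch's actual range-ops API):

    #include <cstdint>
    #include <cassert>

    struct irange { int64_t lo, hi; };

    // Solve "[1,1] = (x < c)" for x, assuming an unsigned-style lower
    // bound of 0: on the TRUE edge x must be in [0, c-1].
    irange solve_lt_true (int64_t c) { return { 0, c - 1 }; }

    // Solve "lhs = op1 - c" for op1: op1 = lhs + c.
    irange solve_minus_op1 (irange lhs, int64_t c)
    { return { lhs.lo + c, lhs.hi + c }; }

    int main ()
    {
      irange a_2 = solve_lt_true (40);        // a_2 = [0,39] on TRUE edge
      irange b_1 = solve_minus_op1 (a_2, 20); // [0,39] = b_1 - 20
      assert (b_1.lo == 20 && b_1.hi == 59);  // => b_1 = [20,59]
    }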
3 * GORI cache

The third component builds on the basic block GORI component, and adds the
ability to traverse the CFG and combine the various ranges provided by
each basic block into a cache. It provides both a global-range cache and a
range-on-entry cache summarizing the range of an ssa_name on entry to the
block.

The range-on-entry cache is self filling, and iteratively evaluates back
edges. There is a single API entry point which asks for the range on
entry, and if the data has not been calculated/cached already, it spawns
the following process to calculate the data (a sketch of this process
follows the list):

  * Walk the CFG from the current block to the definition block, and/or
    any block which has range-on-entry information already calculated.
    These end points are the source blocks.
  * Any non-source blocks visited are added to a worklist to be updated.
  * Starting with the source blocks, push the known outgoing range from
    each block to each successor and update their live-on-entry values
    when they are visited. If the incoming range changes, mark this
    block's successors as requiring updating.
  * Continue iterating until no more updates are required.

The ordering is planned by the ranger such that if there are no back
edges, it is typically a straightforward single pass. When back edges are
involved, the amount of iterating is typically very limited.
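The cache-filling process above, as a compact sketch (the data structures
are simplified stand-ins invented here, not the branch's real cache):

    #include <cstdint>
    #include <map>
    #include <queue>
    #include <vector>
    #include <algorithm>

    struct irange { int64_t lo, hi; };

    static irange range_union (irange a, irange b)
    { return { std::min (a.lo, b.lo), std::max (a.hi, b.hi) }; }

    struct block
    {
      std::vector<int> succs;     // successor block ids
      bool has_outgoing = false;  // does GORI generate a range here?
      irange outgoing {0, 0};     // the range pushed on outgoing edges
    };

    // Fill the range-on-entry cache for one ssa-name, starting at its
    // definition block; iterate until no live-on-entry value changes,
    // which handles back edges.
    std::map<int, irange>
    fill_on_entry (const std::vector<block> &cfg, int def_bb)
    {
      std::map<int, irange> on_entry;
      std::queue<int> worklist;
      worklist.push (def_bb);
      while (!worklist.empty ())
        {
          int b = worklist.front ();
          worklist.pop ();
          for (int s : cfg[b].succs)
            {
              // Push the block's own outgoing range, or whatever was
              // live on entry to it (VARYING if nothing is known yet).
              irange push = cfg[b].has_outgoing
                ? cfg[b].outgoing
                : (on_entry.count (b) ? on_entry.at (b)
                                      : irange { INT64_MIN, INT64_MAX });
              auto it = on_entry.find (s);
              irange merged = (it == on_entry.end ())
                ? push : range_union (it->second, push);
              if (it == on_entry.end ()
                  || merged.lo != it->second.lo
                  || merged.hi != it->second.hi)
                {
                  on_entry[s] = merged;  // changed: revisit successors
                  worklist.push (s);
                }
            }
        }
      return on_entry;
    }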
On-Demand range technology [1/5] - Executive Summary
Now that stage 1 has reopened, I'd like to reopen a discussion about the
technology and experiences we have from the Ranger project I brought up
last year: https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html
(The original wiki pages are now out of date, and I will work on updating
them soon.)

The Ranger is designed to evaluate ranges on-demand rather than through a
top-down approach. This means you can ask for a range from anywhere, and
it walks back thru the IL satisfying any preconditions and doing the
required calculations. It utilizes a cache to avoid re-doing work. If
ranges are processed in a forward dominator order, it's not much different
than what we do today. Due to its nature, the order you process things in
has minimal impact on the overall time... You can do it in reverse
dominator order and get similar times. It requires no outside
preconditions (such as dominators) to work, and has a very simple API...
Simply query the range of an ssa_name at any point in the IL and all the
details are taken care of.

We have spent much of the past 6 months refining the prototype (branch
"ssa-range") and adjusting it to share as much code with VRP as possible.
They are currently using a common code base for extracting ranges from
statements, as well as simplifying statements. The Ranger deals with just
ranges. The other aspects of VRP are intended to be follow-on work that
integrates tightly with it, but are also independent and would be
available for other passes to use. These include:

  - Equivalency tracking
  - Relational processing
  - Bitmask tracking

We have implemented a VRP pass that duplicates the functionality of EVRP
(other than the bits mentioned above), as well as converted a few other
passes to use the interface. I do not anticipate those missing bits having
a significant impact on the results.

The prototype branch is quite stable and can successfully build and test
an entire Fedora distribution (9174 packages). There is an issue with
switches I will discuss later whereby the constant range of a switch edge
is not readily available and is exponentially expensive to calculate. We
have a design to address that problem, and in the common case we are about
20% faster than EVRP is.

When utilized in passes which only require ranges for a small number of
ssa-names we see significant improvements. The sprintf warning pass for
instance allows us to remove the calculations of dominators and the
resulting forced walk order. We see a 95% speedup (yes, 1/20th of the
overall time!). This is primarily due to no additional overhead and only
calculating the few things that are actually needed. The walloca and
wrestrict passes are a similar model, but as they have not been converted
to use EVRP ranges yet, we don't see similar speedups there.

That is the executive summary. I will go into more details of each major
thing mentioned in follow-on notes so that comments and discussions can
focus on one thing at a time.

We think this approach is very solid and has many significant benefits to
GCC. We'd like to address any concerns you may have, and work towards
finding a way to integrate this model with the code base during this
stage 1.

Comments and feedback always welcome!
Thanks
Andrew
On-Demand range technology [3/5] - The Prototype
There is a functioning prototype in branch "ssa-range" which is a proof of
concept that the approach is functional as well as quick, and can be used
to answer questions which come up regarding what it can and can't do. Our
last merge was on April 13th, so it's fairly up to date.

We have implemented a flexible range class (irange) which allows for
multiple subranges to represent a range, and which can be extended in the
future to types other than integral. We use this throughout, but it could
be replaced in the ranger with any similar API. Conversion routines are
also provided to convert from irange to value_range and value_range to
irange.

A full set of tree_code range-op routines are implemented. We have
commoned as much code as possible with the existing VRP range extraction
code. Also, we have added additional code for calculating the other
operands from a known result in numerous cases.

The code base in VRP has been modified (via a flag) to:

  - Work completely with the native value_range like it does today.
  - Use irange and the range-ops component under the covers to extract
    ranges. Requests in VRP are then converted from value_ranges to
    irange, called into the range-op routines, and then converted back to
    value_range for VRP/EVRP's use.
  - Do operations both ways and compare the results to make sure both
    agree on the range, and trap if they do not.

The branch defaults to the compare-and-trap mode to ensure everything is
working correctly. This has been our correctness model for statement range
extraction and was active during the Fedora package builds. The only time
we disabled it was to do performance runs vs RVRP, when we were looking at
both branch and trunk times for EVRP and VRP.

Of note, we drop all symbolics in ranges to VARYING on everything except
PLUS and MINUS, which we leave as native calculations if there are
symbolics present. More on symbolics later.

A VRP-like pass called RVRP has been implemented:

  - The vr-values statement simplification code has been factored out to
    be range agnostic, meaning that these routines can operate on either
    value_range or irange. Thus, we are using a common code base to
    perform statement simplification as well.
  - For complete compatibility with EVRP, the RVRP pass builds dominators
    and instantiates the SCEV loop information so we have loop range info
    available. RVRP does not need this info to run, but would miss some
    of the test cases which depend on loop ranges.
  - RVRP is set up to demonstrate it can process the IL in multiple
    directions and bootstraps/passes all tests in all directions:
      * dominator order
      * post-dominator order
      * BB1 thru BBn
      * BBn thru BB1
      * branch-only mode, where only branches at the end of each BB are
        examined for folding opportunities

4 additional passes have been converted to use the ranger model:

  - sprintf - removed the dominator building/walking.
  - warn alloca - replaced calls to get global ranges with calls that now
    return context-sensitive ranges.
  - warn restrict - just replaced EVRP range calls with ranger calls.
  - backwards threader - enhanced to use contextual range information to
    make additional threading decisions.

Symbolic Ranges

One big concern expressed last year was my decision to abolish symbolic
ranges. I continue to maintain that there is no need to track the range of
x_2 as [y_3 + 5, MAX] for

    x_2 = y_3 + 5

All you need to do is look at the definition of x_2, and the same
information is exposed right there in the IL.
If one requires the symbolic information, the same on-demand process could
look up that information as well. This in turn makes the code for ranges
much simpler, easier to maintain, and less likely to introduce bugs.

We have found through our prototype that symbolics in ranges are not
nearly as prevalent as one would think. During the work to share a common
code base with VRP, we found that symbolic ranges are irrelevant for
everything except PLUS_EXPR and MINUS_EXPR. The shared code in our branch
drops symbolics to varying immediately for everything else, and it has no
impact on EVRP, or VRP, or any tests we can find anywhere. Furthermore, we
never trapped while comparing ranges generated by VRP versus generating
them with range-ops, which drops symbolics to varying.

We tried modifying VRP such that we don't even create symbolic endpoints,
but rather revert to VARYING always. We can find no test case that fails
because a range is not calculated properly due to resolving these
endpoints. There are a few that fail due to the symbolic being used to
help track relationals, i.e.

    x_2 = y_3 + 5
    if (x_2 > y_3)   // can be folded since we know x_2 must be > y_3

VRP generates a range for x_2 of [y_3 + 5, MAX] and at various points uses
that to infer a relational or equivalence, i.e. it becomes easy to tell
that the condition must always be true.
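For illustration, a toy version of the multi-subrange irange idea from the
start of this note (the interface is invented here; the branch's actual
class differs):

    #include <cstdint>
    #include <vector>
    #include <utility>
    #include <algorithm>

    // A range is an ordered list of disjoint [lo, hi] subranges, e.g.
    // ~[0,0] for a signed char is { [-128,-1], [1,127] }.
    struct irange
    {
      std::vector<std::pair<int64_t, int64_t>> sub;

      bool contains (int64_t x) const
      {
        for (auto &p : sub)
          if (p.first <= x && x <= p.second)
            return true;
        return false;
      }

      void union_ (std::pair<int64_t, int64_t> r)
      {
        sub.push_back (r);
        std::sort (sub.begin (), sub.end ());
        std::vector<std::pair<int64_t, int64_t>> merged;
        for (auto &p : sub)
          // Merge overlapping or adjacent subranges (overflow of
          // second + 1 at INT64_MAX is ignored in this sketch).
          if (!merged.empty () && p.first <= merged.back ().second + 1)
            merged.back ().second = std::max (merged.back ().second,
                                              p.second);
          else
            merged.push_back (p);
        sub = merged;
      }
    };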
On-Demand range technology [4/5] - Performance results
We have done extensive performance analysis to help address concerns about
the nature of an on-demand model. LLVM made an attempt at something
similar, but suffered from significant performance issues they could not
solve with their approach. This approach is not the same, and we have seen
no sign of pathological cases which are problematic. I have trolled
bugzilla looking for large problematic test cases, and have tried a couple
of test cases from LLVM which dberlin pointed out they found to be
problematic. To date, I have found nothing that shows any kind of
performance problem. I am more than happy to entertain whatever cases
y'all might want to throw my way.

For a test of robustness, we have built a complete Fedora distribution
consisting of 9174 packages with the ranger branch. All packages except 3
build successfully and pass the relevant regression tests. It appears 2 of
them are related to RVRP and are still under analysis.

Our primary performance test base to date has been the compiler itself. We
compile 242 .ii files from a stage1 compiler build, and compare times to
EVRP. VRP is quite a bit slower, and although we do have an iterative
update infrastructure, comparisons aren't quite fair since we don't do all
the equivalence and bitmask operations it does yet.

Before going into the numbers, I would like to visit a minor issue we have
with switches. RVRP works from the bottom up, so in order to evaluate a
range, it begins by getting the constant range for the LHS of the branch
from the edge. For a conditional it is trivially [0,0] or [1,1] depending
on the TRUE or FALSE edge. For a switch, it turns out GCC's gimple
representation has no simple way to figure this out. As a result, we need
to loop over every edge in the switch and then union together all the
cases which share that edge, or in the default case, intersect out all the
other cases. This turns out to be *very* time consuming in test cases with
very large switches, somewhere in the vicinity of O(n^3). Ugg.

So the ranger incurs a fair amount of overhead trying to evaluate, and
re-evaluate, these constant edges. There are ways to address this... but
for now we will present performance numbers for each of the 5 switch
configurations listed here:

1 - Calculating ranges outright from the stock branch.
2 - Timers adjusted to exclude switch edge calculation code (i.e. pretend
    the range is available on the edge like it is with TRUE and FALSE).
3 - Do not process switches. We spend extra time on switches because we
    always attempt to calculate ranges very precisely as if we had
    infinite precision. There is room for trimming outcomes here, but we
    have made no attempt yet.
4 - Just like EVRP, RVRP currently includes building the dominators and
    integrating calls into SCEV at the beginning of each block to see if
    there are any loop range refinements. The ranger has no need for this
    to operate, and many passes will not care. So we produce a 4th number
    for RVRP where we don't build any of the infrastructure it doesn't
    need.
5 - RVRP can also run in conditional-only mode. Rather than walking the
    entire IL trying to resolve every range, it can simply look at the
    last statement of every block asking if the branch can be folded.
    This gets a lot of what vrp gets that affects the CFG, and could be
    utilized at either lower optimization levels, or as VRP if we can
    push all the other activities it does into appropriate optimizations
    (such as making CCP range aware).
**NOTE**: This *still* calculates switch ranges, so it includes that
slowdown.

All times are with a release-configured compiler. Out of the 242 files,
pretty much across the board in all 4 sets of figures, RVRP was faster in
about 90% of the cases, and slower in the other 10%, resulting in the
following cumulative totals.

Overall (242 files)
  1 - Raw RVRP                                 22% slower
  2 - No edge calculation (1)                  4.5% slower
  3 - No switch processing (2)                 9.5% faster
  4 - No dominators (3)                        16% faster
  5 - Conditionals (including switches) only   4% faster

These numbers indicate that large switches are the primary cause when RVRP
is slower. We have various approaches we could use to address this.
Removing the time spent building unnecessary dominators shows a
significant improvement as well.

We also have the results for the time spent in the passes we converted:

Overall (242 files)
  1 - -wrestrict   19% faster
  2 - -wprintf     95% faster
  3 - -walloca     1% slower

-wrestrict has the dominator walk removed since it is no longer needed,
and simply calls into a ranger to get the range. -wprintf has had the
dominator build removed, as well as the EVRP range walk. It really
benefits from very few ranges being requested... so the on-demand approach
is a big win here.
On-Demand range technology [5/5] - Looking to the future.
A primary goal of this approach is to try to pull the various aspects of
VRP apart and make them individually viable so they can be used at
appropriate places as needed. The various components of VRP were
identified as:

  - Ranges
  - Relational queries
  - Equivalencies
  - Bitmask tracking
  - Symbolic range endpoint resolution

This prototype implementation tackles only the range component. It makes
ranges easily accessible from anywhere in the compiler. I have plans for
addressing these components within the same model, but maintaining their
independence. This should make maintenance easier, less error prone, and
ultimately be far more flexible as other passes can utilize whichever
aspects they need.

Symbolic range endpoint resolution
--

I touched on this in the prototype section, and maintain that the only
requirement for symbols falls under the equivalence and relational
tracking:

    x_2 = y_3 + 5
    if (x_2 > y_3)

Ranges themselves in VRP are eventually resolved to a constant for export
to the global range table. At this point, whatever range is known for the
symbolic is substituted into the value_range, and any expression resolved
to come up with the final non-symbolic range:

    x_2 = [y_3 + 5, MAX]

If y_3 evaluates to [20, 30], then x_2 is resolved as [25, MAX]. The
ranger does this by default on the fly due to its nature. When the range
of x_2 is requested the first time, it evaluates y_3, comes up with the
same [20, 30] range for y_3, and evaluates it to [25, MAX] immediately.

The facility is there to reevaluate the range if the range of y_3 changes,
but to this point it has not been needed. Typically it involves y_3
derived in some way from a back edge, and also being derived by yet
another ssa-name from a different back edge. So, not super common.
However, I do plan to get to this eventually to enable those
re-calculations. For the prototype, it has not been deemed critical since
EVRP doesn't even do back edges.

Equivalencies and other Relationals
--

The relationship between ssa names is the primary use of symbols in ranges
today, but the actual property of relations and equivalencies has little
to do with ranges. I propose that we utilize the exact same model as the
range operation database to track relationships. Both equivalencies and
relationals can be combined, as "==" (and "!=") is merely another
relation. Each tree code has a query to ask for the relationship between
any of its operands. I.e.:

    y_2 = x_3
    j_4 = y_2 + 6
    if (j_4 > x_3)

Knowing the ranges of j_4 and x_3 doesn't really help resolve the
condition. If x_3 is varying, or even a non-constant, we know nothing at
all, at least from a range perspective.

Applying the same calculation model the ranger uses from a relational
point of view, range-ops can be given a relational interface in which each
tree code can evaluate the relation between its operands. A copy would
return "==" for the relation between the LHS and op1, so we'd have the
relation y_2 == x_3. Operator plus would look at its operands, and be able
to indicate j_4 > y_2 because operand 2 is a positive constant. The branch
is the one we care about, and a query would be made for the relation
between j_4 and x_3. By combining the relations that feed it, we'd get
j_4 > (y_2 == x_3), and the relational result would be j_4 > x_3. When
applied to (j_4 > x_3) the result is true. So the relational query would
be able to answer the question without ever looking at a range, although
if a range is available, it may help refine the answer.
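The combining step in the j_4 example, sketched as code (the enum and the
combiner are invented here for illustration, not taken from the branch):

    enum relation { REL_NONE, REL_LT, REL_EQ, REL_GT };

    // Transitive combination: given a R1 b and b R2 c, derive a R c.
    relation combine (relation r1, relation r2)
    {
      if (r1 == REL_EQ) return r2;  // a == b, b R2 c  =>  a R2 c
      if (r2 == REL_EQ) return r1;  // a R1 b, b == c  =>  a R1 c
      if (r1 == r2)     return r1;  // < with < stays <; > with > stays >
      return REL_NONE;              // mixed < and > tells us nothing
    }

    // j_4 > y_2 (plus a positive constant) combined with y_2 == x_3
    // (copy) yields j_4 > x_3, which folds the branch (j_4 > x_3) to
    // true:  combine (REL_GT, REL_EQ) == REL_GT.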
The lookup process is identical to the way ranges are currently handled,
which means the same query infrastructure can be leveraged and used
independently or in concert with ranges. This would also benefit from not
carrying information around unless it is requested/required. Currently all
equivalences must be stored in case we need to know if there is an
equivalency. Just like with ranges, this model would have no need to even
look at an equivalency unless there is an actual need to know.

Clearly there is work to do, but this has a lot of potential as a
follow-up to the range work since it uses the same infrastructure. Any
pass could cheaply ask about the equivalence/relation between any 2
ssa_names.

Bitmask tracking
--

Not to sound like a broken record, but the exact same process can also be
applied to bitmasks:

    x_2 = y_1 | 0x01
    if (x_2 == 2)   // can never be true

Bitwise operators, as well as other operators like *, /, shifts, etc., can
calculate bitmasks in exactly the same way ranges are calculated. I also
considered adding them as an element of the range class, but that would
complicate the range class, and I maintain that keeping this all
independent is better from both a maintainability and a correctness
standpoint.
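The bitmask example, sketched in the same solve-by-operator style (the
representation is invented here for illustration):

    #include <cstdint>
    #include <cassert>

    // Track which bits are known: a bit set in 'ones' is known to be 1,
    // a bit set in 'zeros' is known to be 0; unset in both means unknown.
    struct bitmask { uint64_t ones, zeros; };

    // x = y | c : every bit set in c becomes known-1 in x.
    bitmask fold_or_const (bitmask y, uint64_t c)
    { return { y.ones | c, y.zeros & ~c }; }

    // Can x possibly equal the constant v?
    bool maybe_equal (bitmask x, uint64_t v)
    { return (v & x.zeros) == 0 && (~v & x.ones) == 0; }

    int main ()
    {
      bitmask y { 0, 0 };                   // y_1: nothing known
      bitmask x = fold_or_const (y, 0x1);   // x_2 = y_1 | 0x01
      assert (!maybe_equal (x, 2));         // if (x_2 == 2): never true,
      return 0;                             // since bit 0 of x_2 is 1
    }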