Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
(This is a revised patch proposal. I am revising both the description and the code itself.) Even on recent processors, integer division is relatively expensive. The current implementation of std::uniform_int_distribution typically requires two divisions by invocation: // downscaling const __uctype __uerange = __urange + 1; // __urange can be zero const __uctype __scaling = __urngrange / __uerange; const __uctype __past = __uerange * __scaling; do __ret = __uctype(__urng()) - __urngmin; while (__ret >= __past); __ret /= __scaling; We can achieve the same algorithmic result with at most one division, and typically no division at all without requiring more calls to the random number generator. This was recently added to Swift (https://github.com/apple/swift/pull/25286) The main challenge is that we need to be able to compute the "full" product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors. The 128-bit product is not natively supported in C/C++ but can be achieved with the __int128 extension when it is available. The patch checks for __int128 support; when support is lacking, we fallback on the existing approach which uses two divisions per call. For example, if we replace the above code by the following, we get a substantial performance boost on skylake microarchitectures. E.g., it can be twice as fast to shuffle arrays of 1 million elements (e.g., using the followingbenchmark: https://github.com/lemire/simple_cpp_shuffle_benchmark ) unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange); uint64_t __lsb = uint64_t(__product); if(__lsb < __uerange) { uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange); while (__lsb < __threshold) { __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * (unsigned __int128)(__uerange); __lsb = uint64_t(__product); } } __ret = __product >> 64; Included is a patch that would bring better performance (e.g., 2x gain) to std::uniform_int_distribution in some cases. Here are some actual numbers... With this patch: std::shuffle(testvalues, testvalues + size, g) : 7952091 ns total, 7.95 ns per input key Before this patch: std::shuffle(testvalues, testvalues + size, g) : 14954058 ns total, 14.95 ns per input key Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700). Furthermore, the new algorithm is unbiased, so the randomness of the result is not affected. I ran both performance and biases tests with the proposed patch. This patch proposal was improved following feedback by Jonathan Wakely. An earlier version used the __uint128_t type, which is widely supported but not used in the C++ library, instead we now use unsigned __int128. Furthermore, the previous patch was accidentally broken: it was not computing the full product since a rhs cast was missing. These issues are fixed and verified. Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941 Index: libstdc++-v3/include/bits/uniform_int_dist.h === --- libstdc++-v3/include/bits/uniform_int_dist.h(revision 276183) +++ libstdc++-v3/include/bits/uniform_int_dist.h(working copy) @@ -33,7 +33,8 @@ #include #include - +#include +#include namespace std _GLIBCXX_VISIBILITY(default) { _GLIBCXX_BEGIN_NAMESPACE_VERSION @@ -239,18 +240,61 @@ = __uctype(__param.b()) - __uctype(__param.a()); __uctype __ret; - - if (__urngrange > __urange) + if (__urngrange > __urange) { - // downscaling - const __uctype __uerange = __urange + 1; // __urange can be zero - const __uctype __scaling = __urngrange / __uerange; - const __uctype __past = __uerange * __scaling; - do - __ret = __uctype(__urng()) - __urngmin; - while (__ret >= __past); - __ret /= __scaling; - } + const __uctype __uerange = __urange + 1; // __urange can be zero +#if _GLIBCXX_USE_INT128 == 1 +if(sizeof(__uctype) == sizeof(uint64_t) and + (__urngrange == numeric_limits::max())) +{ + // 64-bit case + // reference: Fast Random Integer Generation in an Interval + // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 + // https://arxiv.org/abs/1805.10941 + unsigned __int128 __product = (unsigned __int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange); + uint64_t __lsb = uint64_t(__product); + if(__lsb < __uerange) + { +uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange); +whi
[PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
Even on recent processors, integer division is relatively expensive. The current implementation of std::uniform_int_distribution typically requires two divisions by invocation: // downscaling const __uctype __uerange = __urange + 1; // __urange can be zero const __uctype __scaling = __urngrange / __uerange; const __uctype __past = __uerange * __scaling; do __ret = __uctype(__urng()) - __urngmin; while (__ret >= __past); __ret /= __scaling; We can achieve the same algorithmic result with at most one division, and typically no division at all without requiring more calls to the random number generator. This was recently added to Swift (https://github.com/apple/swift/pull/25286) The main challenge is that we need to be able to compute the "full" product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors. The 128-bit product is not natively supported in C/C++ but can be achieved with the __uint128_t extension which is widely supported by GCC. The patch checks for __uint128_t support; when support is lacking, we fallback on the existing approach. For example, if we replace the above code by the following, we get a substantial performance boost on skylake microarchitectures. E.g., it can be twice as fast to shuffle arrays of 1 million elements (e.g., using the following benchmark: https://github.com/lemire/simple_cpp_shuffle_benchmark ) const __uctype __uerange = __urange + 1; // __urange can be zero uint64_t __product = (__uctype(__urng()) - __urngmin) * __uerange; uint32_t __lsb = uint32_t(__product); if(__lsb < __uerange) { uint64_t __threshold = -__uerange % __uerange; while (__lsb < __threshold) { __product = (__uctype(__urng()) - __urngmin) * __uerange; __lsb = uint32_t(__product); } } Included is a patch that would bring better performance (e.g., 2x gain) to std::uniform_int_distribution in some cases. This patch proposal was previously submitted solely to the libc++ list and improved following feedback by Jonathan Wakely. Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 https://arxiv.org/abs/1805.10941 Index: trunk/libstdc++-v3/include/bits/uniform_int_dist.h === --- trunk/libstdc++-v3/include/bits/uniform_int_dist.h (revision 275324) +++ trunk/libstdc++-v3/include/bits/uniform_int_dist.h (working copy) @@ -33,6 +33,7 @@ #include #include +#include namespace std _GLIBCXX_VISIBILITY(default) { @@ -242,15 +243,59 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION if (__urngrange > __urange) { - // downscaling - const __uctype __uerange = __urange + 1; // __urange can be zero - const __uctype __scaling = __urngrange / __uerange; - const __uctype __past = __uerange * __scaling; - do - __ret = __uctype(__urng()) - __urngmin; - while (__ret >= __past); - __ret /= __scaling; - } + const __uctype __uerange = __urange + 1; // __urange can be zero +#if _GLIBCXX_USE_INT128 == 1 +if(sizeof(__uctype) == sizeof(uint64_t) and + (__urngrange == numeric_limits::max())) +{ + // 64-bit case + // reference: Fast Random Integer Generation in an Interval + // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 + // https://arxiv.org/abs/1805.10941 + __uint128_t __product = (__uctype(__urng()) - __urngmin) * __uerange; + uint64_t __lsb = uint64_t(__product); + if(__lsb < __uerange) + { +uint64_t __threshold = -__uerange % __uerange; +while (__lsb < __threshold) +{ + __product = (__uctype(__urng()) - __urngmin) * __uerange; + __lsb = uint64_t(__product); +} + } + __ret = __product >> 64; +} +else +#endif // _GLIBCXX_USE_INT128 == 1 +if(sizeof(__uctype) == sizeof(uint32_t) + and (__urngrange == numeric_limits::max()) ) +{ + // 32-bit case + // reference: Fast Random Integer Generation in an Interval + // ACM Transactions on Modeling and Computer Simulation 29 (1), 2019 + // https://arxiv.org/abs/1805.10941 + uint64_t __product = (__uctype(__urng()) - __urngmin) * __uerange; + uint32_t __lsb = uint32_t(__product); + if(__lsb < __uerange) { +uint64_t __threshold = -__uerange % __uerange; +while (__lsb < __threshold) { + __product = (__uctype(__urng()) - __urngmin) * __uerange; + __lsb = uint32_t(__product); +} + } + __ret = __product >> 32; +} +else +{ + // fallback case (2 divisions) + const __uctype __scaling = __urngrange / __uerange; + const __uctype __past = __uerange * _
Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
> > >The condition above checks that __urngrange == 2**64-1 which means > >that U::max() - U::min() is the maximum 64-bit value, which means > >means U::max()==2**64-1 and U::min()==0. So if U::min() is 0 we don't > >need to subtract it. > That sounds correct.
Re: [PATCH, libstdc++] Improve the performance of std::uniform_int_distribution (fewer divisions)
The updated patch looks good to me. It is indeed cleaner to have a separate (static) function. It might be nice to add a comment to explain the _S_nd function maybe with a comment like "returns a random value in [0,__range) without any bias" (or something to that effect). Otherwise, it is algorithmically correct. On Mon, Oct 5, 2020 at 7:40 PM Jonathan Wakely wrote: > On 06/10/20 00:25 +0100, Jonathan Wakely wrote: > >I'm sorry it's taken a year to review this properly. Comments below ... > > > >On 27/09/19 14:18 -0400, Daniel Lemire wrote: > >>(This is a revised patch proposal. I am revising both the description > >>and the code itself.) > >> > >>Even on recent processors, integer division is relatively expensive. > >>The current implementation of std::uniform_int_distribution typically > >>requires two divisions by invocation: > >> > >> // downscaling > >>const __uctype __uerange = __urange + 1; // __urange can be zero > >>const __uctype __scaling = __urngrange / __uerange; > >>const __uctype __past = __uerange * __scaling; > >>do > >> __ret = __uctype(__urng()) - __urngmin; > >>while (__ret >= __past); > >>__ret /= __scaling; > >> > >>We can achieve the same algorithmic result with at most one division, > >>and typically no division at all without requiring more calls to the > >>random number generator. > >>This was recently added to Swift ( > https://github.com/apple/swift/pull/25286) > >> > >>The main challenge is that we need to be able to compute the "full" > >>product. E.g., given two 64-bit integers, we want the 128-bit result; > >>given two 32-bit integers we want the 64-bit result. This is fast on > >>common processors. > >>The 128-bit product is not natively supported in C/C++ but can be > >>achieved with the > >>__int128 extension when it is available. The patch checks for > >>__int128 support; when > >>support is lacking, we fallback on the existing approach which uses > >>two divisions per > >>call. > >> > >>For example, if we replace the above code by the following, we get a > substantial > >>performance boost on skylake microarchitectures. E.g., it can > >>be twice as fast to shuffle arrays of 1 million elements (e.g., using > >>the followingbenchmark: > https://github.com/lemire/simple_cpp_shuffle_benchmark ) > >> > >> > >> unsigned __int128 __product = (unsigned > >>__int128)(__uctype(__urng()) - __urngmin) * uint64_t(__uerange); > >> uint64_t __lsb = uint64_t(__product); > >> if(__lsb < __uerange) > >> { > >> uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange); > >> while (__lsb < __threshold) > >> { > >> __product = (unsigned __int128)(__uctype(__urng()) - > >>__urngmin) * (unsigned __int128)(__uerange); > >> __lsb = uint64_t(__product); > >> } > >> } > >> __ret = __product >> 64; > >> > >>Included is a patch that would bring better performance (e.g., 2x gain) > to > >>std::uniform_int_distribution in some cases. Here are some actual > numbers... > >> > >>With this patch: > >> > >>std::shuffle(testvalues, testvalues + size, g) : 7952091 > >>ns total, 7.95 ns per input key > >> > >>Before this patch: > >> > >>std::shuffle(testvalues, testvalues + size, g) : > >>14954058 ns total, 14.95 ns per input key > >> > >> > >>Compiler: GNU GCC 8.3 with -O3, hardware: Skylake (i7-6700). > >> > >>Furthermore, the new algorithm is unbiased, so the randomness of the > >>result is not affected. > >> > >>I ran both performance and biases tests with the proposed patch. > >> > >> > >>This patch proposal was improved following feedback by Jonathan > >>Wakely. An earlier version used the __uint128_t type, which is widely > >>supported but not used in the C++ library, instead we now use unsigned > >>__int128. Furthermore, the previous patch was accidentally broken: it > >>was not computing the full product since a rhs cast was missing. These > >>issues are fixed and verified. > > > >After looking at GCC's internals, it looks like __uint128_t is > >actually fine to use, even though we never cu