[PATCH] Remove unnecessary "& 1" in year_month_day_last::day()

2023-11-05 Thread Cassio Neri
When year_month_day_last::day() was implemented, Dr. Matthias Kretz realised
that the operation "& 1" wasn't necessary but we did not patch it at that
time. This patch removes the unnecessary operation.

libstdc++-v3/ChangeLog:

* include/std/chrono:

diff --git a/libstdc++-v3/include/std/chrono
b/libstdc++-v3/include/std/chrono
index 10e868e5a03..c979a5d05dd 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1800,8 +1800,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   {
  const auto __m = static_cast(month());

- // Excluding February, the last day of month __m is either 30 or 31 or,
- // in another words, it is 30 + b = 30 | b, where b is in {0, 1}.
+ // Assume 1 <= __m <= 12, otherwise month().ok() == false and the result
+ // of day() is unspecified. Excluding February, the last day of month __m
+ // m is either 30 or 31 or, in another words, it is 30 | b, where b is in
+ // {0, 1}.

  // If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if __m is odd.
  // Hence, b = __m & 1 = (__m ^ 0) & 1.
@@ -1812,10 +1814,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  // Therefore, b = (__m ^ c) & 1, where c = 0, if __m < 8, or c = 1 if
  // __m >= 8, that is, c = __m >> 3.

+ // Since 30 = (0)_2 and __m <= 31 = (1)_2, we have:
+ // 30 | ((__m ^ c) & 1) == 30 | (__m ^ c), that is, the "& 1" is
+ // unnecessary.
+
  // The above mathematically justifies this implementation whose
  // performance does not depend on look-up tables being on the L1 cache.
- return chrono::day{__m != 2 ? ((__m ^ (__m >> 3)) & 1) | 30
-: _M_y.is_leap() ? 29 : 28};
+ return chrono::day{__m != 2 ? (__m ^ (__m >> 3)) | 30
+ : _M_y.is_leap() ? 29 : 28};
   }

   constexpr


Re: [PATCH] Remove unnecessary "& 1" in year_month_day_last::day()

2023-11-05 Thread Cassio Neri
I could not find any entry in gcc's bugzilla for that. Perhaps my search
wasn't good enough.


On Sun, 5 Nov 2023 at 15:58, Marc Glisse  wrote:

> On Sun, 5 Nov 2023, Cassio Neri wrote:
>
> > When year_month_day_last::day() was implemented, Dr. Matthias Kretz
> realised
> > that the operation "& 1" wasn't necessary but we did not patch it at that
> > time. This patch removes the unnecessary operation.
>
> Is there an entry in gcc's bugzilla about having the optimizer handle this
> kind of optimization?
>
> unsigned f(unsigned x){
>if(x>=32)__builtin_unreachable();
>return 30|(x&1); // --> 30|x
> }
>
> (that optimization would come in addition to your patch, doing the
> optimization by hand is still a good idea)
>
> It looks like the criterion would be a|(b&c) when the possible 1 bits of b
> are included in the certainly 1 bits of a|c.
>
> --
> Marc Glisse
>


[PATCH] Simplify year::is_leap().

2023-11-05 Thread Cassio Neri
The current implementation returns
(_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
where __is_multiple_of_100 is calculated using an obfuscated algorithm which
saves one ror instruction when compared to _M_y % 100 == 0 [1].

In leap years calculation, it's mathematically correct to replace the
divisibility check by 100 with the one by 25. It turns out that
_M_y % 25 == 0 also saves the ror instruction [2]. Therefore, the
obfuscation is not required.

[1] https://godbolt.org/z/5PaEv6a6b
[2] https://godbolt.org/z/55G8rn77e

libstdc++-v3/ChangeLog:

* include/std/chrono:

diff --git a/libstdc++-v3/include/std/chrono
b/libstdc++-v3/include/std/chrono
index 10e868e5a03..a34b3977d59 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -835,29 +835,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr bool
   is_leap() const noexcept
   {
- // Testing divisibility by 100 first gives better performance, that is,
- // return (_M_y % 100 != 0 || _M_y % 400 == 0) && _M_y % 4 == 0;
-
- // It gets even faster if _M_y is in [-536870800, 536870999]
- // (which is the case here) and _M_y % 100 is replaced by
- // __is_multiple_of_100 below.
+ // Testing divisibility by 100 first gives better performance [1], i.e.,
+ // return y % 100 == 0 ? y % 400 == 0 : y % 16 == 0;
+ // Furthermore, if y % 100 == 0, then y % 400 == 0 is equivalent to
+ // y % 16 == 0, so we can simplify it to
+ // return y % 100 == 0 ? y % 16 == 0 : y % 4 == 0. // #1
+ // Similarly, we can replace 100 with 25 (which is good since y % 25 == 0
+ // requires one fewer instruction than y % 100 == 0 [2]):
+ // return y % 25 == 0 ? y % 16 == 0 : y % 4 == 0. // #2
+ // Indeed, first assume y % 4 != 0. Then y % 16 != 0 and hence, y % 4 == 0
+ // and y % 16 == 0 are both false. Therefore, #2 returns false as it
+ // should (regardless of y % 25.) Now assume y % 4 == 0. In this case,
+ // y % 25 == 0 if, and only if, y % 100 == 0, that is, #1 and #2 are
+ // equivalent. Finally, #2 is equivalent to
+ // return (y & (y % 25 == 0 ? 15 : 3)) == 0.

  // References:
  // [1] https://github.com/cassioneri/calendar
- // [2] https://accu.org/journals/overload/28/155/overload155.pdf#page=16
-
- // Furthermore, if y%100 == 0, then y%400==0 is equivalent to y%16==0,
- // so we can simplify it to (!mult_100 && y % 4 == 0) || y % 16 == 0,
- // which is equivalent to (y & (mult_100 ? 15 : 3)) == 0.
- // See https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
-
- constexpr uint32_t __multiplier   = 42949673;
- constexpr uint32_t __bound= 42949669;
- constexpr uint32_t __max_dividend = 1073741799;
- constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
- const bool __is_multiple_of_100
-  = __multiplier * (_M_y + __offset) < __bound;
- return (_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
+ // [2] https://godbolt.org/z/55G8rn77e
+ // [3] https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
+
+ return (_M_y & (_M_y % 25 == 0 ? 15 : 3)) == 0;
   }

   explicit constexpr


ping: [PATCH v5] libstdc++: Remove UB from month and weekday additions and subtractions.

2023-12-01 Thread Cassio Neri
ping.

On Sat, 25 Nov 2023 at 13:52, Cassio Neri  wrote:

> The following invoke signed integer overflow (UB) [1]:
>
>   month   + months{MAX} // where MAX is the maximum value of months::rep
>   month   + months{MIN} // where MIN is the maximum value of months::rep
>   month   - months{MIN} // where MIN is the minimum value of months::rep
>   weekday + days  {MAX} // where MAX is the maximum value of days::rep
>   weekday - days  {MIN} // where MIN is the minimum value of days::rep
>
> For the additions to MAX, the crux of the problem is that, in libstdc++,
> months::rep and days::rep are int64_t. Other implementations use int32_t,
> cast
> operands to int64_t and perform arithmetic operations without risk of
> overflowing.
>
> For month + months{MIN}, the implementation follows the Standard's "returns
> clause" and evaluates:
>
>modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);
>
> Overflow occurs when MIN - 1 is evaluated. Casting to a larger type could
> help
> but, unfortunately again, this is not possible for libstdc++.
>
> For the subtraction of MIN, the problem is that -MIN is not representable.
>
> It's fair to say that the intention is for these additions/subtractions to
> be performed in modulus (12 or 7) arithmetic so that no overflow is
> expected.
>
> To fix these UB, this patch implements:
>
>   template 
>   unsigned __add_modulo(unsigned __x, _T __y);
>
>   template 
>   unsigned __sub_modulo(unsigned __x, _T __y);
>
> which respectively, returns the remainder of Euclidean division of, __x +
> __y
> and __x - __y by __d without overflowing. These functions replace
>
>   constexpr unsigned __modulo(long long __n, unsigned __d);
>
> which also calculates the reminder of __n, where __n is the result of the
> addition or subtraction. Hence, these operations might invoke UB before
> __modulo
> is called and thus, __modulo can't do anything to remediate the issue.
>
> In addition to solve the UB issues, __add_modulo and __sub_modulo allow
> better
> codegen (shorter and branchless) on x86-64 and ARM [2].
>
> [1] https://godbolt.org/z/a9YfWdn57
> [2] https://godbolt.org/z/Gh36cr7E4
>
> libstdc++-v3/ChangeLog:
>
> * include/std/chrono: Fix + and - for months and weekdays.
> * testsuite/std/time/month/1.cc: Add constexpr tests against
> overflow.
> * testsuite/std/time/month/2.cc: New test for extreme values.
> * testsuite/std/time/weekday/1.cc: Add constexpr tests against
> overflow.
> * testsuite/std/time/weekday/2.cc: New test for extreme values.
> ---
> Good for trunk?
>
> Changes with respect to previous versions:
>   v5: Fix typos in commit message.
>   v4: Remove UB also from operator-.
>   v3: Fix screwed up email send with v2.
>   v2: Replace _T with _Tp and _U with _Up. Remove copyright+license from
> test.
>
>  libstdc++-v3/include/std/chrono  | 79 +---
>  libstdc++-v3/testsuite/std/time/month/1.cc   | 19 +
>  libstdc++-v3/testsuite/std/time/month/2.cc   | 32 
>  libstdc++-v3/testsuite/std/time/weekday/1.cc | 16 +++-
>  libstdc++-v3/testsuite/std/time/weekday/2.cc | 32 
>  5 files changed, 151 insertions(+), 27 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
>  create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc
>
> diff --git a/libstdc++-v3/include/std/chrono
> b/libstdc++-v3/include/std/chrono
> index e4ba6eafceb..1b7cdb96e1c 100644
> --- a/libstdc++-v3/include/std/chrono
> +++ b/libstdc++-v3/include/std/chrono
> @@ -501,18 +501,47 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>
>  namespace __detail
>  {
> -  // Compute the remainder of the Euclidean division of __n divided
> by __d.
> -  // Euclidean division truncates toward negative infinity and always
> -  // produces a remainder in the range of [0,__d-1] (whereas standard
> -  // division truncates toward zero and yields a nonpositive remainder
> -  // for negative __n).
> +  // Helper to __add_modulo and __sub_modulo.
> +  template 
> +  constexpr auto
> +  __modulo_offset()
> +  {
> +   using _Up = make_unsigned_t<_Tp>;
> +   auto constexpr __a = _Up(-1) - _Up(255 + __d - 2);
> +   auto constexpr __b = _Up(__d * (__a / __d) - 1);
> +   // Notice: b <= a - 1 <= _Up(-1) - (255 + d - 1) and b % d = d - 1.
> +   return _Up(-1) - __b; // >= 255 + d - 1
> +  }
> +
> +  // Compute the remainder of the Euclidean division of __x + __y
> divided by
> +  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
> +  

[PING, PATCH v5] libstdc++: Remove UB from month and weekday additions and subtractions.

2023-12-10 Thread Cassio Neri
The following invoke signed integer overflow (UB) [1]:

  month   + months{MAX} // where MAX is the maximum value of months::rep
  month   + months{MIN} // where MIN is the maximum value of months::rep
  month   - months{MIN} // where MIN is the minimum value of months::rep
  weekday + days  {MAX} // where MAX is the maximum value of days::rep
  weekday - days  {MIN} // where MIN is the minimum value of days::rep

For the additions to MAX, the crux of the problem is that, in libstdc++,
months::rep and days::rep are int64_t. Other implementations use int32_t, cast
operands to int64_t and perform arithmetic operations without risk of
overflowing.

For month + months{MIN}, the implementation follows the Standard's "returns
clause" and evaluates:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Overflow occurs when MIN - 1 is evaluated. Casting to a larger type could help
but, unfortunately again, this is not possible for libstdc++.

For the subtraction of MIN, the problem is that -MIN is not representable.

It's fair to say that the intention is for these additions/subtractions to
be performed in modulus (12 or 7) arithmetic so that no overflow is expected.

To fix these UB, this patch implements:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

  template 
  unsigned __sub_modulo(unsigned __x, _T __y);

which respectively, returns the remainder of Euclidean division of, __x + __y
and __x - __y by __d without overflowing. These functions replace

  constexpr unsigned __modulo(long long __n, unsigned __d);

which also calculates the reminder of __n, where __n is the result of the
addition or subtraction. Hence, these operations might invoke UB before __modulo
is called and thus, __modulo can't do anything to remediate the issue.

In addition to solve the UB issues, __add_modulo and __sub_modulo allow better
codegen (shorter and branchless) on x86-64 and ARM [2].

[1] https://godbolt.org/z/a9YfWdn57
[2] https://godbolt.org/z/Gh36cr7E4

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix + and - for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---

Ping.

Good for trunk?

Changes with respect to previous versions:
  v5: Fix typos in commit message.
  v4: Remove UB also from operator-.
  v3: Fix screwed up email send with v2.
  v2: Replace _T with _Tp and _U with _Up. Remove copyright+license from test.

 libstdc++-v3/include/std/chrono  | 79 +---
 libstdc++-v3/testsuite/std/time/month/1.cc   | 19 +
 libstdc++-v3/testsuite/std/time/month/2.cc   | 32 
 libstdc++-v3/testsuite/std/time/weekday/1.cc | 16 +++-
 libstdc++-v3/testsuite/std/time/weekday/2.cc | 32 
 5 files changed, 151 insertions(+), 27 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
 create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index e4ba6eafceb..1b7cdb96e1c 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -501,18 +501,47 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Helper to __add_modulo and __sub_modulo.
+  template 
+  constexpr auto
+  __modulo_offset()
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   auto constexpr __a = _Up(-1) - _Up(255 + __d - 2);
+   auto constexpr __b = _Up(__d * (__a / __d) - 1);
+   // Notice: b <= a - 1 <= _Up(-1) - (255 + d - 1) and b % d = d - 1.
+   return _Up(-1) - __b; // >= 255 + d - 1
+  }
+
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month with a shift in [0, d - 1] and __y is a duration count.
+  template 
+  constexpr unsigned
+  __add_modulo(unsigned __x, _Tp __y)
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   // For __y >= 0, _Up(__y) has the same mathematical value as __y and
+   // this function simply returns (__x + _Up(__y)) % d.  Typically, this
+   // doesn't overflow since the range of _Up contains many more positive
+   // values than _Tp's.  For __y < 0, _Up(__y) has a mathematical value in
+   // the upper-half range of _Up so that adding a positive value to it
+   // might overflow.  Moreover, most likely, _Up(__y

[PATCH v2] Remove unnecessary "& 1" in year_month_day_last::day()

2023-11-10 Thread Cassio Neri
When year_month_day_last::day() was implemented, Dr. Matthias Kretz realised
that the operation "& 1" wasn't necessary but we did not patch it at that
time. This patch removes the unnecessary operation.

libstdc++-v3/ChangeLog:

* include/std/chrono:

---
 libstdc++-v3/include/std/chrono | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono
b/libstdc++-v3/include/std/chrono
index 10e868e5a03..a826982803b 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1800,22 +1800,26 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   {
  const auto __m = static_cast(month());

- // Excluding February, the last day of month __m is either 30 or 31 or,
- // in another words, it is 30 + b = 30 | b, where b is in {0, 1}.
+ // The result is unspecified if __m < 1 or __m > 12.  Hence, assume
+ // 1 <= __m <= 12.  For __m != 2, day() == 30 or day() == 31 or, in
+ // other words, day () == 30 | b, where b is in {0, 1}.

- // If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if __m is odd.
- // Hence, b = __m & 1 = (__m ^ 0) & 1.
+ // If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if, __m is
+ // odd.  Hence, b = __m & 1 = (__m ^ 0) & 1.

- // If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if __m is even.
- // Hence, b = (__m ^ 1) & 1.
+ // If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if, __m is
+ // even.  Hence, b = (__m ^ 1) & 1.

  // Therefore, b = (__m ^ c) & 1, where c = 0, if __m < 8, or c = 1 if
  // __m >= 8, that is, c = __m >> 3.

- // The above mathematically justifies this implementation whose
- // performance does not depend on look-up tables being on the L1 cache.
- return chrono::day{__m != 2 ? ((__m ^ (__m >> 3)) & 1) | 30
-: _M_y.is_leap() ? 29 : 28};
+ // Since 30 = (0)_2 and __m <= 31 = (1)_2, the "& 1" in b's
+ // calculation is unnecessary.
+
+ // The performance of this implementation does not depend on look-up
+ // tables being on the L1 cache.
+ return chrono::day{__m != 2 ? (__m ^ (__m >> 3)) | 30
+  : _M_y.is_leap() ? 29 : 28};
   }

   constexpr


[PATCH v2] Simplify year::is_leap().

2023-11-10 Thread Cassio Neri
The current implementation returns
(_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
where __is_multiple_of_100 is calculated using an obfuscated algorithm which
saves one ror instruction when compared to _M_y % 100 == 0 [1].

In leap years calculation, it's mathematically correct to replace the
divisibility check by 100 with the one by 25. It turns out that
_M_y % 25 == 0 also saves the ror instruction [2]. Therefore, the
obfuscation is not required.

[1] https://godbolt.org/z/5PaEv6a6b
[2] https://godbolt.org/z/55G8rn77e

libstdc++-v3/ChangeLog:

* include/std/chrono:

---
 libstdc++-v3/include/std/chrono | 38 ++
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono
b/libstdc++-v3/include/std/chrono
index 10e868e5a03..5707ed002a2 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -835,29 +835,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr bool
   is_leap() const noexcept
   {
- // Testing divisibility by 100 first gives better performance, that is,
- // return (_M_y % 100 != 0 || _M_y % 400 == 0) && _M_y % 4 == 0;
-
- // It gets even faster if _M_y is in [-536870800, 536870999]
- // (which is the case here) and _M_y % 100 is replaced by
- // __is_multiple_of_100 below.
+ // Testing divisibility by 100 first gives better performance [1], i.e.,
+ // return y % 100 == 0 ? y % 400 == 0 : y % 16 == 0;
+ // Furthermore, if y % 100 == 0, then y % 400 == 0 is equivalent to
+ // y % 16 == 0, so we can simplify it to
+ // return y % 100 == 0 ? y % 16 == 0 : y % 4 == 0.  // #1
+ // Similarly, we can replace 100 with 25 (which is good since
+ // y % 25 == 0 requires one fewer instruction than y % 100 == 0 [2]):
+ // return y % 25 == 0 ? y % 16 == 0 : y % 4 == 0.  // #2
+ // Indeed, first assume y % 4 != 0.  Then y % 16 != 0 and hence,
+ // y % 4 == 0 and y % 16 == 0 are both false.  Therefore, #2 returns
+ // false as it should (regardless of y % 25.) Now assume y % 4 == 0.  In
+ // this case, y % 25 == 0 if, and only if, y % 100 == 0, that is, #1 and
+ // #2 are equivalent.  Finally, #2 is equivalent to
+ // return (y & (y % 25 == 0 ? 15 : 3)) == 0.

  // References:
  // [1] https://github.com/cassioneri/calendar
- // [2] https://accu.org/journals/overload/28/155/overload155.pdf#page=16
-
- // Furthermore, if y%100 == 0, then y%400==0 is equivalent to y%16==0,
- // so we can simplify it to (!mult_100 && y % 4 == 0) || y % 16 == 0,
- // which is equivalent to (y & (mult_100 ? 15 : 3)) == 0.
- // See https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
-
- constexpr uint32_t __multiplier   = 42949673;
- constexpr uint32_t __bound= 42949669;
- constexpr uint32_t __max_dividend = 1073741799;
- constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
- const bool __is_multiple_of_100
-  = __multiplier * (_M_y + __offset) < __bound;
- return (_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
+ // [2] https://godbolt.org/z/55G8rn77e
+ // [3] https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
+
+ return (_M_y & (_M_y % 25 == 0 ? 15 : 3)) == 0;
   }

   explicit constexpr


[PATCH] libstdc++: Remove unnecessary "& 1" from year_month_day_last::day().

2023-11-11 Thread Cassio Neri
When year_month_day_last::day() was implemented, Dr. Matthias Kretz realised
that the operation "& 1" wasn't necessary but we did not patch it at that
time. This patch removes the unnecessary operation.

libstdc++-v3/ChangeLog:

* include/std/chrono: Remove &1 from year_month_day_last::day().
---

Previous versions of this patch failed to apply. I hope it works this time.

OK for trunk?

 libstdc++-v3/include/std/chrono | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10e868e5a03..a826982803b 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1800,22 +1800,26 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   {
 const auto __m = static_cast(month());

-// Excluding February, the last day of month __m is either 30 or 31 or,
-// in another words, it is 30 + b = 30 | b, where b is in {0, 1}.
+// The result is unspecified if __m < 1 or __m > 12.  Hence, assume
+// 1 <= __m <= 12.  For __m != 2, day() == 30 or day() == 31 or, in
+// other words, day () == 30 | b, where b is in {0, 1}.

-// If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if __m is odd.
-// Hence, b = __m & 1 = (__m ^ 0) & 1.
+// If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if, __m is
+// odd.  Hence, b = __m & 1 = (__m ^ 0) & 1.

-// If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if __m is even.
-// Hence, b = (__m ^ 1) & 1.
+// If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if, __m is
+// even.  Hence, b = (__m ^ 1) & 1.

 // Therefore, b = (__m ^ c) & 1, where c = 0, if __m < 8, or c = 1 if
 // __m >= 8, that is, c = __m >> 3.

-// The above mathematically justifies this implementation whose
-// performance does not depend on look-up tables being on the L1 cache.
-return chrono::day{__m != 2 ? ((__m ^ (__m >> 3)) & 1) | 30
-: _M_y.is_leap() ? 29 : 28};
+// Since 30 = (0)_2 and __m <= 31 = (1)_2, the "& 1" in b's
+// calculation is unnecessary.
+
+// The performance of this implementation does not depend on look-up
+// tables being on the L1 cache.
+return chrono::day{__m != 2 ? (__m ^ (__m >> 3)) | 30
+  : _M_y.is_leap() ? 29 : 28};
   }

   constexpr
--
2.41.0


[PATCH v4] libstdc++: Remove unnecessary "& 1" from year_month_day_last::day().

2023-11-11 Thread Cassio Neri
When year_month_day_last::day() was implemented, Dr. Matthias Kretz realised
that the operation "& 1" wasn't necessary but we did not patch it at that
time. This patch removes the unnecessary operation.

libstdc++-v3/ChangeLog:

* include/std/chrono: Remove &1 from year_month_day_last::day().
---
Previous versions of this patch failed to apply. I hope it works this time.

OK for trunk?

 libstdc++-v3/include/std/chrono | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10e868e5a03..a826982803b 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1800,22 +1800,26 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   {
const auto __m = static_cast(month());

-   // Excluding February, the last day of month __m is either 30 or 31 or,
-   // in another words, it is 30 + b = 30 | b, where b is in {0, 1}.
+   // The result is unspecified if __m < 1 or __m > 12.  Hence, assume
+   // 1 <= __m <= 12.  For __m != 2, day() == 30 or day() == 31 or, in
+   // other words, day () == 30 | b, where b is in {0, 1}.

-   // If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if __m is odd.
-   // Hence, b = __m & 1 = (__m ^ 0) & 1.
+   // If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if, __m is
+   // odd.  Hence, b = __m & 1 = (__m ^ 0) & 1.

-   // If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if __m is 
even.
-   // Hence, b = (__m ^ 1) & 1.
+   // If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if, __m is
+   // even.  Hence, b = (__m ^ 1) & 1.

// Therefore, b = (__m ^ c) & 1, where c = 0, if __m < 8, or c = 1 if
// __m >= 8, that is, c = __m >> 3.

-   // The above mathematically justifies this implementation whose
-   // performance does not depend on look-up tables being on the L1 cache.
-   return chrono::day{__m != 2 ? ((__m ^ (__m >> 3)) & 1) | 30
-   : _M_y.is_leap() ? 29 : 28};
+   // Since 30 = (0)_2 and __m <= 31 = (1)_2, the "& 1" in b's
+   // calculation is unnecessary.
+
+   // The performance of this implementation does not depend on look-up
+   // tables being on the L1 cache.
+   return chrono::day{__m != 2 ? (__m ^ (__m >> 3)) | 30
+ : _M_y.is_leap() ? 29 : 28};
   }

   constexpr
--
2.41.0



[PATCH v3] libstdc++: Simplify year::is_leap().

2023-11-11 Thread Cassio Neri
The current implementation returns
(_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
where __is_multiple_of_100 is calculated using an obfuscated algorithm which
saves one ror instruction when compared to _M_y % 100 == 0 [1].

In leap years calculation, it's correct to replace the divisibility check by
100 with the one by 25. It turns out that _M_y % 25 == 0 also saves the ror
instruction [2]. Therefore, the obfuscation is not required.

[1] https://godbolt.org/z/5PaEv6a6b
[2] https://godbolt.org/z/55G8rn77e

libstdc++-v3/ChangeLog:

* include/std/chrono: Clear code.
---

Previous versions of this patch failed to apply. Hope this one works.

Good for trunk?

 libstdc++-v3/include/std/chrono | 40 -
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10e868e5a03..e18f57229e8 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -835,29 +835,29 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr bool
   is_leap() const noexcept
   {
-   // Testing divisibility by 100 first gives better performance, that is,
-   // return (_M_y % 100 != 0 || _M_y % 400 == 0) && _M_y % 4 == 0;
-
-   // It gets even faster if _M_y is in [-536870800, 536870999]
-   // (which is the case here) and _M_y % 100 is replaced by
-   // __is_multiple_of_100 below.
+   // Testing divisibility by 100 first gives better performance [1], i.e.,
+   // return _M_y % 100 == 0 ? _M_y % 400 == 0 : _M_y % 16 == 0;
+   // Furthermore, if _M_y % 100 == 0, then _M_y % 400 == 0 is equivalent
+   // to _M_y % 16 == 0, so we can simplify it to
+   // return _M_y % 100 == 0 ? _M_y % 16 == 0 : _M_y % 4 == 0.  // #1
+   // Similarly, we can replace 100 with 25 (which is good since
+   // _M_y % 25 == 0 requires one fewer instruction than _M_y % 100 == 0
+   // [2]):
+   // return _M_y % 25 == 0 ? _M_y % 16 == 0 : _M_y % 4 == 0.  // #2
+   // Indeed, first assume _M_y % 4 != 0.  Then _M_y % 16 != 0 and hence,
+   // _M_y % 4 == 0 and _M_y % 16 == 0 are both false.  Therefore, #2
+   // returns false as it should (regardless of _M_y % 25.) Now assume
+   // _M_y % 4 == 0.  In this case, _M_y % 25 == 0 if, and only if,
+   // _M_y % 100 == 0, that is, #1 and #2 are equivalent.  Finally, #2 is
+   // equivalent to
+   // return (_M_y & (_M_y % 25 == 0 ? 15 : 3)) == 0.

// References:
// [1] https://github.com/cassioneri/calendar
-   // [2] https://accu.org/journals/overload/28/155/overload155.pdf#page=16
-
-   // Furthermore, if y%100 == 0, then y%400==0 is equivalent to y%16==0,
-   // so we can simplify it to (!mult_100 && y % 4 == 0) || y % 16 == 0,
-   // which is equivalent to (y & (mult_100 ? 15 : 3)) == 0.
-   // See https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
-
-   constexpr uint32_t __multiplier   = 42949673;
-   constexpr uint32_t __bound= 42949669;
-   constexpr uint32_t __max_dividend = 1073741799;
-   constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
-   const bool __is_multiple_of_100
- = __multiplier * (_M_y + __offset) < __bound;
-   return (_M_y & (__is_multiple_of_100 ? 15 : 3)) == 0;
+   // [2] https://godbolt.org/z/55G8rn77e
+   // [3] https://gcc.gnu.org/pipermail/libstdc++/2021-June/052815.html
+
+   return (_M_y & (_M_y % 25 == 0 ? 15 : 3)) == 0;
   }

   explicit constexpr
--
2.41.0



[PATCH] Fix UB in weekday::weekday(sys_days) and add test.

2023-11-11 Thread Cassio Neri
The following has undefined behaviour (signed overflow) [1]:
weekday max{sys_days{days{numeric_limits::max()}}};

The issue is in this line when __n is very large and __n + 4 overflows:
return weekday(__n >= -4 ? (__n + 4) % 7 : (__n + 5) % 7 + 6);

In addition to fixing this bug, the new implementation makes the compiler emit
shorter and branchless code for x86-64 and ARM [2].

[1] https://godbolt.org/z/1s5bv7KfT
[2] https://godbolt.org/z/zKsabzrhs

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix weekday::_S_from_days
* testsuite/std/time/weekday/1.cc: Add test for overflow.
---

Good for trunk?

 libstdc++-v3/include/std/chrono  | 11 +--
 libstdc++-v3/testsuite/std/time/weekday/1.cc |  9 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10e868e5a03..c00dd133173 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -930,8 +930,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   static constexpr weekday
   _S_from_days(const days& __d)
   {
-   auto __n = __d.count();
-   return weekday(__n >= -4 ? (__n + 4) % 7 : (__n + 5) % 7 + 6);
+   using _Rep = days::rep;
+   using _URep = make_unsigned_t<_Rep>;
+   const auto __n = __d.count();
+   const auto __m = static_cast<_URep>(__n);
+
+   // 1970-01-01 (__n =  0, __m = 0) -> Thursday (4)
+   // 1969-31-12 (__n = -1, __m = _URep(-1)) -> Wednesday (3)
+   const auto __offset = __n >= 0 ? _URep(4) : 3 - _URep(-1) % 7 - 7;
+   return weekday((__m + __offset) % 7);
   }

 public:
diff --git a/libstdc++-v3/testsuite/std/time/weekday/1.cc 
b/libstdc++-v3/testsuite/std/time/weekday/1.cc
index 00278c8b01c..e89fca47d4b 100644
--- a/libstdc++-v3/testsuite/std/time/weekday/1.cc
+++ b/libstdc++-v3/testsuite/std/time/weekday/1.cc
@@ -20,6 +20,7 @@
 // Class template day [time.cal.weekday]

 #include 
+#include 

 constexpr void
 constexpr_weekday()
@@ -37,6 +38,14 @@ constexpr_weekday()
   static_assert(weekday{3}[2].weekday() == weekday{3});
   static_assert(weekday{3}[last].weekday() == weekday{3});

+  // Test for UB (overflow).
+  {
+using rep = days::rep;
+using std::numeric_limits;
+constexpr weekday max{sys_days{days{numeric_limits::max()}}};
+constexpr weekday min{sys_days{days{numeric_limits::min()}}};
+  }
+
   static_assert(weekday{sys_days{1900y/January/1}} == Monday);
   static_assert(weekday{sys_days{1970y/January/1}} == Thursday);
   static_assert(weekday{sys_days{2020y/August/21}} == Friday);
--
2.41.0



[PATCH] libstdc++: Improve operator-(weekday x, weekday y).

2023-11-13 Thread Cassio Neri
The current implementation calls __detail::__modulo which is relatively
expensive.

A better implementation is possible if we assume that x.ok() && y.ok() == true,
so that n = x.c_encoding() - y.c_encoding() is in [-6, 6]. In this case, it
suffices to return n >= 0 ? n : n + 7.

The above is allowed by [time.cal.wd.nonmembers]/5: the returned value is
unspecified when x.ok() || y.ok() == false.

The assembly emitted for x86-64 and ARM can be seen in:
https://godbolt.org/z/nMdc5vv9n.

libstdc++-v3/ChangeLog:

* include/std/chrono:
---

OK for trunk?

 libstdc++-v3/include/std/chrono | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10e868e5a03..6131e7e97b3 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1036,8 +1036,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   friend constexpr days
   operator-(const weekday& __x, const weekday& __y) noexcept
   {
-   auto __n = static_cast(__x._M_wd) - __y._M_wd;
-   return days{__detail::__modulo(__n, 7)};
+   const auto __n = __x.c_encoding() - __y.c_encoding();
+   return static_cast(__n) >= 0 ? days{__n} : days{__n + 7};
   }
 };

--
2.41.0



[PATCH] libstdc++: Remove UB from operator+ of months and weekdays.

2023-11-16 Thread Cassio Neri
The following functions invoke signed integer overflow (UB) for some extreme
values of days and months [1]:

  weekday operator+(const weekday& x, const days& y); // #1
  month operator+(const month& x, const months& y);   // #2

For #1, the crux of the problem is that, in libstdc++, days::rep is int64_t.
Other implementations use int32_t and cast operands to int64_t and perform
arithmetic operations without fear of overflowing. For instance, #1 evaluates:

  modulo(static_cast(unsigned{x}._M_wd) + __y.count(), 7);

Sadly, libstdc++ doesn't have this luxury.  For #2, casting to a larger type
could help but all implementations follow the Standard's "Returns clause"
and evaluate:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Hence, overflow occurs when __y.count() is the minimum value of its type.  When
long long is larger than months::rep, this is a fix:

   modulo(static_cast(unsigned{__x}) + 11 + __y.count(), 12);

Again, this is not possible for libstdc++. To fix these UB, this patch
implements:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

which returns the remainder of Euclidean division of __x +__y by __d without
overflowing.  This function replaces

  constexpr unsigned __modulo(long long __n, unsigned __d);

which also calculates the reminder but takes the sum __n as argument at which
point the overflow might have already occurred.

In addition to solve the UB issues, __add_modulo allows shorter branchless code
on x86-64 and ARM [2].

[1] https://godbolt.org/z/WqvosbrvG
[2] https://godbolt.org/z/o63794GEE

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix operator+ for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---

If desirable, I think I'm able to do something similar for operator-(x, y)
(month/weekday x and months/days y) which is specified as:
Returns: x + -y;
All implementations follow the above and -y overflows when y has the minimum
value of its type.

 libstdc++-v3/include/std/chrono  | 61 
 libstdc++-v3/testsuite/std/time/month/1.cc   |  9 +++
 libstdc++-v3/testsuite/std/time/month/2.cc   | 47 +++
 libstdc++-v3/testsuite/std/time/weekday/1.cc |  8 +++
 libstdc++-v3/testsuite/std/time/weekday/2.cc | 47 +++
 5 files changed, 148 insertions(+), 24 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
 create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10bdd1c4ede..02087a9374c 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -497,18 +497,38 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month and an offset in [0, d - 1] and __y is a duration count.
+  // For instance, [time.cal.month.nonmembers] says that given month x and
+  // months y, to get x + y one must calculate:
+  //
+  // modulo(static_cast(unsigned{x}) + (y.count() - 1), 12) + 1.
+  //
+  // Since y.count() is a 64-bits signed value the subtraction y.count() - 
1
+  // or the addition of this value with static_cast(unsigned{x})
+  // might overflow.  This function can be used to avoid this problem:
+  // __add_modulo<12>(unsigned{x} + 11, y.count()) + 1;
+  // (More details in the implementation of operator+(month, months).)
+  template 
   constexpr unsigned
-  __modulo(long long __n, unsigned __d)
-  {
-   if (__n >= 0)
- return __n % __d;
-   else
- return (__d + (__n % __d)) % __d;
+  __add_modulo(unsigned __x, _T __y)
+  {
+   using _U = make_unsigned_t<_T>;
+   // For __y >= 0, _U(__y) has the same mathematical value as __y and this
+   // function simply returns (__x + _U(__y)) % d.  Typically, this doesn't
+   // overflow since the range of _U contains many more positive values
+   // than _T's.  For __y < 0, _U(__y) has a mathematical value in the
+   // upper-half range of _U so that adding a positive value to it might
+   // overflow.  Moreover, most likely, _U(__y) != __y mod d.  To fix both
+   // issues we "subtract" from _U(__y) an __offset >

[PATCH v2] The following functions invoke signed integer overflow (UB) for some extreme values of days and months [1]:

2023-11-17 Thread Cassio Neri
  weekday operator+(const weekday& x, const days& y); // #1
  month operator+(const month& x, const months& y);   // #2

For #1 the problem is that in libstdc++ days::rep is int64_t. Other
implementations use int32_t and cast operands to int64_t. Hence then perform
arithmetic operations without fear of overflowing. For instance, #1 evaluates:

  modulo(static_cast(unsigned{x}._M_wd) + __y.count(), 7);

For x86-64, long long is int64 so the cast is useless.  For #2, casting to a
larger type could help but all implementations follow the Standard's "Returns
clause" and evaluate:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Hence, overflow occurs when __y.count() is the minimum value of its type.  When
long long is larger than months::rep, this is a fix:

   modulo(static_cast(unsigned{__x}) + 11 + __y.count(), 12);

Again, this is not possible for libstdc++.  The fix uses this new function:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

which returns the remainder of Euclidean division of __x +__y by __d without
overflowing. This function replaces

  constexpr unsigned __modulo(long long __n, unsigned __d);

In addition to solve the UB issues, __add_modulo allows shorter branchless code
on x86-64 and ARM [2].

[1] https://godbolt.org/z/WqvosbrvG
[2] https://godbolt.org/z/o63794GEE

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix operator+ for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---
 libstdc++-v3/include/std/chrono  | 61 
 libstdc++-v3/testsuite/std/time/month/1.cc   |  9 +++
 libstdc++-v3/testsuite/std/time/weekday/1.cc |  8 +++
 3 files changed, 54 insertions(+), 24 deletions(-)

 Changes with respect to previous versions:
 v2: Replaced _T with _Tp and _U with _Up. Removed copyright+license from test.

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10bdd1c4ede..691bb106bb9 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -497,18 +497,38 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month and an offset in [0, d - 1] and __y is a duration count.
+  // For instance, [time.cal.month.nonmembers] says that given month x and
+  // months y, to get x + y one must calculate:
+  //
+  // modulo(static_cast(unsigned{x}) + (y.count() - 1), 12) + 1.
+  //
+  // Since y.count() is a 64-bits signed value the subtraction y.count() - 
1
+  // or the addition of this value with static_cast(unsigned{x})
+  // might overflow.  This function can be used to avoid this problem:
+  // __add_modulo<12>(unsigned{x} + 11, y.count()) + 1;
+  // (More details in the implementation of operator+(month, months).)
+  template 
   constexpr unsigned
-  __modulo(long long __n, unsigned __d)
-  {
-   if (__n >= 0)
- return __n % __d;
-   else
- return (__d + (__n % __d)) % __d;
+  __add_modulo(unsigned __x, _Tp __y)
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   // For __y >= 0, _Up(__y) has the same mathematical value as __y and
+   // this function simply returns (__x + _Up(__y)) % d.  Typically, this
+   // doesn't overflow since the range of _Up contains many more positive
+   // values than _Tp's.  For __y < 0, _Up(__y) has a mathematical value in
+   // the upper-half range of _Up so that adding a positive value to it
+   // might overflow.  Moreover, most likely, _Up(__y) != __y mod d.  To
+   // fix both issues we from _Up(__y)"subtract"  an __offset >=
+   // 255 + d - 1 to make room for the addition to __x and shift the modulo
+   // to the correct value.
+   auto constexpr __a = _Up(-1) - _Up(255 + __d - 2);
+   auto constexpr __b = _Up(__d * (__a / __d) - 1);
+   // Notice: b <= a - 1 <= _Up(-1) - _Up(255 + d - 1) and b % d = d - 1.
+   auto const __offset = __y >= 0 ? _Up(0) : __b - _Up(-1);
+   return (__x + _Up(__y) + __offset) % __d;
   }

   inline constexpr unsigned __days_per_month[12]
@@ -700,8 +720,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   friend constexpr month
   operator+(const month& __x, const mont

[PATCH v3] libstdc++: Remove UB from operator+ of months and weekdays.

2023-11-17 Thread Cassio Neri
The following functions invoke signed integer overflow (UB) for some extreme
values of days and months [1]:

  weekday operator+(const weekday& x, const days& y); // #1
  month operator+(const month& x, const months& y);   // #2

For #1 the problem is that in libstdc++ days::rep is int64_t. Other
implementations use int32_t and cast operands to int64_t. Hence then perform
arithmetic operations without fear of overflowing. For instance, #1 evaluates:

  modulo(static_cast(unsigned{x}._M_wd) + __y.count(), 7);

For x86-64, long long is int64 so the cast is useless.  For #2, casting to a
larger type could help but all implementations follow the Standard's "Returns
clause" and evaluate:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Hence, overflow occurs when __y.count() is the minimum value of its type.  When
long long is larger than months::rep, this is a fix:

   modulo(static_cast(unsigned{__x}) + 11 + __y.count(), 12);

Again, this is not possible for libstdc++.  The fix uses this new function:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

which returns the remainder of Euclidean division of __x +__y by __d without
overflowing. This function replaces

  constexpr unsigned __modulo(long long __n, unsigned __d);

In addition to solve the UB issues, __add_modulo allows shorter branchless code
on x86-64 and ARM [2].

[1] https://godbolt.org/z/WqvosbrvG
[2] https://godbolt.org/z/o63794GEE

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix operator+ for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---

Changes with respect to previous versions:
 v3: Fix screwed up email send with v2. (Sorry about that. I shall learn at
 some point.)
 v2: Replaced _T with _Tp and _U with _Up. Removed copyright+license from test.

 libstdc++-v3/include/std/chrono  | 61 
 libstdc++-v3/testsuite/std/time/month/1.cc   |  9 +++
 libstdc++-v3/testsuite/std/time/month/2.cc   | 30 ++
 libstdc++-v3/testsuite/std/time/weekday/1.cc |  8 +++
 libstdc++-v3/testsuite/std/time/weekday/2.cc | 30 ++
 5 files changed, 114 insertions(+), 24 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
 create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 10bdd1c4ede..691bb106bb9 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -497,18 +497,38 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month and an offset in [0, d - 1] and __y is a duration count.
+  // For instance, [time.cal.month.nonmembers] says that given month x and
+  // months y, to get x + y one must calculate:
+  //
+  // modulo(static_cast(unsigned{x}) + (y.count() - 1), 12) + 1.
+  //
+  // Since y.count() is a 64-bits signed value the subtraction y.count() - 
1
+  // or the addition of this value with static_cast(unsigned{x})
+  // might overflow.  This function can be used to avoid this problem:
+  // __add_modulo<12>(unsigned{x} + 11, y.count()) + 1;
+  // (More details in the implementation of operator+(month, months).)
+  template 
   constexpr unsigned
-  __modulo(long long __n, unsigned __d)
-  {
-   if (__n >= 0)
- return __n % __d;
-   else
- return (__d + (__n % __d)) % __d;
+  __add_modulo(unsigned __x, _Tp __y)
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   // For __y >= 0, _Up(__y) has the same mathematical value as __y and
+   // this function simply returns (__x + _Up(__y)) % d.  Typically, this
+   // doesn't overflow since the range of _Up contains many more positive
+   // values than _Tp's.  For __y < 0, _Up(__y) has a mathematical value in
+   // the upper-half range of _Up so that adding a positive value to it
+   // might overflow.  Moreover, most likely, _Up(__y) != __y mod d.  To
+   // fix both issues we from _Up(__y)"subtract"  an __offset >=
+   // 255 + d - 1 to make room for the addition to __x and shift the modulo
+   // to the correct value.
+   auto constexpr __a = _Up(-1) - _Up(255 

Re: [PATCH v3] libstdc++: Remove UB from operator+ of months and weekdays.

2023-11-18 Thread Cassio Neri
Actually, disregard this patch. I'm preparing a better one which also
tackles UB in

month - months{INT_MIN}
weekday - days{INT_MIN}

Best wishes,
Cassio.

On Sat, 18 Nov 2023, 00:19 Cassio Neri,  wrote:

> The following functions invoke signed integer overflow (UB) for some
> extreme
> values of days and months [1]:
>
>   weekday operator+(const weekday& x, const days& y); // #1
>   month operator+(const month& x, const months& y);   // #2
>
> For #1 the problem is that in libstdc++ days::rep is int64_t. Other
> implementations use int32_t and cast operands to int64_t. Hence then
> perform
> arithmetic operations without fear of overflowing. For instance, #1
> evaluates:
>
>   modulo(static_cast(unsigned{x}._M_wd) + __y.count(), 7);
>
> For x86-64, long long is int64 so the cast is useless.  For #2, casting to
> a
> larger type could help but all implementations follow the Standard's
> "Returns
> clause" and evaluate:
>
>modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);
>
> Hence, overflow occurs when __y.count() is the minimum value of its type.
> When
> long long is larger than months::rep, this is a fix:
>
>modulo(static_cast(unsigned{__x}) + 11 + __y.count(), 12);
>
> Again, this is not possible for libstdc++.  The fix uses this new function:
>
>   template 
>   unsigned __add_modulo(unsigned __x, _T __y);
>
> which returns the remainder of Euclidean division of __x +__y by __d
> without
> overflowing. This function replaces
>
>   constexpr unsigned __modulo(long long __n, unsigned __d);
>
> In addition to solve the UB issues, __add_modulo allows shorter branchless
> code
> on x86-64 and ARM [2].
>
> [1] https://godbolt.org/z/WqvosbrvG
> [2] https://godbolt.org/z/o63794GEE
>
> libstdc++-v3/ChangeLog:
>
> * include/std/chrono: Fix operator+ for months and weekdays.
> * testsuite/std/time/month/1.cc: Add constexpr tests against
> overflow.
> * testsuite/std/time/month/2.cc: New test for extreme values.
> * testsuite/std/time/weekday/1.cc: Add constexpr tests against
> overflow.
> * testsuite/std/time/weekday/2.cc: New test for extreme values.
> ---
>
> Changes with respect to previous versions:
>  v3: Fix screwed up email send with v2. (Sorry about that. I shall learn at
>  some point.)
>  v2: Replaced _T with _Tp and _U with _Up. Removed copyright+license from
> test.
>
>  libstdc++-v3/include/std/chrono  | 61 
>  libstdc++-v3/testsuite/std/time/month/1.cc   |  9 +++
>  libstdc++-v3/testsuite/std/time/month/2.cc   | 30 ++
>  libstdc++-v3/testsuite/std/time/weekday/1.cc |  8 +++
>  libstdc++-v3/testsuite/std/time/weekday/2.cc | 30 ++
>  5 files changed, 114 insertions(+), 24 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
>  create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc
>
> diff --git a/libstdc++-v3/include/std/chrono
> b/libstdc++-v3/include/std/chrono
> index 10bdd1c4ede..691bb106bb9 100644
> --- a/libstdc++-v3/include/std/chrono
> +++ b/libstdc++-v3/include/std/chrono
> @@ -497,18 +497,38 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>
>  namespace __detail
>  {
> -  // Compute the remainder of the Euclidean division of __n divided
> by __d.
> -  // Euclidean division truncates toward negative infinity and always
> -  // produces a remainder in the range of [0,__d-1] (whereas standard
> -  // division truncates toward zero and yields a nonpositive remainder
> -  // for negative __n).
> +  // Compute the remainder of the Euclidean division of __x + __y
> divided by
> +  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
> +  // weekday/month and an offset in [0, d - 1] and __y is a duration
> count.
> +  // For instance, [time.cal.month.nonmembers] says that given month
> x and
> +  // months y, to get x + y one must calculate:
> +  //
> +  // modulo(static_cast(unsigned{x}) + (y.count() - 1),
> 12) + 1.
> +  //
> +  // Since y.count() is a 64-bits signed value the subtraction
> y.count() - 1
> +  // or the addition of this value with static_cast long>(unsigned{x})
> +  // might overflow.  This function can be used to avoid this problem:
> +  // __add_modulo<12>(unsigned{x} + 11, y.count()) + 1;
> +  // (More details in the implementation of operator+(month, months).)
> +  template 
>constexpr unsigned
> -  __modulo(long long __n, unsigned __d)
> -  {
> -   if (__n >= 0)
> - return __n % __d;
> -   else
> - return (__d + (__n

[PATCH v4] libstdc++: Remove UB from month and weekday additions and subtractions.

2023-11-20 Thread Cassio Neri
The following invoke signed integer overflow (UB) [1]:

  month   + months{MAX} // where MAX is the maximum value of months::rep
  month   + months{MIN} // where MIN is the maximum value of months::rep
  month   - months{MIN} // where MIN is the minimum value of months::rep
  weekday + days  {MAX} // where MAX is the maximum value of days::rep
  weekday - days  {MIN} // where MIN is the minimum value of days::rep

For the additions to MAX, the crux of the problem is that, in libstdc++,
months::rep and days::rep are int64_t. Other implementations use int32_t, cast
operands to int64_t and perform arithmetic operations without risk of
overflowing.

For month + months{MIN}, the implementations follows the Standard's "returns
clause" and evaluates:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Overflow occurs when MIN - 1 is evaluated. Casting to a larger type could help
but, unfortunately again, this is not possible for libstdc++.

For the subtraction of MIN, the problem is that -MIN is not representable.

It's fair to say that the intentions are for these additions/subtractions to
be in modulus (12 or 7) arithmetic so that no overflow is expected.

To fix these UB, this patch implements:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

  template 
  unsigned __sub_modulo(unsigned __x, _T __y);

which respectively, returns the remainder of Euclidean division of, __x + __y
and __x - __y by __d without overflowing. These functions replace

  constexpr unsigned __modulo(long long __n, unsigned __d);

which also calculates the reminder of __n, where __n is the result of the
addition or subtraction. Hence, these operations might invoke UB before __modulo
is called and thus, __modulo can't do anything to remediate the issue.

In addition to solve the UB issues, __add_modulo and __sub_modulo allows better
codegen (shorter and branchless) on x86-64 and ARM [2].

[1] https://godbolt.org/z/a9YfWdn57
[2] https://godbolt.org/z/Gh36cr7E4

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix + and - for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---

Changes with respect to previous versions:
  v4: Remove UB also from operator-.
  v3: Fix screwed up email send with v2.
  v2: Replace _T with _Tp and _U with _Up. Remove copyright+license from test.

 libstdc++-v3/include/std/chrono  | 79 +---
 libstdc++-v3/testsuite/std/time/month/1.cc   | 19 +
 libstdc++-v3/testsuite/std/time/month/2.cc   | 32 
 libstdc++-v3/testsuite/std/time/weekday/1.cc | 16 +++-
 libstdc++-v3/testsuite/std/time/weekday/2.cc | 32 
 5 files changed, 151 insertions(+), 27 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
 create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index e4ba6eafceb..1b7cdb96e1c 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -501,18 +501,47 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Helper to __add_modulo and __sub_modulo.
+  template 
+  constexpr auto
+  __modulo_offset()
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   auto constexpr __a = _Up(-1) - _Up(255 + __d - 2);
+   auto constexpr __b = _Up(__d * (__a / __d) - 1);
+   // Notice: b <= a - 1 <= _Up(-1) - (255 + d - 1) and b % d = d - 1.
+   return _Up(-1) - __b; // >= 255 + d - 1
+  }
+
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month with a shift in [0, d - 1] and __y is a duration count.
+  template 
+  constexpr unsigned
+  __add_modulo(unsigned __x, _Tp __y)
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   // For __y >= 0, _Up(__y) has the same mathematical value as __y and
+   // this function simply returns (__x + _Up(__y)) % d.  Typically, this
+   // doesn't overflow since the range of _Up contains many more positive
+   // values than _Tp's.  For __y < 0, _Up(__y) has a mathematical value in
+   // the upper-half range of _Up so that adding a positive value to it
+   // might overflow.  Moreover, most likely, _Up(__y) != __y mod d.  To
+   // fix both issues we subtract from _

[PATCH v5] libstdc++: Remove UB from month and weekday additions and subtractions.

2023-11-25 Thread Cassio Neri
The following invoke signed integer overflow (UB) [1]:

  month   + months{MAX} // where MAX is the maximum value of months::rep
  month   + months{MIN} // where MIN is the maximum value of months::rep
  month   - months{MIN} // where MIN is the minimum value of months::rep
  weekday + days  {MAX} // where MAX is the maximum value of days::rep
  weekday - days  {MIN} // where MIN is the minimum value of days::rep

For the additions to MAX, the crux of the problem is that, in libstdc++,
months::rep and days::rep are int64_t. Other implementations use int32_t, cast
operands to int64_t and perform arithmetic operations without risk of
overflowing.

For month + months{MIN}, the implementation follows the Standard's "returns
clause" and evaluates:

   modulo(static_cast(unsigned{__x}) + (__y.count() - 1), 12);

Overflow occurs when MIN - 1 is evaluated. Casting to a larger type could help
but, unfortunately again, this is not possible for libstdc++.

For the subtraction of MIN, the problem is that -MIN is not representable.

It's fair to say that the intention is for these additions/subtractions to
be performed in modulus (12 or 7) arithmetic so that no overflow is expected.

To fix these UB, this patch implements:

  template 
  unsigned __add_modulo(unsigned __x, _T __y);

  template 
  unsigned __sub_modulo(unsigned __x, _T __y);

which respectively, returns the remainder of Euclidean division of, __x + __y
and __x - __y by __d without overflowing. These functions replace

  constexpr unsigned __modulo(long long __n, unsigned __d);

which also calculates the reminder of __n, where __n is the result of the
addition or subtraction. Hence, these operations might invoke UB before __modulo
is called and thus, __modulo can't do anything to remediate the issue.

In addition to solve the UB issues, __add_modulo and __sub_modulo allow better
codegen (shorter and branchless) on x86-64 and ARM [2].

[1] https://godbolt.org/z/a9YfWdn57
[2] https://godbolt.org/z/Gh36cr7E4

libstdc++-v3/ChangeLog:

* include/std/chrono: Fix + and - for months and weekdays.
* testsuite/std/time/month/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/month/2.cc: New test for extreme values.
* testsuite/std/time/weekday/1.cc: Add constexpr tests against overflow.
* testsuite/std/time/weekday/2.cc: New test for extreme values.
---
Good for trunk?

Changes with respect to previous versions:
  v5: Fix typos in commit message.
  v4: Remove UB also from operator-.
  v3: Fix screwed up email send with v2.
  v2: Replace _T with _Tp and _U with _Up. Remove copyright+license from test.

 libstdc++-v3/include/std/chrono  | 79 +---
 libstdc++-v3/testsuite/std/time/month/1.cc   | 19 +
 libstdc++-v3/testsuite/std/time/month/2.cc   | 32 
 libstdc++-v3/testsuite/std/time/weekday/1.cc | 16 +++-
 libstdc++-v3/testsuite/std/time/weekday/2.cc | 32 
 5 files changed, 151 insertions(+), 27 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/month/2.cc
 create mode 100644 libstdc++-v3/testsuite/std/time/weekday/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index e4ba6eafceb..1b7cdb96e1c 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -501,18 +501,47 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

 namespace __detail
 {
-  // Compute the remainder of the Euclidean division of __n divided by __d.
-  // Euclidean division truncates toward negative infinity and always
-  // produces a remainder in the range of [0,__d-1] (whereas standard
-  // division truncates toward zero and yields a nonpositive remainder
-  // for negative __n).
+  // Helper to __add_modulo and __sub_modulo.
+  template 
+  constexpr auto
+  __modulo_offset()
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   auto constexpr __a = _Up(-1) - _Up(255 + __d - 2);
+   auto constexpr __b = _Up(__d * (__a / __d) - 1);
+   // Notice: b <= a - 1 <= _Up(-1) - (255 + d - 1) and b % d = d - 1.
+   return _Up(-1) - __b; // >= 255 + d - 1
+  }
+
+  // Compute the remainder of the Euclidean division of __x + __y divided 
by
+  // __d without overflowing.  Typically, __x <= 255 + d - 1 is sum of
+  // weekday/month with a shift in [0, d - 1] and __y is a duration count.
+  template 
+  constexpr unsigned
+  __add_modulo(unsigned __x, _Tp __y)
+  {
+   using _Up = make_unsigned_t<_Tp>;
+   // For __y >= 0, _Up(__y) has the same mathematical value as __y and
+   // this function simply returns (__x + _Up(__y)) % d.  Typically, this
+   // doesn't overflow since the range of _Up contains many more positive
+   // values than _Tp's.  For __y < 0, _Up(__y) has a mathematical value in
+   // the upper-half range of _Up so that adding a positive value to it
+   // might overflow.  Moreover, most likely, _Up(__y) != __y

[RFC 0/1] libstdc++: Replace Ryu with Teju Jagua.

2024-08-04 Thread Cassio Neri
Dear libstdc++ developers,

I developed a novel algorithm called Teju Jagua which finds the shortest
representation of a floating-point number. It can be used by std::to_chars
which currently uses Ryu.

IMHO, Teju Jagua is much simpler than Ryu and, according to my measurements,
for doubles, Teju Jagua is twice faster than Ryu. Similarly, it's simpler
and ~27% faster than Dragonbox which, to the best of my knowledge, was the
fastest algorithm for this task until now.

Teju Jagua can bring another benefit. All the aforementioned algorithms use
look-up tables that often raise memory footprint concerns. At least in their
reference implementations, one or more tables is provided for each
floating-point type. I don't know about the others but for Teju Jagua the
table for a type can be also used for any smaller type. Hence, it just
needs to provide the table for the largest type (e.g., float128_t) of the
platform.

I already started an implementation in libstdc++ and will send a subsequent
RFC to gather feedback and to give you a better understanding of this work.
The first RFC will replace Ryu only for floats. After this is reviewed and
if there's still interest from others, I will send subsequent patches to
replace Ryu for each one of the other floating-point types. In the end I'll
send a patch to remove Ryu. (Note: This concerns only regular Ryu, which
finds the shortest representation, Ryu-printf that produces the result for a
given precision is out of the scope of this work and will remain.)

A reference C implementation (it's a WIP which lacks documentation) is found
in [1].

I presented this algorithm in C++ Now 2024, C++ on Sea 2024 and I'll show
it again at CppCon 2024. If any libstdc++ developer plans to attend, I'd be
more than happy to discuss this work in person at the conference. The video
recording of my C++ Now 2024 talk is on YouTube [2].

[1] https://github.com/cassioneri/teju_jagua:
[2] https://www.youtube.com/watch?v=w0WrRdW7eqg


[RFC] libstdc++: Replace Ryu with Teju Jagua for float.

2024-08-05 Thread Cassio Neri
Implement the template function teju_jagua which finds the shortest
representation of a floating-point number. The floating-point type is a
template parameter and the implementation is generic enough to handle all
floating-point types of interest, namely, IEEE 754, std::bfloat16_t,
x86 80-bit and IBM128.

The function uses a few helpers and traits classes to handle the
different floating-point types. This patch only provides those required
for float and shows the pattern to be followed by subsequent patches
that will handle the remaining floating-point types.

Add new static data to the existent struct floating_type_traits
which are required by teju_jagua. Some of these duplicate information
that is already there and used by Ryu. A subsequent clean-up patch will
remove duplications once the shortest representation for all types is
found by teju_jagua.

libstdc++-v3/ChangeLog:

* src/c++17/floating_to_chars.cc: implement teju_jagua and
helpers (for float).
* src/c++17/floating_to_chars_tables.h: tables with constants
required at runtime.
---
 libstdc++-v3/src/c++17/floating_to_chars.cc   | 321 +++-
 .../src/c++17/floating_to_chars_tables.h  | 716 ++
 2 files changed, 1027 insertions(+), 10 deletions(-)
 create mode 100644 libstdc++-v3/src/c++17/floating_to_chars_tables.h

diff --git a/libstdc++-v3/src/c++17/floating_to_chars.cc 
b/libstdc++-v3/src/c++17/floating_to_chars.cc
index 2c9da977cd9..b6adbf9cf5f 100644
--- a/libstdc++-v3/src/c++17/floating_to_chars.cc
+++ b/libstdc++-v3/src/c++17/floating_to_chars.cc
@@ -96,6 +96,8 @@ using F128_type = void;

 #include 

+#include "floating_to_chars_tables.h"
+
 namespace
 {
 #if defined __SIZEOF_INT128__
@@ -132,6 +134,26 @@ namespace
 { return generic128::generic_to_chars(v, result); }
   } // namespace ryu

+  // Binary or decimal field representation of a positive floating-point 
number.
+  // The base is implicitly given by the context and the represented number is:
+  //   mantissa * base ^ exponent.
+  template 
+  struct fields {
+Mantissa mantissa;
+int32_t  exponent;
+  };
+
+  // Binary or decimal field representation of a floating-point number.
+  // The base is implicitly given by the context and the represented number is:
+  //   (sign ? -1 : 1) * mantissa * base ^ exponent.
+  template 
+  struct signed_fields;
+
+  // TODO (CN): Remove.
+  template <>
+struct signed_fields : ryu::floating_decimal_32
+{ };
+
   // A traits class that contains pertinent information about the binary
   // format of each of the floating-point types we support.
   template
@@ -149,6 +171,12 @@ namespace

   static constexpr uint64_t pow10_adjustment_tab[]
= { 0b00011101011100110101100101101110 
};
+
+  using fields_t = signed_fields;
+
+  static constexpr uint32_t mantissa_size=   23;
+  static constexpr uint32_t exponent_size=8;
+  static constexpr int32_t  exponent_minimum = -149;
 };

   template<>
@@ -430,6 +458,276 @@ namespace
   bool sign;
 };

+  // Return 2^k.
+  template 
+constexpr T
+pow2(const uint32_t k)
+{
+  return T(1) << k;
+}
+
+  // Return the k least significant bits of n (i.e., e % 2^k.)
+  template 
+T
+lsb(const T n, const uint32_t k)
+{
+  return n % pow2(k);
+}
+
+  // Return the binary field representation of an IEEE Float value.
+  template >
+typename Traits::fields_t
+to_binary_ieee(const Float value)
+{
+  using mantissa_t = typename Traits::mantissa_t;
+
+  mantissa_t bits;
+  memcpy(&bits, &value, sizeof(value));
+
+  const mantissa_t mantissa_bits = lsb(bits, Traits::mantissa_size);
+  bits >>= Traits::mantissa_size;
+
+  const int32_t exponent_bits = lsb(bits, Traits::exponent_size);
+  bits >>= Traits::exponent_size;
+
+  const bool negative = bits != 0;
+
+  int32_texponent = int32_t(exponent_bits) + Traits::exponent_minimum;
+  mantissa_t mantissa = mantissa_bits;
+
+  if (exponent_bits != 0)
+  {
+   // Number is normal.
+   exponent -= 1;
+   mantissa += pow2(Traits::mantissa_size);
+  }
+
+  return { mantissa, exponent, negative } ;
+}
+
+  // Return the binary field representation of a float value.
+  typename floating_type_traits::fields_t
+  to_binary(const float value)
+  {
+return to_binary_ieee(value);
+  }
+
+  // Check whether m is multiple of 2^k.
+  template 
+bool
+is_multiple_of_pow2(const Mantissa m, const uint32_t k)
+{
+  return lsb(m, k) == 0;
+}
+
+  // Check whether the binary field representation of a number is that of a
+  // small integer.
+  template 
+bool
+is_small_integer(const Mantissa m, const int32_t e)
+{
+  return -static_cast(Traits::mantissa_size) <= e && e <= 0 &&
+   is_multiple_of_pow2(m, -e);
+}
+
+  // Return the quotient n / 10.
+  uint32_t
+  div1

Re: [RFC] libstdc++: Replace Ryu with Teju Jagua for float.

2024-08-07 Thread Cassio Neri
On top of what Jonathan said, Teju Jagua yields a smaller binary size since
it needs fewer look-up tables.

Regarding performance and correctness:

1) My patch passes all the tests.

2) The GitHub project of the reference implementation (*) scans all floats
and the results of Teju Jagua, Ryu and Dragonbox match.

3) The project (*) benchmarks ran on my PC show performance improvements.
For doubles, Teju Jagua 98% faster than Ryu. For floats the speed up is
about 50%.

I'll provide easier to run benchmark programs for, potentially, being added
to gcc's repo.

4) I have notes proving the mathematical correctness of the algorithm which
I will turn into a paper to be uploaded to the arxiv and submitted to a
peer-reviewed journal.
(*) I'll soon document the project for others to run tests and benchmarks.

Best regards,
Cassio.

On Tue, 6 Aug 2024, 18:10 Jonathan Wakely,  wrote:

>
>
> On Tue, 6 Aug 2024, 17:28 Andi Kleen,  wrote:
>
>> Cassio Neri  writes:
>>
>> > Implement the template function teju_jagua which finds the shortest
>> > representation of a floating-point number. The floating-point type is a
>> > template parameter and the implementation is generic enough to handle
>> all
>> > floating-point types of interest, namely, IEEE 754, std::bfloat16_t,
>> > x86 80-bit and IBM128.
>>
>> So the only benefit is performance, right?
>
>
> The functions should be as fast as possible.
>
> The algorithm is also simpler though.
>
> So the patch
>> should come with some performance numbers how it is better
>> than the old code. Also how did you validate that it works
>> correctly?
>>
>
> See the talks linked to in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659362.html
>
>
>
>> -Andi
>>
>>


[PATCH v1] libstdc++: More efficient weekday from year_month_day.

2025-05-05 Thread Cassio Neri
Use a better algorithm to calculate n % 7, taking into account that the divisor
is a Mersenne number. This is explained in this two-part lightning talk at
CppCon 2024 [1, 2].

Roughly speaking, the algorithm is used for n = sys_days but it has a
precondition that doesn't hold for all values of sys_days. However, it does hold
for every value coming out of year_month_day::operator sys_days().

Nevertheless, there's an issue because the conversion is done in two steps:
year_month_day -> sys_days -> weekday and in the second step the information
that the input sys_days was obtained from the first step (so the precondition
holds) is lost. In the talk a single function performs the two steps and uses
the optimization. Unless the standard adds a similar function (e.g. a weekday
constructor) or this is added as an extension (not done here), this is not an
option for libstdc++.

The issue is adressed in the following way. At the end of the fist step, the
code informs the compiler, through an [[assume]] attribute that the precondition
holds. Then, at the beginning of the second step, the code checks, through
__builtin_constant_p, if the precondition holds, in which case, it uses the
better algorithm or, otherwise, it falls back to n % 7.

The idea is illustrated in [3] which compares code generation (with and without
the optimization) and provides an exhaustive test of correctness. A benchmark is
shown in [4].

References:

[1] https://youtu.be/64mTEXnSnZs
[2] https://youtu.be/bnVkWEjRNeI
[3] https://godbolt.org/z/E14asPK8r
[4] https://quick-bench.com/q/vzxlrLaEccner3h-ahIrbB0BZvE

libstdc++-v3/ChangeLog:

* include/std/chrono: Add postcondition to year_month_date::operator
sys_days and check this condition in weekday::_S_from_days.
---
The approach might be somewhat controversial and might need to be discussed. 
The problem is that passing information from [[assume]] to __builtin_constant_p 
is
quite fragile in 3 different ways.

1) The distance in source code between [[assume]] and __builtin_constant_p makes
hard to understand why they are there in the first place. A reader who doesn't
know the full picture might feel tempted to remove either of these lines.

2) Logically, the post-condition expressed by [[assume]] only needs to imply the
condition tested by __builtin_constant_p. However, even whey they are obviously
equivalent for a human, this might not be the case for the compiler. Even
slight cosmetic changes might break the compiler's chain of thought and thus,
the optimization.

3) Similarly to 2, changes in the compiler code that handles [[assume]] or 
__builtin_constant_p might break the optimization.

If any of the issues above happens, the result will still be right but a
performance regression will appear. Comments can mitigate the issues but a
performance test or a test for code generation would be more effective to
prevent performance regressions. Unfortunately, I don't know how either of these
can be done.

 libstdc++-v3/include/std/chrono | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 8eb9fd9baac..e762dd46cc4 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -984,7 +984,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   static constexpr weekday
   _S_from_days(const days& __d)
   {
-   return weekday{__detail::__add_modulo<7>(4, __d.count())};
+   const auto __days = __d.count();
+
+   const auto __n = static_cast(__days) + 12'687'434u;
+   const auto __use_fast_mod7 = __n < 178'956'973u;
+   // This condition is true when __d = __dp.time_since_epoch() where __dp
+   // comes from year_month_day::operator sys_days().
+   if (__builtin_constant_p(__use_fast_mod7))
+   {
+ const auto __wd = 613'566'757u * __n >> 29;
+ [[assume(__wd != 7)]];
+ return weekday{__wd};
+   }
+
+   return weekday{__detail::__add_modulo<7>(4, __days)};
   }
 
 public:
@@ -1652,7 +1665,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   constexpr
   operator sys_days() const noexcept
-  { return sys_days{_M_days_since_epoch()}; }
+  {
+   const auto __days = _M_days_since_epoch();
+
+   // The assume attribute below allows weekday::_S_from_days to use a
+   // faster algorithm for modulus 7.
+   const auto __n = static_cast(__days.count()) + 12'687'434u;
+   const auto __use_fast_mod7 = __n < 178'956'973u;
+   [[assume(__use_fast_mod7)]];
+
+   return sys_days{__days};
+  }
 
   explicit constexpr
   operator local_days() const noexcept
-- 
2.49.0



[PATCH v2] libstdc++: More efficient weekday from year_month_day.

2025-05-10 Thread Cassio Neri
Use a better algorithm to calculate n % 7, taking into account that the divisor
is a Mersenne number. This is explained in this two-part lightning talk at
CppCon 2024 [1, 2].

Roughly speaking, the algorithm is used for n = sys_days but it has a
precondition that doesn't hold for all values of sys_days. However, it does hold
for every value coming out of year_month_day::operator sys_days().

Nevertheless, there's an issue because the conversion is done in two steps:
year_month_day -> sys_days -> weekday and in the second step the information
that the input sys_days was obtained from the first step (so the precondition
holds) is lost. In the talk a single function performs the two steps and uses
the optimization. Unless the standard adds a similar function (e.g. a weekday
constructor) or this is added as an extension (not done here), this is not an
option for libstdc++.

The issue is adressed in the following way. At the end of the fist step, the
code informs the compiler, through an [[assume]] attribute that the precondition
holds. Then, at the beginning of the second step, the code checks, through
__builtin_constant_p, if the precondition holds, in which case, it uses the
better algorithm or, otherwise, it falls back to n % 7.

The idea is illustrated in [3] which compares code generation (with and without
the optimization) and provides an exhaustive test of correctness. A benchmark is
shown in [4].

References:

[1] https://youtu.be/64mTEXnSnZs
[2] https://youtu.be/bnVkWEjRNeI
[3] https://godbolt.org/z/TMMcKj8Gv
[4] https://quick-bench.com/q/VPtEBjGDmmB4QEG2IDjIvfU6WzA

libstdc++-v3/ChangeLog:

* include/std/chrono: Add postcondition to year_month_date::operator
sys_days and check this condition in weekday::_S_from_days.
* testsuite/std/time/weekday/1.cc: Add tests for special values.
---
The approach might be somewhat controversial and might need to be discussed. The
problem is that passing information from [[assume]] to __builtin_constant_p is
quite fragile in 3 different ways.

1) The distance in source code between [[assume]] and __builtin_constant_p makes
hard to understand why they are there in the first place. A reader who doesn't
know the full picture might feel tempted to remove either of these lines.

2) Even cosmetic changes in the conditions of [[assume]] or 
__builtin_constant_p might break the compiler's chain of thought and thus, the 
optimization.

3) Similarly to 2, changes in the compiler code that handles [[assume]] or 
__builtin_constant_p might break the optimization.

If any of the issues above happens, the result will still be right but a
performance regression will appear. Comments can mitigate the issues but a
performance test or a test for code generation would be more effective to
prevent performance regressions. Unfortunately, I don't know how either of these
can be done.

Version history:
 v2. Fix bug in v1, add tests and improve comments.
 v1. Initial patch.
 
 libstdc++-v3/include/std/chrono  | 29 ++--
 libstdc++-v3/testsuite/std/time/weekday/1.cc | 19 +
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 8eb9fd9baac..b747f46c13f 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -984,7 +984,22 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   static constexpr weekday
   _S_from_days(const days& __d)
   {
-   return weekday{__detail::__add_modulo<7>(4, __d.count())};
+   const auto __days = __d.count();
+   using _Up = make_unsigned_t;
+   const auto __n = static_cast<_Up>(__days) + 12'687'434u;
+
+   // This condition holds when __d = sys_days(__ymd).time_since_epoch(),
+   // where __ymd's type is year_month_day.
+   const auto __use_fast_mod7 = __n < 178'956'973u;
+
+   if (__builtin_constant_p(__use_fast_mod7) && __use_fast_mod7)
+   {
+ // https://youtu.be/64mTEXnSnZs and https://youtu.be/bnVkWEjRNeI
+ const auto __wd = 613'566'757u * static_cast(__n) >> 29;
+ [[assume(__wd != 7)]];
+ return weekday{__wd};
+   }
+   return weekday{__detail::__add_modulo<7>(4, __days)};
   }
 
 public:
@@ -1652,7 +1667,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   constexpr
   operator sys_days() const noexcept
-  { return sys_days{_M_days_since_epoch()}; }
+  {
+   const auto __days = _M_days_since_epoch();
+
+   // The assume attribute below allows weekday::_S_from_days to use a
+   // faster algorithm for modulus 7.
+   const auto __n = static_cast(__days.count()) + 12'687'434u;
+   const auto __use_fast_mod7 = __n < 178'956'973u;
+   [[assume(__use_fast_mod7)]];
+
+   return sys_days{__days};
+  }
 
   explicit constexpr
   operator local_days() const noexcept
diff --git a/libstdc++-v3/testsuite/std/time/weekday/1.cc 
b/libstdc++-v3/testsuite/std/time/weekday/1.cc
i

Re: [PATCH v1] libstdc++: More efficient weekday from year_month_day.

2025-05-11 Thread Cassio Neri
Hi all,

After reflecting on my previous message and Andrew's, I now believe this
patch is not the best solution to optimise the day of the week. Instead,
the optimisation for n % 7 should be done by the compiler depending on the
platform.

I'll open a missing optimisation opportunity bug report against GCC with my
suggestions.

Therefore, expecting a better solution from the compiler, I'd ask you to
disregard this patch.

Best wishes,
Cassio.

On Sun, 11 May 2025 at 01:59, Cassio Neri  wrote:

> Thanks Andrew for your prompt reply.
>
> The results below regard my PoC which is as close to the proposed patch as
> I could make. This is because I can't have chrono with my patch on godbold
> for a comparison between current chrono and patched chrono.
>
> I tried on all platforms that I could make it to compile. Please double
> check everything because I might be misreading some results, especially on
> the platforms that I'm not familiar with. Sometimes godbold seems to have
> issues and cut pieces of the generated assembly from the output. I've
> marked these cases with (X).
>
> I suspect we want to disable this for -Os
>>
>
> Below are the sizes with -Os. Most of the time the new code is shorter
> than the old one with a few exceptions where they are the same size
> (because the platform doesn't seem to support [[assume]]). The new code is
> never longer. On each link, the middle panel shows the result for the old
> code and the right panel for the new code. These panels have tabs for
> different platforms.
>
> Old   New
>
> https://godbolt.org/z/hfz9szEWf
>
> x86-64 0x81  0x69
> ARM32  0x78  0x68
> ARM64  0x81  0x71
> ARM64 Morello  0x48  0x48
> HPPA   0xf8  0xc8
> KVX ACB0xec  0xcc
> loongarch640x94  0x8c  (x)
>
> https://godbolt.org/z/eMfzoPhT5
>
> M68K   0xb6  0xa6
> MinGW  0xa0  0x80  (X)
> mips   0xdc  0xac
> mips64 0xcc  0xb8
> mipls64 (el)   0xbc  0xa8
> mipsel 0xe0  0xb0
> MRISC320xa4  0x74
> power  0xb8  0x80
>
> https://godbolt.org/z/PjqbTqK6b
>
> power640xa8  0x8c
> power64le  0xa4  0x88
> RISC-V (32)0x90  0x7e  (X)
> RISC-V (64)0x86  0x86  (X)
> s390x  0xf0  0x90
> sh 0xc2  0xb2
> SPARC  0xc0  0x98
> SPARC LEON 0xbc  0x94
>
> https://godbolt.org/z/7oebGMYTM
>
> SPARC640xac  0x94
> TI C6x 0xc4  0x98
> Tricore0xb0  0xb0
> VAX0xc8  0xc5
>
>
>> And plus i am not 100% convinced it is best for all micro-architures.
>> Especially on say aarch64.
>> Can you do more benchmarking and peocide which exaxt core is being used?
>>
>
> I don't have access to any platform other than x86-64 to do benchmarks :-(
>
> And mention the size difference too?
>>
>
> Same exercise explained above but with -O2:
>
>  OldNew
> https://godbolt.org/z/eqGo9xnz3
>
> x86-64 0x a4  0x 72
> ARM32  0x a8  0x 74
> ARM64  0x 98  0x 80
> ARM64 Morello  0x14c  0x14c
> HPPA   0x134  0x c8
> KVX ACB0x f4  0x 98
> loongarch640x ac  0x 9c  (X)
>
> https://godbolt.org/z/7qh94zGMK
>
> M68K   0x13a  0x a2
> MinGW  0x d0  0x 80  (X)
> mips   0x11c  0x e4
> mips64 0x130  0x f0
> mipls64 (el)   0x120  0x e0
> mipsel 0x120  0x e8
> MRISC320x a0  0x 74
> power  0x dc  0x 88
>
> https://godbolt.org/z/Y11Trnqc1
>
> power640x d0  0x 94
> power64le  0x d0  0x 90
> RISC-V (32)0x bc  0x 84  (X)
> RISC-V (64)0x be  0x 94  (X)
> s390x  0x f0  0x a8
> sh 0x c6  0x cc  (*)
> SPARC  0x108  0x 9c
> SPARC LEON 0x f4  0x 94
>
> https://godbolt.org/z/h456PTEWh
>
> SPARC640x c0  0x a0
> TI C6x 0x108  0x b0
> Tricore0x e8  0x ea  (*)
> VAX0x dc  0x dc
>
> (*) These are the only cases where the new code is larger than the old one.
>
> Plus gcc knows how to do %7 via multiplication is that being used or is it
>> due to generic x86 tuning it is using the div instruction?
>>
>
> Yes and no. In x86-64 (and probably many other platforms) the current
> optimisation for n % 7 is a byproduct of the optimisation for /, that is,
> to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The
> quotient q = n / 7 is optimised to avoid div and uses a multiplication and
> other cheaper operations. In total it evaluates 2 multiplications +
> shifts + add + subs and movs. (One multiplication is q*7 which is performed
> with LEA + sub.) The algorithm that I'm suggesting, performs only one
> multiplication and one. Below are the comparisons of n % 7 and the proposed
> algorithm.
>
> https://godbolt.org/z/o7dazs4Gc
> https://godbolt.org/z/zP79736WK
> https://godbolt.org/z/65x7naMfq
> https://godbolt.org/z/z9ofaMzex
>
> I hope this helps.
> Cassio.
>
>


Re: [PATCH v1] libstdc++: More efficient weekday from year_month_day.

2025-05-10 Thread Cassio Neri
Thanks Andrew for your prompt reply.

The results below regard my PoC which is as close to the proposed patch as
I could make. This is because I can't have chrono with my patch on godbold
for a comparison between current chrono and patched chrono.

I tried on all platforms that I could make it to compile. Please double
check everything because I might be misreading some results, especially on
the platforms that I'm not familiar with. Sometimes godbold seems to have
issues and cut pieces of the generated assembly from the output. I've
marked these cases with (X).

I suspect we want to disable this for -Os
>

Below are the sizes with -Os. Most of the time the new code is shorter than
the old one with a few exceptions where they are the same size (because the
platform doesn't seem to support [[assume]]). The new code is never longer.
On each link, the middle panel shows the result for the old code and the
right panel for the new code. These panels have tabs for different
platforms.

Old   New

https://godbolt.org/z/hfz9szEWf

x86-64 0x81  0x69
ARM32  0x78  0x68
ARM64  0x81  0x71
ARM64 Morello  0x48  0x48
HPPA   0xf8  0xc8
KVX ACB0xec  0xcc
loongarch640x94  0x8c  (x)

https://godbolt.org/z/eMfzoPhT5

M68K   0xb6  0xa6
MinGW  0xa0  0x80  (X)
mips   0xdc  0xac
mips64 0xcc  0xb8
mipls64 (el)   0xbc  0xa8
mipsel 0xe0  0xb0
MRISC320xa4  0x74
power  0xb8  0x80

https://godbolt.org/z/PjqbTqK6b

power640xa8  0x8c
power64le  0xa4  0x88
RISC-V (32)0x90  0x7e  (X)
RISC-V (64)0x86  0x86  (X)
s390x  0xf0  0x90
sh 0xc2  0xb2
SPARC  0xc0  0x98
SPARC LEON 0xbc  0x94

https://godbolt.org/z/7oebGMYTM

SPARC640xac  0x94
TI C6x 0xc4  0x98
Tricore0xb0  0xb0
VAX0xc8  0xc5


> And plus i am not 100% convinced it is best for all micro-architures.
> Especially on say aarch64.
> Can you do more benchmarking and peocide which exaxt core is being used?
>

I don't have access to any platform other than x86-64 to do benchmarks :-(

And mention the size difference too?
>

Same exercise explained above but with -O2:

 OldNew
https://godbolt.org/z/eqGo9xnz3

x86-64 0x a4  0x 72
ARM32  0x a8  0x 74
ARM64  0x 98  0x 80
ARM64 Morello  0x14c  0x14c
HPPA   0x134  0x c8
KVX ACB0x f4  0x 98
loongarch640x ac  0x 9c  (X)

https://godbolt.org/z/7qh94zGMK

M68K   0x13a  0x a2
MinGW  0x d0  0x 80  (X)
mips   0x11c  0x e4
mips64 0x130  0x f0
mipls64 (el)   0x120  0x e0
mipsel 0x120  0x e8
MRISC320x a0  0x 74
power  0x dc  0x 88

https://godbolt.org/z/Y11Trnqc1

power640x d0  0x 94
power64le  0x d0  0x 90
RISC-V (32)0x bc  0x 84  (X)
RISC-V (64)0x be  0x 94  (X)
s390x  0x f0  0x a8
sh 0x c6  0x cc  (*)
SPARC  0x108  0x 9c
SPARC LEON 0x f4  0x 94

https://godbolt.org/z/h456PTEWh

SPARC640x c0  0x a0
TI C6x 0x108  0x b0
Tricore0x e8  0x ea  (*)
VAX0x dc  0x dc

(*) These are the only cases where the new code is larger than the old one.

Plus gcc knows how to do %7 via multiplication is that being used or is it
> due to generic x86 tuning it is using the div instruction?
>

Yes and no. In x86-64 (and probably many other platforms) the current
optimisation for n % 7 is a byproduct of the optimisation for /, that is,
to calculate n % 7, the generated code evaluates n - (n / 7) * 7. The
quotient q = n / 7 is optimised to avoid div and uses a multiplication and
other cheaper operations. In total it evaluates 2 multiplications + shifts
+ add + subs and movs. (One multiplication is q*7 which is performed with
LEA + sub.) The algorithm that I'm suggesting, performs only one
multiplication and one. Below are the comparisons of n % 7 and the proposed
algorithm.

https://godbolt.org/z/o7dazs4Gc
https://godbolt.org/z/zP79736WK
https://godbolt.org/z/65x7naMfq
https://godbolt.org/z/z9ofaMzex

I hope this helps.
Cassio.


Re: [PATCH v1] libstdc++: More efficient weekday from year_month_day.

2025-05-13 Thread Cassio Neri
Adding link for xref: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120264
Thanks, Cassio.

On Mon, 12 May 2025 at 10:06, Jonathan Wakely  wrote:

> On Sun, 11 May 2025 at 12:34, Cassio Neri  wrote:
> >
> > Hi all,
> >
> > After reflecting on my previous message and Andrew's, I now believe this
> patch is not the best solution to optimise the day of the week. Instead,
> the optimisation for n % 7 should be done by the compiler depending on the
> platform.
> >
> > I'll open a missing optimisation opportunity bug report against GCC with
> my suggestions.
> >
> > Therefore, expecting a better solution from the compiler, I'd ask you to
> disregard this patch.
>
> OK. Thank you both for working through this.
>
>


[PATCH 1/4] libstdc++: More efficient date from days.

2021-02-23 Thread Cassio Neri via Gcc-patches
This patch reimplements std::chrono::year_month_day::_S_from_days() which
retrieves a date from the number of elapsed days since 1970/01/01.  The new
implementation is based on Proposition 6.3 of Neri and Schneider, "Euclidean
Affine Functions and Applications to Calendar Algorithms" available at
https://arxiv.org/abs/2102.06959.

The aforementioned paper benchmarks the implementation against several
counterparts, including libc++'s (which is identical to the current
implementation).  The results, shown in Figure 4, indicate the new algorithm is
2.2 times faster than the current one.

The patch adds a test which loops through all integers in [-12687428, 11248737],
and for each of them, gets the corresponding date and compares the result
against its expected value.  The latter is calculated using a much simpler and
easy to understand algorithm but which is also much slower.

The interval used in the test covers the full range of values for which a
roundtrip must work [time.cal.ymd.members].  Despite its completeness the test
runs in a matter of seconds.

libstdc++-v3/ChangeLog:

* include/std/chrono:
* testsuite/std/time/year_month_day/3.cc: New test.
---
 libstdc++-v3/include/std/chrono   | 46 
 .../testsuite/std/time/year_month_day/3.cc| 71 +++
 2 files changed, 104 insertions(+), 13 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/year_month_day/3.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 7840099d743..d224762fd3f 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -2429,22 +2429,42 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   // TODO: Implement operator<<, from_stream.
 };

-// Construct from days since 1970/01/01. Magic.
+// Construct from days since 1970/01/01.
+// Proposition 6.3 of Neri and Schneider, "Euclidean Affine
Functions and Applications to
+// Calendar Algorithms". https://arxiv.org/abs/2102.06959
 constexpr year_month_day
 year_month_day::_S_from_days(const days& __dp) noexcept
 {
-  const auto __z = __dp.count() + 719468;
-  const auto __era = (__z >= 0 ? __z : __z - 146096) / 146097;
-  const auto __doe = static_cast(__z - __era * 146097);
-  const auto __yoe
-= (__doe - __doe / 1460 + __doe / 36524 - __doe / 146096) / 365;
-  const auto __y = static_cast(__yoe) + __era * 400;
-  const auto __doy = __doe - (365 * __yoe + __yoe / 4 - __yoe / 100);
-  const auto __mp = (5 * __doy + 2) / 153;
-  const auto __d = __doy - (153 * __mp + 2) / 5 + 1;
-  const auto __m = __mp < 10 ? __mp + 3 : __mp - 9;
-  return year_month_day{chrono::year(__y + (__m <= 2)),
-chrono::month(__m), chrono::day(__d)};
+  constexpr auto __z2= static_cast(-1468000);
+  constexpr auto __r2_e3 = static_cast(536895458);
+
+  const auto __r0 = __dp.count() + __r2_e3;
+
+  const auto __n1 = 4 * __r0 + 3;
+  const auto __q1 = __n1 / 146097;
+  const auto __r1 = __n1 % 146097 / 4;
+
+  constexpr auto __p32 = static_cast(1) << 32;
+  const auto __n2 = 4 * __r1 + 3;
+  const auto __u2 = static_cast(2939745) * __n2;
+  const auto __q2 = static_cast(__u2 / __p32);
+  const auto __r2 = static_cast(__u2 % __p32) / 2939745 / 4;
+
+  constexpr auto __p16 = static_cast(1) << 16;
+  const auto __n3 = 2141 * __r2 + 197913;
+  const auto __q3 = __n3 / __p16;
+  const auto __r3 = __n3 % __p16 / 2141;
+
+  const auto __y0 = 100 * __q1 + __q2;
+  const auto __m0 = __q3;
+  const auto __d0 = __r3;
+
+  const auto __j  = __r2 >= 306;
+  const auto __y1 = __y0 + __j;
+  const auto __m1 = __j ? __m0 - 12 : __m0;
+  const auto __d1 = __d0 + 1;
+
+  return year_month_day{chrono::year{__y1 + __z2},
chrono::month{__m1}, chrono::day{__d1}};
 }

 // Days since 1970/01/01. Magic.
diff --git a/libstdc++-v3/testsuite/std/time/year_month_day/3.cc
b/libstdc++-v3/testsuite/std/time/year_month_day/3.cc
new file mode 100644
index 000..eede649cd54
--- /dev/null
+++ b/libstdc++-v3/testsuite/std/time/year_month_day/3.cc
@@ -0,0 +1,71 @@
+// { dg-options "-std=gnu++2a" }
+// { dg-do run { target c++2a } }
+
+// Copyright (C) 2020-2021 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COP

[PATCH 2/4] libstdc++: More efficient days from date.

2021-02-23 Thread Cassio Neri via Gcc-patches
This patch reimplements std::chrono::year_month_day::_M_days_since_epoch()
which calculates the number of elapsed days since 1970/01/01.  The new
implementation is based on Proposition 6.2 of Neri and Schneider, "Euclidean
Affine Functions and Applications to Calendar Algorithms" available at
https://arxiv.org/abs/2102.06959.

The aforementioned paper benchmarks the implementation against several
counterparts, including libc++'s (which is identical to the current
implementation).  The results, shown in Figure 3, indicate the new algorithm is
1.7 times faster than the current one.

The patch adds a test which loops through all dates in [-32767/01/01,
32767/12/31], and for each of them, gets the number of days and compares the
result against its expected value. The latter is calculated using a much
simpler and easy to understand algorithm but which is also much slower.

The dates used in the test covers the full range of possible values
[time.cal.year.members].  Despite its completeness the test runs in matter of
seconds.

libstdc++-v3/ChangeLog:

* include/std/chrono:
* testsuite/std/time/year_month_day/4.cc: New test.
---
 libstdc++-v3/include/std/chrono   | 34 +
 .../testsuite/std/time/year_month_day/4.cc| 71 +++
 2 files changed, 92 insertions(+), 13 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/year_month_day/4.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 7840099d743..30203540f36 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -2447,22 +2447,30 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 chrono::month(__m), chrono::day(__d)};
 }

-// Days since 1970/01/01. Magic.
+// Days since 1970/01/01.
+// Proposition 6.2 of Neri and Schneider, "Euclidean Affine
Functions and Applications to
+// Calendar Algorithms". https://arxiv.org/abs/2102.06959
 constexpr days
 year_month_day::_M_days_since_epoch() const noexcept
 {
-  const auto __y = static_cast(_M_y) - (_M_m <= February);
-  const auto __m = static_cast(_M_m);
-  const auto __d = static_cast(_M_d);
-  const auto __era = (__y >= 0 ? __y : __y - 399) / 400;
-  // Year of "era" [0, 399].
-  const auto __yoe = static_cast(__y - __era * 400);
-  // Day of year [0, 365].
-  const auto __doy = (153 * (__m > 2 ? __m - 3 : __m + 9) + 2) /
5 + __d - 1;
-  // Day of "era" [0, 146096].
-  const auto __doe = __yoe * 365 + __yoe / 4 - __yoe / 100 + __doy;
-  const auto __days = __era * 146097 + static_cast(__doe) - 719468;
-  return days{__days};
+  auto constexpr __z2= static_cast(-1468000);
+  auto constexpr __r2_e3 = static_cast(536895458);
+
+  const auto __y1 = static_cast(static_cast(_M_y)) - __z2;
+  const auto __m1 = static_cast(_M_m);
+  const auto __d1 = static_cast(_M_d);
+
+  const auto __j  = static_cast(__m1 < 3);
+  const auto __y0 = __y1 - __j;
+  const auto __m0 = __j ? __m1 + 12 : __m1;
+  const auto __d0 = __d1 - 1;
+
+  const auto __q1 = __y0 / 100;
+  const auto __yc = 1461 * __y0 / 4 - __q1 + __q1 / 4;
+  const auto __mc = (979 *__m0 - 2919) / 32;
+  const auto __dc = __d0;
+
+  return days{static_cast(__yc + __mc + __dc - __r2_e3)};
 }

 // YEAR_MONTH_DAY_LAST
diff --git a/libstdc++-v3/testsuite/std/time/year_month_day/4.cc
b/libstdc++-v3/testsuite/std/time/year_month_day/4.cc
new file mode 100644
index 000..2194af17775
--- /dev/null
+++ b/libstdc++-v3/testsuite/std/time/year_month_day/4.cc
@@ -0,0 +1,71 @@
+// { dg-options "-std=gnu++2a" }
+// { dg-do run { target c++2a } }
+
+// Copyright (C) 2020-2021 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// .
+
+// Class year_month_day [time.cal.year_month_day]
+
+#include 
+#include 
+
+// Slow but very clear way of advancing one day.
+constexpr void
+advance(std::chrono::year_month_day& ymd) noexcept {
+
+  using namespace std::chrono;
+
+  auto y = ymd.year();
+  auto m = ymd.month();
+  auto d = ymd.day();
+
+  if (d != year_month_day_last{year{y}, month_day_last{m}}.day())
+++d;
+  else {
+d = day{1};
+if (m != December)
+  ++m;
+else {
+  m = January;
+  ++y;
+}
+  }
+  ym

[PATCH 3/4] libstdc++: More efficient is_leap.

2021-02-23 Thread Cassio Neri via Gcc-patches
This patch reimplements std::chrono::year::is_leap().  Leap year check is
ubiquitously implemented (including here) as:

y % 4 == 0 && (y % 100 != 0 || y % 400 == 0).

The rationale being that testing divisibility by 4 first implies an earlier
return for 75% of the cases, therefore, avoiding the needless calculations of
y % 100 and y % 400. Although this fact is true, it does not take into account
the cost of branching.  This patch, instead, tests divisibility by 100 first:

(y % 100 != 0 || y % 400 == 0) && y % 4 == 0.

It is certainly counterintuitive that this could be more efficient since among
the three divisibility tests (4, 100 and 400) the one by 100 is the only one
that can never provide a definitive answer and a second divisibility test (by 4
or 400) is always required. However, measurements [1] in x86_64 suggest this is
3x more efficient!  A possible explanation is that checking divisibility by 100
first implies a split in the execution path with probabilities of (1%, 99%)
rather than (25%, 75%) when divisibility by 4 is checked first.  This decreases
the entropy of the branching distribution which seems to help prediction.

Given that y belongs to [-32767, 32767] [time.cal.year.members], a more
efficient algorithm [2] to check divisibility by 100 is used (instead of
y % 100 != 0).  Measurements suggest that this optimization improves performance
by 20%.

The patch adds a test that exhaustively compares the result of this
implementation with the ubiquitous one for all y in [-32767, 32767]. Although
its completeness, the test completes in a matter of seconds.

References:
[1] https://stackoverflow.com/a/60646967/1137388
[2] https://accu.org/journals/overload/28/155/overload155.pdf#page=16

libstdc++-v3/ChangeLog:

* include/std/chrono:
* testsuite/std/time/year/2.cc: New test.
---
 libstdc++-v3/include/std/chrono   | 19 -
 libstdc++-v3/testsuite/std/time/year/2.cc | 52 +++
 2 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 libstdc++-v3/testsuite/std/time/year/2.cc

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 7840099d743..b777acdd4a2 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1597,7 +1597,24 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

   constexpr bool
   is_leap() const noexcept
-  { return _M_y % 4 == 0 && (_M_y % 100 != 0 || _M_y % 400 == 0); }
+  {
+// Testing divisibility by 100 first gives better performance, that is,
+// return (_M_y % 100 != 0 || _M_y % 400 == 0) && _M_y % 4 == 0;
+
+// It gets even faster if _M_y is in [-536870800, 536870999]
(which is the case here) and
+// _M_y % 100 is replaced by __is_multiple_of_100 below.
+
+// References:
+// [1] https://github.com/cassioneri/calendar
+// [2]
https://accu.org/journals/overload/28/155/overload155.pdf#page=16
+
+constexpr uint32_t __multiplier   = 42949673;
+constexpr uint32_t __bound= 42949669;
+constexpr uint32_t __max_dividend = 1073741799;
+constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
+const bool __is_multiple_of_100 = __multiplier * (_M_y +
__offset) < __bound;
+return (!__is_multiple_of_100 || _M_y % 400 == 0) && _M_y % 4 == 0;
+  }

   explicit constexpr
   operator int() const noexcept
diff --git a/libstdc++-v3/testsuite/std/time/year/2.cc
b/libstdc++-v3/testsuite/std/time/year/2.cc
new file mode 100644
index 000..e22101e305a
--- /dev/null
+++ b/libstdc++-v3/testsuite/std/time/year/2.cc
@@ -0,0 +1,52 @@
+// { dg-options "-std=gnu++2a" }
+// { dg-do run { target c++2a } }
+
+// Copyright (C) 2020-2021 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// .
+
+// Class year [time.cal.year_month_day]
+
+#include 
+#include 
+
+// Slow but clear test for leap year.
+constexpr bool
+is_leap_year(const std::chrono::year& y) noexcept
+{
+  const int n = static_cast(y);
+  return n % 4 == 0 && (n % 100 != 0 || n % 400 == 0);
+}
+
+void test01()
+{
+  using namespace std::chrono;
+
+  year y{-32767};
+  while (y < year{32767}) {
+VERIFY( y.is_leap() ==  is_leap_year(y) );
+++y;
+  }
+
+  // One more for y = 32767.
+  VERIFY( year{3

[PATCH 4/4] libstdc++: More efficient last day of month.

2021-02-23 Thread Cassio Neri via Gcc-patches
This patch reimplements std::chrono::year_month_day_last:day() which yields the
last day of a particular month.  The current implementation uses a look-up table
implemented as an unsigned[12] array.  The new implementation instead
is based on
the fact that a month m in [1, 12], except for m == 2 (February), is
either 31 or
30 days long and m's length depends on two things: m's parity and whether m >= 8
or not. These two conditions are determined by the 0th and 3th bit of m and,
therefore, cheap and straightforward bit-twiddling can provide the right result.

Measurements in x86_64 [1] suggest a 10% performance boost.  Although this does
not seem to be huge, notice that measurements are done in hot L1 cache
conditions which might not be very representative of production runs. Also
freeing L1 cache from holding the look-up table might allow performance
improvements elsewhere.

References:
[1] https://github.com/cassioneri/calendar

libstdc++-v3/ChangeLog:

* include/std/chrono:
---
 libstdc++-v3/include/std/chrono | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 7840099d743..35a7a5e4382 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1269,9 +1269,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION

   inline constexpr unsigned __days_per_month[12]
 = { 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
-
-  inline constexpr unsigned __last_day[12]
-= { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
 }

 // DAY
@@ -2526,9 +2523,23 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr chrono::day
   day() const noexcept
   {
-if (!_M_mdl.ok() || (month() == February && _M_y.is_leap()))
-  return chrono::day{29};
-return chrono::day{__detail::__last_day[unsigned(month()) - 1]};
+const auto __m = static_cast(month());
+
+// Excluding February, the last day of month __m is either 30 or 31 or,
+// in another words, it is 30 + b = 30 | b, where b is in {0, 1}.
+
+// If __m in {1, 3, 4, 5, 6, 7}, then b is 1 if, and only if
__m is odd.
+// Hence, b = __m & 1 = (__m ^ 0) & 1.
+
+// If __m in {8, 9, 10, 11, 12}, then b is 1 if, and only if
__m is even.
+// Hence, b = (__m ^ 1) & 1.
+
+// Therefore, b = (__m ^ c) & 1, where c = 0, if __m < 8, or c = 1 if
+// __m >= 8, that is, c = __m >> 3.
+
+// The above mathematically justifies this implementation whose
+// performance does not depend on look-up tables being on the L1 cache.
+return chrono::day{__m != 2 ? ((__m ^ (__m >> 3)) & 1) | 30 :
_M_y.is_leap() ? 29 : 28};
   }

   constexpr
-- 
2.29.2


Re: [PATCH 1/4] libstdc++: More efficient date from days.

2021-02-25 Thread Cassio Neri via Gcc-patches
Hi Jonathan,

The issue is that I didn't cast __dp.count() to uint32_t:

-  const auto __r0 = __dp.count() + __r2_e3;
+  const auto __r0 = static_cast(__dp.count()) + __r2_e3;

The above would be a better fix. Indeed, __r0 belongs to [0, 2^32[ which allows
all arithmetics that follow to be performed on uint32_t values. For performance
this is better than using signed integers.

I wrongly assumed __dp.count() was int32_t which would make __r0 also uint32_t.
It turns out, it is int64_t so are __r0 and other expressions including
__y1 + __z2. Hence, we're losing a bit of performance. I'm glad the compiler
warned us. (Although I don't understand why I didn't get the warning.)

Thanks,
Cassio.

On Thu, Feb 25, 2021 at 11:56 AM Jonathan Wakely  wrote:
>
> On 24/02/21 17:28 +, Jonathan Wakely wrote:
> >On 23/02/21 13:24 +, Cassio Neri via Libstdc++ wrote:
> >>This patch reimplements std::chrono::year_month_day::_S_from_days() which
> >>retrieves a date from the number of elapsed days since 1970/01/01.  The new
> >>implementation is based on Proposition 6.3 of Neri and Schneider, "Euclidean
> >>Affine Functions and Applications to Calendar Algorithms" available at
> >>https://arxiv.org/abs/2102.06959.
> >>
> >>The aforementioned paper benchmarks the implementation against several
> >>counterparts, including libc++'s (which is identical to the current
> >>implementation).  The results, shown in Figure 4, indicate the new 
> >>algorithm is
> >>2.2 times faster than the current one.
> >>
> >>The patch adds a test which loops through all integers in [-12687428, 
> >>11248737],
> >>and for each of them, gets the corresponding date and compares the result
> >>against its expected value.  The latter is calculated using a much simpler 
> >>and
> >>easy to understand algorithm but which is also much slower.
> >>
> >>The interval used in the test covers the full range of values for which a
> >>roundtrip must work [time.cal.ymd.members].  Despite its completeness the 
> >>test
> >>runs in a matter of seconds.
> >>
> >>libstdc++-v3/ChangeLog:
> >>
> >>   * include/std/chrono:
> >>   * testsuite/std/time/year_month_day/3.cc: New test.
> >
> >Thanks! I'm committing this to trunk (it only changes new C++20
> >material so OK during stage 4 ... and anyway it's both faster and
> >better tested than the old code).
> >
> >I've tweaked it slightly to keep some lines below 80 columns, but no
> >changes except whitespace.
>
> I've made the attached tweak, to avoid a narrowing conversion from
> long to int. Tested x86_64-linux, committed to trunk.
>


[PATCH] libstdc++: More efficient std::chrono::year::leap.

2021-05-21 Thread Cassio Neri via Gcc-patches
Simple change to std::chrono::year::is_leap. If a year is multiple of 100,
then it's divisible by 400 if and only if it's divisible by 16. The latter
allows for better code generation.

Tested on x86_64-pc-linux-gnu.

libstdc++-v3/ChangeLog:

* include/std/chrono:

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 4631a727d73..6221d02c1b3 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1612,7 +1612,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
  const bool __is_multiple_of_100
   = __multiplier * (_M_y + __offset) < __bound;
- return (!__is_multiple_of_100 || _M_y % 400 == 0) && _M_y % 4 == 0;
+ return (!__is_multiple_of_100 || _M_y % 16 == 0) && _M_y % 4 == 0;
   }

   explicit constexpr


Re: [PATCH] libstdc++: More efficient std::chrono::year::leap.

2021-05-21 Thread Cassio Neri via Gcc-patches
I've checked the generated code and the compiler doesn't figure out
the logic. I added a comment to explain.

(Revised patch below and attached.)

Best wishes,
Cassio.

---

Simple change to std::chrono::year::is_leap. If a year is multiple of 100,
then it's divisible by 400 if and only if it's divisible by 16. The latter
allows for better code generation.

Tested on x86_64-pc-linux-gnu.

libstdc++-v3/ChangeLog:
libstdc++-v3/ChangeLog:

* include/std/chrono:

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 4631a727d73..85aa0379432 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1612,7 +1612,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
 const bool __is_multiple_of_100
   = __multiplier * (_M_y + __offset) < __bound;
-return (!__is_multiple_of_100 || _M_y % 400 == 0) && _M_y % 4 == 0;
+// Usually we test _M_y % 400 == 0 but, when it's already known that
+// _M_y%100 == 0, then _M_y % 400==0 is equivalent to _M_y % 16 == 0.
+return (!__is_multiple_of_100 || _M_y % 16 == 0) && _M_y % 4 == 0;
   }

   explicit constexpr

On Fri, May 21, 2021 at 7:05 PM Koning, Paul  wrote:
>
>
>
> > On May 21, 2021, at 1:46 PM, Cassio Neri via Gcc-patches 
> >  wrote:
> >
> > Simple change to std::chrono::year::is_leap. If a year is multiple of 100,
> > then it's divisible by 400 if and only if it's divisible by 16. The latter
> > allows for better code generation.
>
> I wonder if the optimizer could be taught to do that.
>
> The change seems correct but it is very confusing; at the very least the 
> reasoning you gave should be stated in a comment on that check.
>
> paul
>
>
Simple change to std::chrono::year::is_leap. If a year is multiple of 100,
then it's divisible by 400 if and only if it's divisible by 16. The latter
allows for better code generation.

Tested on x86_64-pc-linux-gnu.

libstdc++-v3/ChangeLog:
libstdc++-v3/ChangeLog:

	* include/std/chrono:

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 4631a727d73..85aa0379432 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -1612,7 +1612,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	constexpr uint32_t __offset   = __max_dividend / 2 / 100 * 100;
 	const bool __is_multiple_of_100
 	  = __multiplier * (_M_y + __offset) < __bound;
-	return (!__is_multiple_of_100 || _M_y % 400 == 0) && _M_y % 4 == 0;
+	// Usually we test _M_y % 400 == 0 but, when it's already known that
+	// _M_y%100 == 0, then _M_y % 400==0 is equivalent to _M_y % 16 == 0.
+	return (!__is_multiple_of_100 || _M_y % 16 == 0) && _M_y % 4 == 0;
   }

   explicit constexpr


Re: [PATCH] libstdc++: More efficient std::chrono::year::leap.

2021-06-24 Thread Cassio Neri via Gcc-patches
Thanks Jonathan.

Here are some benchmarks (assembly in [1]):
https://quick-bench.com/q/jclBXmi4QLDcRMLuuVpxTUsFmQw

Unfortunately, quick-bench times out unless some implementations are
commented out. You can copy the code and run it locally (needs google
benchmark) to get the full picture.

I really like Ulrich Drepper's implementation (version 8). GCC 11 emmits
really good code and looking at previous versions this has been very
consistent. This implementation also highlights very clearly that my hack
to calculate __is_multiple_of_100 (as opposed to % 100 used in version 7)
saves a ror instruction with remarkable performance effects.

In my AMD Ryzen 7 box, his implementation doesn't seem to be the fastest.
Nevertheless, compared to other fast alternatives, performance differences
are small and results seem very much platform dependent. In my opinion, his
implementation is the best, especially, given recent compiler changes. My
reasoning follows.

My original motivation was contrasting 'year % 400 == 0' with 'year % 16'
as in versions 1 and 2:

  __is_multiple_of_100 ? year % 400 == 0 : year % 4 == 0; // version 1
  __is_multiple_of_100 ? year % 16 == 0 : year % 4 == 0; // version 2

It's fair to expect that version 2 won't be slower than version 1. However,
this is the case and by quite a big margin. The emitted instructions are
very reasonable and, I guess, the issue is the choice of registers. I've
reimplemented half of version 2 in inline asm using similar register
choices as version 1 and got the performance boost that I was expecting.
(By no means I'm suggesting a non portable implementation here. It was just
an exercise to make my point. Also, it performs very poorly when compiled
by clang.)

The point here is that code emitted for sources like versions 1 and 2
(involving %) seems sensitive to very small variations and has been
changing across compiler releases in the last 3 years or so. (This can be
checked on godbolt.) Clang also seems very confused [2]. Even if the
emitted code for version 2 and others were good today, I fear a new
compiler release could regress. The stability of the code emitted for
Ulrich Drepper's implementation is much more reliable, its performance is
very good and its clarity is only compromised by my own hack for
__is_multiple_of_100. (Recall, however, that this hack is crucial for his
implementation's performance.)

Best wishes,
Cassio.

[1] https://godbolt.org/z/TbGYEqx33
[2] https://bugs.llvm.org/show_bug.cgi?id=50334

On Wed, Jun 23, 2021 at 6:52 PM Jonathan Wakely  wrote:

> On 23/06/21 18:51 +0100, Jonathan Wakely wrote:
> >Here's what I've committed. Tested x86_64-linux and powerpc64le-linux.
> >Pushed to trunk.
> >
> >
> >
>
> >commit b92d12d3fe3f1aa56d190d960e40c62869a6cfbb
> >Author: Cassio Neri 
> >Date:   Wed Jun 23 15:32:16 2021
> >
> >libstdc++: More efficient std::chrono::year::leap
> >
> >Simple change to std::chrono::year::is_leap. If a year is multiple of
> 100,
> >then it's divisible by 400 if and only if it's divisible by 16. The
> latter
> >allows for better code generation.
> >
> >The expression is then either y%16 or y%4 which are both powers of two
> >and so it can be rearranged to use simple bitmask operations.
> >
> >Co-authored-by: Jonathan Wakely 
> >Co-authored-by: Ulrich Drepper 
> >
> >libstdc++-v3/ChangeLog:
> >
> >* include/std/chrono (chrono::year::is_leap()): Optimize.
> >
> >diff --git a/libstdc++-v3/include/std/chrono
> b/libstdc++-v3/include/std/chrono
> >index 4631a727d73..863b6a27bdf 100644
> >--- a/libstdc++-v3/include/std/chrono
> >+++ b/libstdc++-v3/include/std/chrono
> >@@ -1606,13 +1606,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >   // [1] https://github.com/cassioneri/calendar
> >   // [2]
> https://accu.org/journals/overload/28/155/overload155.pdf#page=16
> >
> >+  // Furthermore, if y%100 != 0, then y%400==0 is equivalent to
> y%16==0,
> >+  // so we can rearrange the expression to (mult_100 ? y % 4 : y %
> 16)==0
>
> But Ulrich pointed out I got my boolean logic all muddled up in the
> comment. Fixed with the attached patch!
>
>