https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113604

--- Comment #12 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <ja...@gcc.gnu.org>:

https://gcc.gnu.org/g:fbb569315a291d2d5b32ad0fdaf0c42da9f5e93b

commit r14-8761-gfbb569315a291d2d5b32ad0fdaf0c42da9f5e93b
Author: Jakub Jelinek <ja...@redhat.com>
Date:   Fri Feb 2 22:14:33 2024 +0100

    libgcc: Fix up _BitInt division [PR113604]

    The following testcase ends up with SIGFPE in __divmodbitint4.
    The problem is a thinko in my attempt to implement Knuth's algorithm.

    The algorithm does (where b is 65536, i.e. one larger than what
    fits in their unsigned short word):
            // Compute estimate qhat of q[j].
            qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
            rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
          again:
            if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])
            { qhat = qhat - 1;
              rhat = rhat + vn[n-1];
              if (rhat < b) goto again;
            }
    The problem is that it uses a double-word / word -> double-word
    division (and modulo), while all we have is udiv_qrnnd unless
    we'd want to do further library calls, and udiv_qrnnd is a
    double-word / word -> word division and modulo.
    Now, as the algorithm description says, it can produce at most
    word bits + 1 bit quotient.  And I believe that actually the
    highest qhat the original algorithm can produce is
    (1 << word_bits) + 1.  The algorithm performs earlier canonicalization
    where both the divisor and dividend are shifted left such that divisor
    has msb set.  If it has msb set already before, no shifting occurs but
    we start with added 0 limb, so in the first uv1:uv0 double-word uv1
    is 0 and so we can't get too high qhat, if shifting occurs, the first
    limb of dividend is shifted right by UWtype bits - shift count into
    a new limb, so again in the first iteration in the uv1:uv0 double-word
    uv1 doesn't have msb set while vv1 does and qhat has to fit into word.
    In the following iterations, previous iteration should guarantee that
    the previous quotient digit is correct.  Even if the divisor was the
    maximal possible vv1:all_ones_in_all_lower_limbs, if the old
uv0:lower_limbs
    would be larger or equal to the divisor, the previous quotient digit
    would increase and another divisor would be subtracted, which I think
    implies that in the next iteration in uv1:uv0 double-word uv1 <= vv1,
    but uv0 could be up to all ones, e.g. in case of all lower limbs
    of divisor being all ones and at least one dividend limb below uv0
    being not all ones.  So, we can e.g. for 64-bit UWtype see
    uv1:uv0 / vv1 0x8000000000000000UL:0xffffffffffffffffUL /
0x8000000000000000UL
    or 0xffffffffffffffffUL:0xffffffffffffffffUL / 0xffffffffffffffffUL
    In all these cases (when uv1 == vv1 && uv0 >= uv1), qhat is
    0x10000000000000001UL, i.e. 2 more than fits into UWtype result,
    if uv1 == vv1 && uv0 < uv1 it would be 0x10000000000000000UL, i.e.
    1 more than fits into UWtype result.
    Because we only have udiv_qrnnd which can't deal with those too large
    cases (SIGFPEs or otherwise invokes undefined behavior on those), I've
    tried to handle the uv1 >= vv1 case separately, but for one thing
    I thought it would be at most 1 larger than what fits, and for two
    have actually subtracted vv1:vv1 from uv1:uv0 instead of subtracting
    0:vv1 from uv1:uv0.
    For the uv1 < vv1 case, the implementation already performs roughly
    what the algorithm does.
    Now, let's see what happens with the two possible extra cases in
    the original algorithm.
    If uv1 == vv1 && uv0 < uv1, qhat above would be b, so we take
    if (qhat >= b, decrement qhat by 1 (it becomes b - 1), add
    vn[n-1] aka vv1 to rhat and goto again if rhat < b (but because
    qhat already fits we can goto to the again label in the uv1 < vv1
    code).  rhat in this case is uv0 and rhat + vv1 can but doesn't
    have to overflow, say for uv0 42UL and vv1 0x8000000000000000UL
    it will not (and so we should goto again), while for uv0
    0x8000000000000000UL and vv1 0x8000000000000001UL it will (and
    we shouldn't goto again).
    If uv1 == vv1 && uv0 >= uv1, qhat above would be b + 1, so we
    take if (qhat >= b, decrement qhat by 1 (it becomes b), add
    vn[n-1] aka vv1 to rhat. But because vv1 has msb set and
    rhat in this case is uv0 - vv1, the rhat + vv1 addition
    certainly doesn't overflow, because (uv0 - vv1) + vv1 is uv0,
    so in the algorithm we goto again, again take if (qhat >= b and
    decrement qhat so it finally becomes b - 1, and add vn[n-1]
    aka vv1 to rhat again.  But this time I believe it must always
    overflow, simply because we added (uv0 - vv1) + vv1 + vv1 and
    vv1 has msb set, so already vv1 + vv1 must overflow.  And
    because it overflowed, it will not goto again.
    So, I believe the following patch implements this correctly, by
    subtracting vv1 from uv1:uv0 double-word once, then comparing
    again if uv1 >= vv1.  If that is true, subtract vv1 from uv1:uv0
    again and add 2 * vv1 to rhat, no __builtin_add_overflow is needed
    as we know it always overflowed and so won't goto again.
    If after the first subtraction uv1 < vv1, use __builtin_add_overflow
    when adding vv1 to rhat, because it can but doesn't have to overflow.

    I've added an extra testcase which tests the behavior of all the changed
    cases, so it has a case where uv1:uv0 / vv1 is 1:1, where it is
    1:0 and rhat + vv1 overflows and where it is 1:0 and rhat + vv1 does not
    overflow, and includes tests also from Zdenek's other failing tests.

    2024-02-02  Jakub Jelinek  <ja...@redhat.com>

            PR libgcc/113604
            * libgcc2.c (__divmodbitint4): If uv1 >= vv1, subtract
            vv1 from uv1:uv0 once or twice as needed, rather than
            subtracting vv1:vv1.

            * gcc.dg/torture/bitint-53.c: New test.
            * gcc.dg/torture/bitint-55.c: New test.

Reply via email to