[PATCH v5 00/33] libgcc: Thumb-1 Floating-Point Library for Cortex M0

Daniel Engel Fri, 15 Jan 2021 03:31:20 -0800

Changes since v4: 

* Revised all commit messages per GCC standard form. 
* Split preamble patch 1 into 4 distinct changes. 
* Flattened previously-created directory "bits".
* Added patch to fix unified syntax compiler warnings.
* Moved CFI macro changes to preamble patch 1. 
* Added interim copyright message to refactored files. 
* Added explanation and usage comments for the IT() macro.
* Renamed new __ARM_FEATURE_IT macro as __HAVE_FEATURE_IT.


---

This patch series adds an assembly-language implementation of IEEE-754 compliant
single-precision functions designed for the Cortex M0 (v6m) architecture.  There
are improvements to most of the EABI integer functions as well.  This is the
libgcc component of a larger library project originally proposed in 2018:

    https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html

As one point of comparison, a test program [1] links 916 bytes from libgcc with
the patched toolchain vs 10276 bytes with the gcc-arm-none-eabi-9-2020-q2
toolchain.  That's a size reduction of about 91%.

I have extensive test vectors [2], and this patch series passes all tests on an
STM32F051.  These vectors were derived from UCB [3], Testfloat [4], and
IEEECC754 [5], plus many of my own generation.
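
As a rough illustration of how one single-precision test vector can be checked,
the sketch below compares the raw bits of a computed sum against an expected
bit pattern.  The vec_t layout and the helper names are hypothetical; they are
not the format of the vectors in [2].  On a soft-float ARM target the addition
is expected to lower to a call to __aeabi_fadd; on other hosts it simply
exercises the native FPU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vector layout: two operands and the expected result,
   all stored as raw IEEE-754 binary32 bit patterns.  */
typedef struct { uint32_t a, b, expect; } vec_t;

static float bits_to_float (uint32_t u)
{
    float f;
    memcpy (&f, &u, sizeof f);   /* type-pun without aliasing problems */
    return f;
}

static uint32_t float_to_bits (float f)
{
    uint32_t u;
    memcpy (&u, &f, sizeof u);
    return u;
}

/* Returns 1 if a + b matches the expected bit pattern exactly.
   The volatiles keep the compiler from folding the addition away.
   NaN results would need a looser comparison than bit equality.  */
int check_add_vector (const vec_t *v)
{
    volatile float a = bits_to_float (v->a);
    volatile float b = bits_to_float (v->b);
    volatile float r = a + b;
    assert (sizeof (float) == sizeof (uint32_t));
    return float_to_bits (r) == v->expect;
}
```

A bit-exact comparison like this is what a "<= 0.5 ulp" claim implies for
round-to-nearest: the result must be the representable value nearest to the
exact sum, with ties going to the even mantissa.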

There may be some follow-on projects worth discussing:

    * The library is currently integrated into the ARM v6s-m multilib only.  It
    is likely that some other architectures would benefit from these routines.
    However, I have NOT profiled the existing implementations (ieee754-sf.S) to
    estimate where improvements may be found.

    * GCC currently lacks tests for some functions, such as __aeabi_[u]ldivmod().
    There may be useful bits in [1] that can be integrated.
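
A minimal sketch of what a test for the 64-bit division helpers might look
like follows.  It is not taken from [1] or from the GCC testsuite, and the
function names are hypothetical.  It assumes that, on an ARM EABI target
without a hardware divider, the compiler lowers the 64-bit '/' and '%'
operators to __aeabi_uldivmod and __aeabi_ldivmod; on other hosts the same
code only checks native division.

```c
#include <assert.h>
#include <stdint.h>

/* Magnitude of a signed value, computed without overflow for INT64_MIN.  */
static uint64_t mag64 (int64_t v)
{
    return v < 0 ? 0 - (uint64_t) v : (uint64_t) v;
}

/* Unsigned check: q * d + r must reconstruct n, and r must be below d.
   The caller must not pass d == 0.  */
int check_udivmod (uint64_t n, uint64_t d)
{
    volatile uint64_t vn = n, vd = d;   /* defeat constant folding */
    uint64_t q = vn / vd;
    uint64_t r = vn % vd;
    return r < d && q * d + r == n;
}

/* Signed check: C99 division truncates toward zero, so a nonzero
   remainder carries the sign of the dividend.  The caller must not pass
   d == 0 or the overflowing pair (INT64_MIN, -1).  */
int check_sdivmod (int64_t n, int64_t d)
{
    volatile int64_t vn = n, vd = d;
    int64_t q = vn / vd;
    int64_t r = vn % vd;
    int sign_ok = (r == 0) || ((r < 0) == (n < 0));
    return sign_ok && mag64 (r) < mag64 (d) && q * d + r == n;
}

/* Sweep a few boundary operands; returns the number of failures.  */
int run_udivmod_checks (void)
{
    static const uint64_t ops[] = {
        1, 2, 3, 10, 0xFFFFFFFFull, 0x100000000ull,
        0x8000000000000000ull, 0xFFFFFFFFFFFFFFFFull
    };
    const int count = (int) (sizeof ops / sizeof ops[0]);
    int failures = 0;
    for (int i = 0; i < count; i++)
        for (int j = 0; j < count; j++)
            if (!check_udivmod (ops[i], ops[j]))
                failures++;
    return failures;
}
```

Checking the identity n == q * d + r covers quotient and remainder together,
which matches the way the combined divmod helpers return both values.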

On Cortex M0, the library has (approximately) the following properties:

Function(s)                     Size (bytes)        Cycles              Stack   Accuracy
__clzsi2                        50                  20                  0       exact
__clzsi2 (OPTIMIZE_SIZE)        22                  51                  0       exact
__clzdi2                        8+__clzsi2          4+__clzsi2          0       exact

__clrsbsi2                      8+__clzsi2          6+__clzsi2          0       exact
__clrsbdi2                      18+__clzsi2         (8..10)+__clzsi2    0       exact

__ctzsi2                        52                  21                  0       exact
__ctzsi2 (OPTIMIZE_SIZE)        24                  52                  0       exact
__ctzdi2                        8+__ctzsi2          5+__ctzsi2          0       exact

__ffssi2                        8                   6..(5+__ctzsi2)     0       exact
__ffsdi2                        14+__ctzsi2         9..(8+__ctzsi2)     0       exact

__popcountsi2                   52                  25                  0       exact
__popcountsi2 (OPTIMIZE_SIZE)   14                  9..201              0       exact
__popcountdi2                   34+__popcountsi2    46                  0       exact
__popcountdi2 (OPTIMIZE_SIZE)   12+__popcountsi2    17..401             0       exact

__paritysi2                     24                  14                  0       exact
__paritysi2 (OPTIMIZE_SIZE)     16                  38                  0       exact
__paritydi2                     2+__paritysi2       1+__paritysi2       0       exact

__umulsidi3                     44                  24                  0       exact
__mulsidi3                      30+__umulsidi3      24+__umulsidi3      8       exact
__muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3       0       exact
__ashldi3 (__aeabi_llsl)        22                  13                  0       exact
__lshrdi3 (__aeabi_llsr)        22                  13                  0       exact
__ashrdi3 (__aeabi_lasr)        22                  13                  0       exact

__aeabi_lcmp                    20                  13                  0       exact
__aeabi_ulcmp                   16                  10                  0       exact

__udivsi3 (__aeabi_uidiv)       56                  72..385             0       < 1 lsb
__divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3        8       < 1 lsb
__udivdi3 (__aeabi_uldiv)       164                 103..1394           16      < 1 lsb
__udivdi3 (OPTIMIZE_SIZE)       142                 120..1392           16      < 1 lsb
__divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3        32      < 1 lsb

__shared_float                  178
__shared_float (OPTIMIZE_SIZE)  154

__addsf3 (__aeabi_fadd)         116+__shared_float  31..76              8       <= 0.5 ulp
__addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74                  8       <= 0.5 ulp
__subsf3 (__aeabi_fsub)         6+__addsf3          3+__addsf3          8       <= 0.5 ulp
__aeabi_frsub                   8+__addsf3          6+__addsf3          8       <= 0.5 ulp
__mulsf3 (__aeabi_fmul)         112+__shared_float  73..97              8       <= 0.5 ulp
__mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93                  8       <= 0.5 ulp
__divsf3 (__aeabi_fdiv)         132+__shared_float  83..361             8       <= 0.5 ulp
__divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263..359            8       <= 0.5 ulp

__cmpsf2/__lesf2/__ltsf2        72                  33                  0       exact
__eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2          0       exact
__gesf2/__gtsf2                 4+__cmpsf2          3+__cmpsf2          0       exact
__unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2          0       exact
__aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2          0       exact
__aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2          0       exact
__aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2          0       exact
__aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2          0       exact
__aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2          0       exact

__floatundisf (__aeabi_ul2f)    14+__shared_float   40..81              8       <= 0.5 ulp
__floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40..237             8       <= 0.5 ulp
__floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf     8       <= 0.5 ulp
__floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf     8       <= 0.5 ulp
__floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf       8       <= 0.5 ulp

__fixsfdi (__aeabi_f2lz)        74                  27..33              0       exact
__fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi         0       exact
__fixsfsi (__aeabi_f2iz)        52                  19                  0       exact
__fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi         0       exact
__fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi         0       exact

__extendsfdf2 (__aeabi_f2d)     42+__shared_float   38                  8       exact
__truncsfdf2 (__aeabi_f2d)      88                  34                  8       exact
__aeabi_d2f                     56+__shared_float   54..58              8       <= 0.5 ulp
__aeabi_h2f                     34+__shared_float   34                  8       exact
__aeabi_f2h                     84                  23..34              0       <= 0.5 ulp
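
For readers unfamiliar with the integer bit functions in the table, the sketch
below gives portable C reference models of what they compute.  These are
illustrative models only, not the assembly in this series, and the ref_* names
are hypothetical.  The two popcount variants show the kind of size/speed
trade-off that an OPTIMIZE_SIZE build makes: a short data-dependent loop
versus a longer fixed-time sequence.  The cross-check uses the GCC builtins,
which on a Cortex M0 target are expected to call the corresponding libgcc
routines.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros.  libgcc leaves the result for 0 undefined;
   this model returns 32.  */
int ref_clz (uint32_t x)
{
    int n = 0;
    for (uint32_t m = 0x80000000u; m && !(x & m); m >>= 1)
        n++;
    return n;
}

/* Count trailing zeros; this model returns 32 for 0.  */
int ref_ctz (uint32_t x)
{
    int n = 0;
    for (uint32_t m = 1; m && !(x & m); m <<= 1)
        n++;
    return n;
}

/* Find first set: one plus the index of the lowest set bit, or 0.  */
int ref_ffs (uint32_t x)
{
    return x ? ref_ctz (x) + 1 : 0;
}

/* Count redundant sign bits: copies of the sign bit below bit 31.  */
int ref_clrsb (int32_t x)
{
    uint32_t sign = x < 0 ? 0xFFFFFFFFu : 0u;
    return ref_clz ((uint32_t) x ^ sign) - 1;
}

/* Small popcount: one iteration per set bit, so time varies with input.  */
int ref_popcount_small (uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Fast popcount: fixed-time parallel bit summation.  */
int ref_popcount_fast (uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int) ((x * 0x01010101u) >> 24);
}

/* Parity: 1 if an odd number of bits is set.  */
int ref_parity (uint32_t x)
{
    return ref_popcount_small (x) & 1;
}

/* Compare the models against the compiler builtins for a nonzero x;
   returns the number of mismatches.  */
int bitops_mismatches (uint32_t x)
{
    int bad = 0;
    assert (x != 0);   /* clz/ctz builtins are undefined for 0 */
    bad += ref_clz (x) != __builtin_clz (x);
    bad += ref_ctz (x) != __builtin_ctz (x);
    bad += ref_ffs (x) != __builtin_ffs ((int) x);
    bad += ref_popcount_small (x) != __builtin_popcount (x);
    bad += ref_popcount_fast (x) != __builtin_popcount (x);
    bad += ref_parity (x) != __builtin_parity (x);
    return bad;
}
```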

Copyright assignment is on file with the FSF.

Thanks,
Daniel Engel


[1] // Test program for size comparison

    extern int main (void)
    {
        volatile int x = 1;
        volatile unsigned long long int y = 10;
        volatile long long int z = x / y; // 64-bit division

        volatile float a = x; // 32-bit casting
        volatile float b = y; // 64-bit casting
        volatile float c = z / b; // float division
        volatile float d = a + c; // float addition
        volatile float e = c * b; // float multiplication
        volatile float f = d - e - c; // float subtraction

        if (f != c) // float comparison
            y -= (long long int)d; // float casting
    }

[2] http://danielengel.com/cm0_test_vectors.tgz
[3] http://www.netlib.org/fp/ucbtest.tgz
[4] http://www.jhauser.us/arithmetic/TestFloat.html
[5] http://win-www.uia.ac.be/u/cant/ieeecc754.html
