[PING] Re: [PATCH v7 00/34] libgcc: Thumb-1 Floating-Point Assembly for Cortex M0

Daniel Engel Tue, 15 Nov 2022 07:26:06 -0800

Hello, 

Is there still any interest in merging this patch?


Thanks,
Daniel


On Mon, Oct 31, 2022, at 8:44 AM, Daniel Engel wrote:
> Hi Richard,
>
> I am re-submitting my libgcc patch from 2021:
>
>     https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563585.html
>     https://gcc.gnu.org/pipermail/gcc-patches/2021-December/587383.html
>
> I believe I have finally made the stage1 window. 
>
> Regards,
> Daniel
>
> ---
>
> Changes since v6:
>
>     * Rebased and tested with gcc-13
>
> There are no regressions for -march={armv4t,armv6s-m,armv7-m,armv7-a}.
> Clean master:
>
>     # of expected passes            529397
>     # of unexpected failures        41160
>     # of unexpected successes       12
>     # of expected failures          3442
>     # of unresolved testcases       978
>     # of unsupported tests          28993
>
> Patched master:
>
>     # of expected passes            529397
>     # of unexpected failures        41160
>     # of unexpected successes       12
>     # of expected failures          3442
>     # of unresolved testcases       978
>     # of unsupported tests          28993
>
> ---
>
> This patch series adds an assembly-language implementation of IEEE-754
> compliant single-precision functions designed for the Cortex M0 (v6m)
> architecture.  There are improvements to most of the EABI integer functions
> as well.  This is the libgcc component of a larger library project
> originally proposed in 2018:
>
>     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
>
> As one point of comparison, a test program [1] links 916 bytes from libgcc
> with the patched toolchain vs 10276 bytes with the
> gcc-arm-none-eabi-9-2020-q2 toolchain.  That's a size reduction of about 91%.
>
> I have extensive test vectors [2], and this patch passes all tests on an
> STM32F051.  These vectors were derived from UCB [3], Testfloat [4], and
> IEEECC754 [5], plus many of my own generation.
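
As a rough illustration of what a "<= 0.5 ulp" check can look like, the sketch below compares single-precision addition and multiplication against a reference computed in double and rounded once to float. It is not taken from the test vectors in [2]; the helper names (float_bits, same_float, check_fadd, check_fmul, run_float_checks) are hypothetical. It assumes IEEE-754 binary32/binary64 types with round-to-nearest, in which case the double result of these two operations rounds to the correctly rounded float result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bit pattern of a float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: bit-exact match, except that any NaN matches any NaN. */
static int same_float(float got, float want)
{
    if (got != got || want != want)         /* x != x is true only for NaN */
        return got != got && want != want;
    return float_bits(got) == float_bits(want);
}

/* Returns 1 when a + b, computed in float, equals the reference.
   On a soft-float target the float addition is a library call. */
int check_fadd(float a, float b)
{
    volatile float got = a + b;             /* volatile forces rounding to float */
    float want = (float)((double)a + (double)b);
    return same_float(got, want);
}

/* Same idea for multiplication: the product of two floats is exact in double. */
int check_fmul(float a, float b)
{
    volatile float got = a * b;
    float want = (float)((double)a * (double)b);
    return same_float(got, want);
}

/* A few sample inputs; a real suite would use many more vectors. */
void run_float_checks(void)
{
    assert(check_fadd(1.0f, 1.0f));
    assert(check_fadd(16777216.0f, 1.0f));  /* exact tie, rounds to even */
    assert(check_fmul(3.0f, 0.1f));
}
```

On a Cortex M0 build the reference side would itself go through the double-precision soft-float routines, so a check like this is easier to trust when the reference values are generated on a host and stored as vectors.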
>
> There may be some follow-on projects worth discussing:
>
>     * The library is currently integrated into the ARM v6s-m multilib only.
>     It is likely that some other architectures would benefit from these
>     routines.  However, I have NOT profiled the existing implementations
>     (ieee754-sf.S) to estimate where improvements may be found.
>
>     * GCC currently lacks tests for some functions, such as
>     __aeabi_[u]ldivmod().  There may be useful bits in [1] that can be
>     integrated.
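
For the second item, a test for 64-bit division could be sketched along the following lines. This is only an illustration, not part of the patch or of the GCC testsuite; the function names are hypothetical. It relies on the assumption that, on ARM EABI targets without a hardware divider, the compiler typically lowers 64-bit `/` and `%` to calls to the __aeabi_[u]ldivmod() helpers, so checking the division identity in plain C exercises them indirectly.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned check: requires d != 0.
   Verifies n == q * d + r with 0 <= r < d. */
int check_udivmod64(uint64_t n, uint64_t d)
{
    uint64_t q = n / d;
    uint64_t r = n % d;
    return r < d && q * d + r == n;
}

/* Signed check: requires d != 0 and not (n == INT64_MIN && d == -1).
   C division truncates toward zero, so a nonzero remainder takes the
   sign of the dividend and its magnitude is smaller than |d|. */
int check_divmod64(int64_t n, int64_t d)
{
    int64_t q = n / d;
    int64_t r = n % d;
    /* Magnitudes are computed in unsigned arithmetic so that INT64_MIN is safe. */
    uint64_t abs_r = r < 0 ? 0 - (uint64_t)r : (uint64_t)r;
    uint64_t abs_d = d < 0 ? 0 - (uint64_t)d : (uint64_t)d;
    int sign_ok = (r == 0) || ((r < 0) == (n < 0));
    return sign_ok && abs_r < abs_d && q * d + r == n;
}

/* A few sample inputs; a real test would sweep edge cases and random values. */
void run_divmod_checks(void)
{
    assert(check_udivmod64(10u, 3u));
    assert(check_udivmod64(UINT64_MAX, 1u));
    assert(check_divmod64(-7, 2));
    assert(check_divmod64(INT64_MIN, 3));
}
```

The inputs would normally be read through volatile variables, as in [1], so that the compiler cannot fold the divisions at build time.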
>
> On Cortex M0, the library has (approximately) the following properties:
>
> Function(s)                     Size (bytes)        Cycles              Stack   Accuracy
> __clzsi2                        50                  20                  0       exact
> __clzsi2 (OPTIMIZE_SIZE)        22                  51                  0       exact
> __clzdi2                        8+__clzsi2          4+__clzsi2          0       exact
>
> __clrsbsi2                      8+__clzsi2          6+__clzsi2          0       exact
> __clrsbdi2                      18+__clzsi2         (8..10)+__clzsi2    0       exact
>
> __ctzsi2                        52                  21                  0       exact
> __ctzsi2 (OPTIMIZE_SIZE)        24                  52                  0       exact
> __ctzdi2                        8+__ctzsi2          5+__ctzsi2          0       exact
>
> __ffssi2                        8                   6..(5+__ctzsi2)     0       exact
> __ffsdi2                        14+__ctzsi2         9..(8+__ctzsi2)     0       exact
>
> __popcountsi2                   52                  25                  0       exact
> __popcountsi2 (OPTIMIZE_SIZE)   14                  9..201              0       exact
> __popcountdi2                   34+__popcountsi2    46                  0       exact
> __popcountdi2 (OPTIMIZE_SIZE)   12+__popcountsi2    17..401             0       exact
>
> __paritysi2                     24                  14                  0       exact
> __paritysi2 (OPTIMIZE_SIZE)     16                  38                  0       exact
> __paritydi2                     2+__paritysi2       1+__paritysi2       0       exact
>
> __umulsidi3                     44                  24                  0       exact
> __mulsidi3                      30+__umulsidi3      24+__umulsidi3      8       exact
> __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3       0       exact
> __ashldi3 (__aeabi_llsl)        22                  13                  0       exact
> __lshrdi3 (__aeabi_llsr)        22                  13                  0       exact
> __ashrdi3 (__aeabi_lasr)        22                  13                  0       exact
>
> __aeabi_lcmp                    20                  13                  0       exact
> __aeabi_ulcmp                   16                  10                  0       exact
>
> __udivsi3 (__aeabi_uidiv)       56                  72..385             0       < 1 lsb
> __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3        8       < 1 lsb
> __udivdi3 (__aeabi_uldiv)       164                 103..1394           16      < 1 lsb
> __udivdi3 (OPTIMIZE_SIZE)       142                 120..1392           16      < 1 lsb
> __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3        32      < 1 lsb
>
> __shared_float                  178
> __shared_float (OPTIMIZE_SIZE)  154
>
> __addsf3 (__aeabi_fadd)         116+__shared_float  31..76              8       <= 0.5 ulp
> __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74                  8       <= 0.5 ulp
> __subsf3 (__aeabi_fsub)         6+__addsf3          3+__addsf3          8       <= 0.5 ulp
> __aeabi_frsub                   8+__addsf3          6+__addsf3          8       <= 0.5 ulp
> __mulsf3 (__aeabi_fmul)         112+__shared_float  73..97              8       <= 0.5 ulp
> __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93                  8       <= 0.5 ulp
> __divsf3 (__aeabi_fdiv)         132+__shared_float  83..361             8       <= 0.5 ulp
> __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263..359            8       <= 0.5 ulp
>
> __cmpsf2/__lesf2/__ltsf2        72                  33                  0       exact
> __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2          0       exact
> __gesf2/__gtsf2                 4+__cmpsf2          3+__cmpsf2          0       exact
> __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2          0       exact
> __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2          0       exact
> __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2          0       exact
> __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2          0       exact
> __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2          0       exact
> __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2          0       exact
>
> __floatundisf (__aeabi_ul2f)    14+__shared_float   40..81              8       <= 0.5 ulp
> __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40..237             8       <= 0.5 ulp
> __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf     8       <= 0.5 ulp
> __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf     8       <= 0.5 ulp
> __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf       8       <= 0.5 ulp
>
> __fixsfdi (__aeabi_f2lz)        74                  27..33              0       exact
> __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi         0       exact
> __fixsfsi (__aeabi_f2iz)        52                  19                  0       exact
> __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi         0       exact
> __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi         0       exact
>
> __extendsfdf2 (__aeabi_f2d)     42+__shared_float   38                  8       exact
> __truncsfdf2 (__aeabi_f2d)      88                  34                  8       exact
> __aeabi_d2f                     56+__shared_float   54..58              8       <= 0.5 ulp
> __aeabi_h2f                     34+__shared_float   34                  8       exact
> __aeabi_f2h                     84                  23..34              0       <= 0.5 ulp
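
The table pairs several functions with an OPTIMIZE_SIZE variant that is smaller but slower, sometimes with a data-dependent cycle count (for example __popcountsi2). The C sketch below illustrates that general kind of trade-off with two well-known population-count techniques. It is not taken from the patch, and the assembly routines may use different algorithms; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compact variant: clears the lowest set bit on each pass, so the loop
   runs once per set bit and the run time depends on the input. */
unsigned popcount_loop(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;     /* drop the lowest set bit */
        n++;
    }
    return n;
}

/* Larger variant: sums bits in parallel within the word, so the
   instruction count is the same for every input. */
unsigned popcount_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                    /* add the four bytes */
}

/* Both variants must agree on every input; a few samples are checked here. */
void run_popcount_checks(void)
{
    assert(popcount_loop(0u) == 0 && popcount_swar(0u) == 0);
    assert(popcount_loop(0xFFFFFFFFu) == 32);
    assert(popcount_swar(0xFFFFFFFFu) == 32);
    assert(popcount_loop(0x12345678u) == popcount_swar(0x12345678u));
}
```

Which variant is preferable depends on whether flash size or worst-case latency matters more for the application, which is the choice the OPTIMIZE_SIZE rows expose.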
>
> Copyright assignment is on file with the FSF.
>
> Thanks,
> Daniel Engel
>
>
> [1] // Test program for size comparison
>
>     extern int main (void)
>     {
>         volatile int x = 1;
>         volatile unsigned long long int y = 10;
>         volatile long long int z = x / y; // 64-bit division
>
>         volatile float a = x; // 32-bit casting
>         volatile float b = y; // 64 bit casting
>         volatile float c = z / b; // float division
>         volatile float d = a + c; // float addition
>         volatile float e = c * b; // float multiplication
>         volatile float f = d - e - c; // float subtraction
>
>         if (f != c) // float comparison
>             y -= (long long int)d; // float casting
>     }
>
> [2] http://danielengel.com/cm0_test_vectors.tgz
> [3] http://www.netlib.org/fp/ucbtest.tgz
> [4] http://www.jhauser.us/arithmetic/TestFloat.html
> [5] http://win-www.uia.ac.be/u/cant/ieeecc754.html
>
> ---
>
> Daniel Engel (34):
>   Add and restructure function declaration macros
>   Rename THUMB_FUNC_START to THUMB_FUNC_ENTRY
>   Fix syntax warnings on conditional instructions
>   Reorganize LIB1ASMFUNCS object wrapper macros
>   Add the __HAVE_FEATURE_IT and IT() macros
>   Refactor 'clz' functions into a new file
>   Refactor 'ctz' functions into a new file
>   Refactor 64-bit shift functions into a new file
>   Import 'clz' functions from the CM0 library
>   Import 'ctz' functions from the CM0 library
>   Import 64-bit shift functions from the CM0 library
>   Import 'clrsb' functions from the CM0 library
>   Import 'ffs' functions from the CM0 library
>   Import 'parity' functions from the CM0 library
>   Import 'popcnt' functions from the CM0 library
>   Refactor Thumb-1 64-bit comparison into a new file
>   Import 64-bit comparison from CM0 library
>   Merge Thumb-2 optimizations for 64-bit comparison
>   Import 32-bit division from the CM0 library
>   Refactor Thumb-1 64-bit division into a new file
>   Import 64-bit division from the CM0 library
>   Import integer multiplication from the CM0 library
>   Refactor Thumb-1 float comparison into a new file
>   Import float comparison from the CM0 library
>   Refactor Thumb-1 float subtraction into a new file
>   Import float addition and subtraction from the CM0 library
>   Import float multiplication from the CM0 library
>   Import float division from the CM0 library
>   Import integer-to-float conversion from the CM0 library
>   Import float-to-integer conversion from the CM0 library
>   Import float<->double conversion from the CM0 library
>   Import float<->__fp16 conversion from the CM0 library
>   Drop single-precision Thumb-1 soft-float functions
>   Add -mpure-code support to the CM0 functions.
>
>  libgcc/Makefile.in              |   5 +-
>  libgcc/config/arm/bpabi-lib.h   |  12 -
>  libgcc/config/arm/bpabi-v6m.S   | 206 -----------
>  libgcc/config/arm/bpabi.S       |  42 ---
>  libgcc/config/arm/bpabi.c       |  42 ---
>  libgcc/config/arm/clz2.S        | 371 ++++++++++++++++++++
>  libgcc/config/arm/ctz2.S        | 349 ++++++++++++++++++
>  libgcc/config/arm/eabi/fadd.S   | 324 +++++++++++++++++
>  libgcc/config/arm/eabi/fcast.S  | 533 ++++++++++++++++++++++++++++
>  libgcc/config/arm/eabi/fcmp.S   | 604 ++++++++++++++++++++++++++++++++
>  libgcc/config/arm/eabi/fdiv.S   | 261 ++++++++++++++
>  libgcc/config/arm/eabi/ffixed.S | 414 ++++++++++++++++++++++
>  libgcc/config/arm/eabi/ffloat.S | 247 +++++++++++++
>  libgcc/config/arm/eabi/fmul.S   | 215 ++++++++++++
>  libgcc/config/arm/eabi/fneg.S   |  76 ++++
>  libgcc/config/arm/eabi/fplib.h  |  80 +++++
>  libgcc/config/arm/eabi/futil.S  | 418 ++++++++++++++++++++++
>  libgcc/config/arm/eabi/idiv.S   | 299 ++++++++++++++++
>  libgcc/config/arm/eabi/lcmp.S   | 187 ++++++++++
>  libgcc/config/arm/eabi/ldiv.S   | 493 ++++++++++++++++++++++++++
>  libgcc/config/arm/eabi/lmul.S   | 218 ++++++++++++
>  libgcc/config/arm/eabi/lshift.S | 241 +++++++++++++
>  libgcc/config/arm/fp16.c        |   4 +
>  libgcc/config/arm/lib1funcs.S   | 549 ++++++++++-------------------
>  libgcc/config/arm/parity.S      | 120 +++++++
>  libgcc/config/arm/popcnt.S      | 212 +++++++++++
>  libgcc/config/arm/t-bpabi       |  10 +-
>  libgcc/config/arm/t-elf         | 138 +++++++-
>  libgcc/config/arm/t-softfp      |   2 +
>  29 files changed, 5997 insertions(+), 675 deletions(-)
>  delete mode 100644 libgcc/config/arm/bpabi.c
>  create mode 100644 libgcc/config/arm/clz2.S
>  create mode 100644 libgcc/config/arm/ctz2.S
>  create mode 100644 libgcc/config/arm/eabi/fadd.S
>  create mode 100644 libgcc/config/arm/eabi/fcast.S
>  create mode 100644 libgcc/config/arm/eabi/fcmp.S
>  create mode 100644 libgcc/config/arm/eabi/fdiv.S
>  create mode 100644 libgcc/config/arm/eabi/ffixed.S
>  create mode 100644 libgcc/config/arm/eabi/ffloat.S
>  create mode 100644 libgcc/config/arm/eabi/fmul.S
>  create mode 100644 libgcc/config/arm/eabi/fneg.S
>  create mode 100644 libgcc/config/arm/eabi/fplib.h
>  create mode 100644 libgcc/config/arm/eabi/futil.S
>  create mode 100644 libgcc/config/arm/eabi/idiv.S
>  create mode 100644 libgcc/config/arm/eabi/lcmp.S
>  create mode 100644 libgcc/config/arm/eabi/ldiv.S
>  create mode 100644 libgcc/config/arm/eabi/lmul.S
>  create mode 100644 libgcc/config/arm/eabi/lshift.S
>  create mode 100644 libgcc/config/arm/parity.S
>  create mode 100644 libgcc/config/arm/popcnt.S
>
> -- 
> 2.34.1
