Hi,

The existing sincos functions use two pointers to return the sine and cosine results. In most cases four memory accesses are necessary per call. This is inefficient and often significantly slower than returning the values in registers.

I ran a few experiments on the new optimized sincosf implementation in GLIBC using the following interface:

__complex__ float sincosf2 (float);

This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for random inputs in the range +-PI/4. Larger inputs take longer and thus show lower gains, but there is still a 5% gain on the (rarely used) path with full range reduction. Given that sincos is used in various HPC applications, this can give a worthwhile speedup.

LLVM already supports something similar for OSX using a struct of 2 floats. Using complex float is better, since not all targets may support returning structures in floating point registers, and GCC generates very inefficient code on targets that do (PR86145).

What do people think? Ideally I'd like to support this in a generic way so that all targets can benefit, but it's also feasible to enable it on a per-target basis. Also, since not all libraries will support the new interface, there would have to be a flag or configure option to switch the new interface off if it is not supported (maybe automatically, based on the math.h header).

Wilco