Hi,

The existing sincos functions use 2 pointers to return the sine and cosine
results. In most cases 4 memory accesses are necessary per call (two stores
in the callee and two loads in the caller). This is inefficient and often
significantly slower than returning values in registers. I ran a few
experiments on the new optimized sincosf implementation in GLIBC using the
following interface:

__complex__ float sincosf2 (float);

This has 50% higher throughput and a 25% reduction in latency on Cortex-A72
for random inputs in the range +-PI/4. Larger inputs take longer and thus
have lower gains, but there is still a 5% gain on the (rarely used) path
with full range reduction. Given that sincos is used in various HPC
applications, this can give a worthwhile speedup.

LLVM already supports something similar for OSX using a struct of 2 floats.
Using complex float is better since not all targets may support returning
structures in floating point registers and GCC generates very inefficient
code on targets that do (PR86145).

What do people think? Ideally I'd like to support this in a generic way so
all targets can benefit, but it's also feasible to enable it on a
per-target basis. Also, since not all libraries will support the new
interface, there would have to be a flag or configure option to switch the
new interface off if not supported (maybe automatically based on the math.h
header).

Wilco
