On Wed, Jul 5, 2023 at 5:14 PM Sylvain Noiry via Gcc <[email protected]> wrote:
>
> Hi,
>
> My name is Sylvain, I am an intern at Kalray and I work on improving the GCC
> backend for the KVX target. The KVX ISA has dedicated instructions for the
> handling of complex numbers, which cannot be selected by GCC due to how
> complex numbers are handled internally. My goal is to make GCC able to
> expose to machine description files new patterns dealing with complex
> numbers. I already have a proof of concept which can increase performance
> even on other backends like x86 if the new patterns are implemented.
>
> My approach is to prevent the lowering of complex operations when the backend
> can handle them natively, and to work directly on complex modes (SC, DC, CDI,
> CSI, CHI, CQI). The cplxlower pass looks for supported optabs related to
> complex numbers and uses them directly. Another advantage is that native
> operations can now go through all GIMPLE passes and preserve most
> optimisations like FMA generation.
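> 
> As an illustration of the kind of code this affects (the function name and the
> use of single-precision _Complex float are only for this sketch), without
> native patterns the multiplication below may be expanded as a call to __mulsc3
> (as in the FFT results further down); with the lowering prevented, the whole
> expression keeps its complex type through the GIMPLE passes:
> 
> /* Illustrative complex multiply-accumulate: 'acc + a * b' stays a
>    complex-mode operation instead of being split into real/imag scalars.  */
> _Complex float cmadd(_Complex float acc, _Complex float a, _Complex float b)
> {
>   return acc + a * b;
> }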
I'll note that complex lowering takes advantage of complex numbers with
known-zero real or imaginary parts; I suppose you are preserving some
of these optimizations and only prevent lowering of ops we natively
support. I think that's a reasonable thing, and I agree the
standard optabs should be used with complex modes.
> Vectorization is also preserved with native complex operands, although some
> functions had to be updated. Because vectorization assumes that inner elements
> are scalars, and a complex value cannot be considered a scalar, some functions
> which only take scalars have been adapted or duplicated to handle complex
> elements.
I don't quite understand whether you end up with vectors with complex
components or vectors with twice the number of scalar elements,
implicitly representing interleaved real/imag parts. Can you clarify?
We've recently had discussions around this and agreed we don't want
vector modes with complex component modes.
> I've also changed the representation of complex numbers during the expand
> pass. READ_COMPLEX_PART and WRITE_COMPLEX_PART have been transformed into
> target hooks, and a new hook GEN_RTX_COMPLEX allows each backend to choose
> its preferred complex representation in RTL. The default one uses CONCAT as
> before, but the KVX backend uses registers with a complex mode containing
> both the real and imaginary parts.
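> 
> As an illustration (the function names are only for this sketch), these hooks
> come into play whenever the expander has to access one part of a complex
> value:
> 
> /* At expand time, the access to the real part below goes through the
>    READ_COMPLEX_PART hook, while building the two parts of the return
>    value of make_complex goes through WRITE_COMPLEX_PART.  */
> float real_part(_Complex float z)
> {
>   return __real__ z;
> }
> 
> _Complex float make_complex(float re, float im)
> {
>   return re + im * 1.0fi;   /* GNU imaginary constant */
> }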
>
> Now each backend can add its own native complex operations with patterns in
> its machine description. The following example implements a complex
> multiplication with mode SC on the KVX backend:
> (define_insn "mulsc3"
> [(set (match_operand:SC 0 "register_operand" "=r")
> (mult:SC (match_operand:SC 1 "register_operand" "r")
> (match_operand:SC 2 "register_operand" "r")))]
> ""
> "fmulwc %0 = %1, %2"
> [(set_attr "type" "mau_fpu")]
> )
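> 
> For reference, a C function this pattern is meant to cover (purely
> illustrative); with the native pattern the body maps to a single fmulwc
> instead of a __mulsc3 call:
> 
> _Complex float mul(_Complex float a, _Complex float b)
> {
>   return a * b;
> }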
>
> The main patch affects around 1400 lines of generic code, mostly located in
> expr.cc and tree-complex.cc. These are mainly additions or the result of the
> move of READ_COMPLEX_PART and WRITE_COMPLEX_PART from expr.cc to target hooks.
>
> I know that ARM developers have added partial support for complex
> instructions. However, since they operate during vectorization and promote
> operations on vectors of floating-point numbers that look like operations on
> (vectors of) complex numbers, their approach misses simple cases. At that
> point they create operations working on vectors of floating-point numbers
> which are caught by dedicated define_expands later. On the other hand, our
> approach propagates complex numbers through the whole middle-end, so we have
> an easier time recombining the operations and recognizing what ARM does. Some
> choices will be needed to merge our two approaches, although I've already
> reused their work on complex rotations in my implementation.
>
> Results:
>
> I have tested my implementation on multiple code samples, as well as a few
> FFTs. On a simple in-place radix-2 FFT with precomputed twiddle factors
> (2 complex multiplications, 1 addition, and 1 subtraction per loop iteration),
> the compute time has been divided by 3 when compiling with -O3 (because calls
> to __mulsc3 are replaced by native instructions) and shortened by 20% with
> -ffast-math. In both cases, the achieved performance level is now on par with
> another version coded using intrinsics. These improvements do not come
> exclusively from the newly generated hardware instructions: replacing CONCATs
> with complex-mode registers also prevents GCC from generating instructions to
> extract the real and imaginary parts into their own registers and recombine
> them later.
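> 
> For context, a sketch of the butterfly loop being measured (names and layout
> are only for this sketch; the real benchmark may differ): two complex
> multiplications, one addition and one subtraction per iteration.
> 
> /* Illustrative in-place radix-2 butterfly with precomputed twiddle
>    factors: 2 complex mult, 1 add, 1 sub per iteration.  */
> void butterfly(_Complex float *x, const _Complex float *w, int half)
> {
>   for (int i = 0; i < half; i++)
>     {
>       _Complex float a = x[i] * w[i];
>       _Complex float b = x[i + half] * w[i];
>       x[i]        = a + b;
>       x[i + half] = a - b;
>     }
> }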
>
> This new approach can also bring a performance uplift to other backends. I
> have tried to reuse the same complex representation in RTL as KVX for x86,
> along with a few patterns. Although I still get useless moves on large
> programs, simple examples like the one below already show a performance
> uplift.
>
> _Complex float add(_Complex float a, _Complex float b)
> {
>   return a + b;
> }
Yeah, the splitting doesn't help the bad job we do with parameter and
return value expansion, and your change likely skirts that issue.
Vectorizing would likely fix it in a similar way.
> Using "-O2" the assembly produced is now on paar with llvm and looks like :
>
> add:
>         addps   %xmm1, %xmm0
>         ret
>
> Choices to be done:
> - Currently, ARM uses optabs which start with "c", like "cmul", to distinguish
> between real floating-point numbers and complex numbers. Since we keep the
> complex modes, this could simply be done with mul<mode>.
> - Currently the parser does some early optimizations and lowering that could
> be moved into the cplxlower pass. For example, I've changed a bit how complex
> rotations by 90° and 270°, which are recognized in fold-const.cc, are
> processed. A call to a new COMPLEX_ROT90/270 internal function is now
> inserted, which is then either lowered or kept in the cplxlower pass. Finally,
> the widening_mul pass can generate COMPLEX_ADD_ROT90/270 internal functions,
> which are expanded using the cadd90/270 optabs; otherwise COMPLEX_ROT90/270
> are expanded using the new crot90/270 optabs (see the first snippet after
> this list).
> - Currently, we have to duplicate preferred_simd_mode since it only accepts
> scalar modes. If we unify enough, we could have a new type that would be a
> union of scalar_mode and complex_mode, but we did not do it since it would
> incur many modifications.
> - Declaring complex vectors through attribute directives would be a new C
> extension (clang does not support it either).
> - The KVX ISA supports some fused conjugate operations (e.g. a + conjf(b)),
> which are caught directly in the combine pass if the corresponding pattern is
> present in the backend. This solution is simple, but they could also be caught
> in the middle-end, like FMAs (see the second snippet after this list).
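> 
> Two minimal C snippets for the rotation and fused-conjugate items above
> (purely illustrative; the names are only for this sketch):
> 
> #include <complex.h>
> 
> /* Rotation by 90°, i.e. multiplication by I: the kind of form that can be
>    turned into a COMPLEX_ROT90 internal function call and expanded via the
>    crot90<mode> optab.  */
> _Complex float rot90(_Complex float a)
> {
>   return a * 1.0fi;   /* (x + I*y) * I == -y + I*x */
> }
> 
> /* Fused conjugate operation: with the corresponding pattern in the backend,
>    the combine pass can match the addition and the conjugate together.  */
> _Complex float add_conj(_Complex float a, _Complex float b)
> {
>   return a + conjf(b);
> }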
>
> Currently supported patterns:
> - all basic arithmetic operations for scalar and vector complex modes (add,
> mul, neg, ...)
> - conj<mode> for the conjugate operation, using a new conj_optab
> - crot90<mode>/crot270<mode> for complex rotations, using new optabs
>
> I would like to have your opinion on my approach. I can send you the patch if
> you want.
It would be nice if you could split the patch into a series of changes.
Thanks,
Richard.
> Best regards,
>
> Sylvain Noiry
>