https://bugs.kde.org/show_bug.cgi?id=385411

--- Comment #14 from Julian Seward <jsew...@acm.org> ---
(In reply to Andreas Arnez from comment #9)
> For the record, Julian Seward commented the following in IRC:
> 
> * Regarding the fused multiply-add/subs:
> 
> "I think the *right* fix here is to create new Iops, Iop_MAddF64x2 and
> Iop_MSubF64x2 and use those instead (see libvex_ir.h, Iop_MAddF64 for
> description) or (ugly slow hack) in s390_vector_fp_mulAddOrSub_operation,
> for the vector case, split up each operand into 2 64-bit scalars, use
> the existing scalar FMA operations, and reconstruct the vector (it will
> generate worse code, but it is less effort to implement because you don't
> need to change any code-generator stuff)."

Regarding the split/scalar-op/reconstruct strategy: that's probably not
too bad for dealing with 64-bit lanes in a 128-bit vector.  But as the
vector width grows (next year, 256-bit vectors in s390, maybe?) and the
lane size gets smaller (F32 ops, I16 ops, etc), this begins to generate
terribly slow and verbose code compared to the "right" fix.  I mention
this because, in the amd64 (x86_64) front end, this problem has now become
extreme -- eg, a 256-bit vector instruction comprising 8 x (convert i32 to
f32) easily becomes 200 instructions at the back end.

So investing in the "right" fix might be necessary at some point.
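
To make the scaling concern concrete, here is a rough model of the
split/scalar-op/reconstruct approach in plain C.  It is only a sketch:
the V256_* structs and both function names are hypothetical stand-ins
invented for illustration, not VEX types or IR, and it says nothing
about the real instruction counts at the back end.  It just shows that
every lane needs its own extract, scalar op and reinsert.

```c
#include <stdint.h>

/* Hypothetical stand-ins for 256-bit vector values; NOT VEX types. */
typedef struct { int32_t lane[8]; } V256_I32x8;
typedef struct { float   lane[8]; } V256_F32x8;

/* Split/scalar-op/reconstruct for 8 x (convert i32 to f32):
   each lane is pulled out, converted by a scalar operation, and
   written back into the result vector. */
static V256_F32x8 i32_to_f32_x8_split(V256_I32x8 v)
{
    V256_F32x8 r;
    for (int i = 0; i < 8; i++) {
        int32_t s = v.lane[i];   /* extract one lane      */
        float   f = (float)s;    /* scalar conversion     */
        r.lane[i] = f;           /* reinsert into result  */
    }
    return r;
}

/* Number of per-lane scalar operations the split approach needs
   for a given vector width and lane size, both in bits. */
static unsigned lanes_per_vector(unsigned vector_bits, unsigned lane_bits)
{
    return vector_bits / lane_bits;
}
```

In this model, F64 lanes in a 128-bit vector need 2 scalar ops, F32
lanes in a 256-bit vector need 8, and I16 lanes in a 256-bit vector
need 16, each with its own extract and reinsert around it.  A single
vector Iop would avoid that per-lane overhead.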
