https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117722
--- Comment #16 from Vineet Gupta <vineetg at gcc dot gnu.org> ---
(In reply to Robin Dapp from comment #15)
> (In reply to Vineet Gupta from comment #14)
> > @Robin, it seems the current codegen generates 2 widening ops, which might
> > not be as efficient. We have done some profiling of widening add throughput
> > and Edwin's data tells me that the throughput might not be the same.
>
> Hmm, would you ever want the widening ops if the throughput is worse then?
> I.e. if you had a throughput of 2 for simple adds and zexts but 1 for vwadd
> could you not disable them altogether if they "clog" the pipeline?

Right. We need to experiment some more and see how it plays out on real hw.
But the point here is that we don't need the widening semantics, let alone
twice. The min+max+sub in loops with a final reducing sum should do the trick.

I'm just going by the data Edwin generated by running microprobes on BPI3 (for
back-to-back ops). I don't think he has posted that to the public portal yet
[1].

[1] https://github.com/ewlu/bp3-microarch
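
For reference, a minimal scalar sketch of the two formulations being compared, assuming the loop in question is a sum of absolute differences over unsigned 8-bit elements. The function names sad_widening and sad_minmax are hypothetical and this is not the testcase from the PR; it only illustrates that |a - b| equals max(a, b) - min(a, b), so the difference itself fits in the narrow type and only the final accumulation needs a wider one.

```c
#include <stddef.h>
#include <stdint.h>

/* Widening form: both operands are extended before the subtract,
   so the absolute difference is computed in a wider type.  */
uint32_t
sad_widening (const uint8_t *a, const uint8_t *b, size_t n)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++)
    {
      int d = (int) a[i] - (int) b[i];
      sum += (uint32_t) (d < 0 ? -d : d);
    }
  return sum;
}

/* min+max+sub form: max(a, b) - min(a, b) never underflows, so the
   difference stays in 8 bits; only the reduction sum is wider.  */
uint32_t
sad_minmax (const uint8_t *a, const uint8_t *b, size_t n)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint8_t lo = a[i] < b[i] ? a[i] : b[i];
      uint8_t hi = a[i] < b[i] ? b[i] : a[i];
      uint8_t d = (uint8_t) (hi - lo);
      sum += d;
    }
  return sum;
}
```

Both functions return the same value for any input; which one vectorizes better on a given core is exactly the open question above and is not something this sketch settles.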