https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65456
Bill Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |wschmidt at gcc dot gnu.org

--- Comment #9 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
So, this can be viewed either as a phase-ordering problem or an expand problem; probably the latter is more correct. The swap optimization runs early in the RTL phases, shortly after expand. At the time that it sees this computation, the RTL represents something more like this:

        lxvd2x 0,9,4
        xxpermdi 12,0,0,2
        [r18=high half of vs12]
        [r19=low half of vs12]
        std 18,0,28
        std 19,8,28

(This is well before RA so I am making up register numbers for illustration purposes.)

The swap optimization doesn't know what to do when a vector is split into pieces, so it punts here. Later, in the split2 phase that runs after RA, the last four lines above are recognized as a pattern that can be replaced by an stxvd2x in BE mode, or by an xxswapd/stxvd2x in LE mode. This is how we end up with the code you see in the final output.

The question is, why is the expander generating the two doubleword stores? Probably because it thinks that we are ok generating unaligned doubleword stores, but not ok generating unaligned quadword stores. In other words, there is probably something in there that needs to be taught that unaligned vector stores on P8 are better to use than moving pieces into GPRs and storing them separately.

I will take on the investigation of this, but there are a few more urgent things that need attention first. I expect this to be a fairly easy fix that we'll be able to backport.
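For readers following along, here is a minimal sketch in portable C of the two store strategies discussed above. It is not the test case from this bug and not GCC code: the helper names (store_two_doublewords, store_one_quadword, stores_agree) are made up for illustration, and the model deliberately ignores the LE element swap. It only shows that writing a 16-byte value as two 8-byte pieces and writing it as a single 16-byte store leave the same bytes at a misaligned address, which is why the later pattern replacement is legal.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 16-byte value as two 8-byte pieces,
   loosely analogous to the pair of std instructions in the listing. */
void store_two_doublewords(unsigned char *dst, const uint64_t v[2])
{
    memcpy(dst, &v[0], sizeof v[0]);      /* first doubleword at offset 0  */
    memcpy(dst + 8, &v[1], sizeof v[1]);  /* second doubleword at offset 8 */
}

/* Hypothetical helper: store the same value with one 16-byte copy,
   loosely analogous to a single unaligned vector store. */
void store_one_quadword(unsigned char *dst, const uint64_t v[2])
{
    memcpy(dst, v, 2 * sizeof v[0]);
}

/* Returns 1 if both strategies write identical bytes to a
   deliberately misaligned destination (buffer base + 3). */
int stores_agree(void)
{
    const uint64_t v[2] = { 0x0011223344556677ULL, 0x8899aabbccddeeffULL };
    unsigned char a[32] = {0};
    unsigned char b[32] = {0};

    store_two_doublewords(a + 3, v);
    store_one_quadword(b + 3, v);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling stores_agree() from a small main is enough to check the equivalence. Whether a compiler emits one vector store or two scalar stores for either helper depends on the target and its alignment cost model, which is the part of the expander that the comment above suspects.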