https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037
--- Comment #8 from ncm at cantrip dot org ---
It seems worth mentioning that the round trip through L1 cache is just a workaround for the optimizer refusing ever to emit two CMOV instructions in a basic block. Recognizing the construct and replacing it with CMOVs explicitly would speed up a great many algorithms, although the L1 excursion remains necessary for the general case of user-defined types.

It also seems worth mentioning that there is no worry over dependency chains in partitioning: once the values are swapped, they are not looked at again until the next pass.
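
For readers without the earlier comments to hand, here is a minimal sketch of the two forms being compared. The names swap_if_indexed, swap_if_select and partition_branchless are made up for illustration and are not taken from any library. The first form takes the round trip through a small array, which normally sits in L1, and works for any copyable type. The second is the scalar-only form that one would hope could become a pair of CMOVs. Whether a given compiler actually emits CMOVs for it is an assumption to check in the generated assembly, not something this sketch guarantees.

```cpp
#include <cstddef>
#include <vector>

// Array-indexed form: both values are stored to a two-element array and
// read back by index, so the selection needs no branch.
// Works for any copyable T, including user-defined types.
template <typename T>
bool swap_if_indexed(bool c, T& a, T& b) {
    T v[2] = { a, b };
    b = v[1 - c];   // c == true: b gets the old a; c == false: b unchanged
    a = v[c];       // c == true: a gets the old b; c == false: a unchanged
    return c;
}

// Select form for a scalar type: two conditional selects in one basic block.
// A compiler may or may not lower these to two CMOV instructions.
inline bool swap_if_select(bool c, int& a, int& b) {
    int ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
}

// Hypothetical Lomuto-style partition built on the conditional swap.
// Elements less than pivot end up in [0, result); the rest follow.
// Each swapped pair is not read again within this pass, so the swaps
// do not form a long dependency chain.
inline std::size_t partition_branchless(std::vector<int>& v, int pivot) {
    std::size_t left = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        left += swap_if_indexed(v[i] < pivot, v[left], v[i]);
    }
    return left;
}
```

Swapping swap_if_select in for swap_if_indexed inside the loop gives the same result for int elements, which makes it easy to compare the code generated for the two forms.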