https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
--- Comment #22 from Tamar Christina <tnfchris at gcc dot gnu.org> ---

Ok, so the problem with the ones on trunk isn't necessarily the canonicalization itself, but that our externals handling is a bit shallow.

On externals we determine that we have no information on the DF and return TOP. This is because DR analysis doesn't try to handle externals, since they're not part of the loop.

However, all we need to know for complex numbers is whether the externals are loaded from the same place, and in what order. Concretely, the loop pre-header is:

  <bb 2> [local count: 10737416]:
  b$real_11 = REALPART_EXPR <b_15(D)>;
  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
  _53 = -b$imag_10;

and the loop body:

  <bb 3> [local count: 1063004408]:
  ...
  _23 = REALPART_EXPR <*_5>;
  _24 = IMAGPART_EXPR <*_5>;
  _27 = _24 * _53;
  _28 = _23 * _53;

codegen before:

  {_24, _23} * { _53, _53 }

and after:

  { _24, _24 } * { _53, b$real_11 }

Before, we were able to easily tell that the order for the multiply would be IMAG, REAL. In the after (GCC 15) case that information is still there, but getting at it requires us to follow the externals.

Richi, what do you think about extending the externals handling in linear_loads_p to follow all external ops and, if they load from the same memref, figure out the "virtual lane permute"?

We could store that info in a new externals cache (to avoid re-walking externals we already walked, since perm_cache stores SLP nodes) and store the permute for the node in perm_cache, as we do for any cached lookups today?

This would also fix the other tests Andrew added in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
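
To make the proposal a bit more concrete, here is a toy model of the idea in Python. It is not GCC code and does not reflect how linear_loads_p or perm_cache are actually implemented: `Def`, `trace_external` and `virtual_lane_permute` are hypothetical names, definitions are modelled as a plain dictionary, and looking through a negation to its operand is an assumption made only for this illustration. It just shows following each external back to a REALPART/IMAGPART of a memref, caching the walk, and reporting the lane order when every lane comes from the same memref.

```python
from dataclasses import dataclass

# Toy stand-in for a GIMPLE definition; not a GCC data structure.
@dataclass(frozen=True)
class Def:
    op: str   # "REALPART", "IMAGPART" or "NEG"
    arg: str  # memref for REALPART/IMAGPART, SSA name for NEG

TOP = None  # "no information", as in the TOP result described above


def trace_external(name, defs, cache):
    """Return (memref, part) for an external SSA name, or TOP if unknown."""
    if name in cache:  # hypothetical externals cache: skip re-walking
        return cache[name]
    d = defs.get(name)
    if d is None:
        result = TOP
    elif d.op == "REALPART":
        result = (d.arg, "REAL")
    elif d.op == "IMAGPART":
        result = (d.arg, "IMAG")
    elif d.op == "NEG":
        # Assumption: a negation does not change which part a lane came from.
        result = trace_external(d.arg, defs, cache)
    else:
        result = TOP
    cache[name] = result
    return result


def virtual_lane_permute(lanes, defs, cache):
    """Return the lane order if all lanes come from one memref, else TOP."""
    traced = [trace_external(n, defs, cache) for n in lanes]
    if any(t is TOP for t in traced):
        return TOP
    if len({memref for memref, _ in traced}) != 1:
        return TOP
    return tuple(part for _, part in traced)


# The pre-header from this comment, transcribed into the toy model.
pre_header = {
    "b$real_11": Def("REALPART", "b_15(D)"),
    "b$imag_10": Def("IMAGPART", "b_15(D)"),
    "_53": Def("NEG", "b$imag_10"),
}
externals_cache = {}
gcc15_operand = virtual_lane_permute(["_53", "b$real_11"], pre_header, externals_cache)
```

In this model the GCC 15 operand { _53, b$real_11 } traces back to the IMAG and REAL parts of b_15(D), in that order, while an external with no known definition still yields TOP.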