https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641
--- Comment #22 from Roger Sayle <roger at nextmovesoftware dot com> ---
I completely agree with Richard that the decision to vectorize or not to vectorize should be made elsewhere, taking the whole function/loop into account. It's quite reasonable to synthesize a slow vector multiply if there's an overall benefit from SLP. What I think is required is that the "baseline" cost should be the cost of moving from the vector mode to a scalar mode, performing the multiplication(s) as a scalar, and moving the result back again. I.e. we're assuming that we're always going to multiply the value in a vector register; we're just choosing the cheapest implementation for it.

For the xxhash.i testcase, I'm seeing DImode multiplications with COSTS_N_INSNS(30) [i.e. a mult_cost of 120]. Even with slow inter-unit moves, it must be possible to do this faster on AArch64. In fact, we'll probably vectorize more in SLP if we have the option of shuffling data back to the scalar multiplier when required. Perhaps even a define_insn_and_split of mulv2di3 could fool the middle-end into thinking we can do this "natively" via an optab.

Note that multipliers used in cryptographic hash functions are sometimes (chosen to be) pathological for synth_mult. Like the design of DES's S-boxes, these are coefficients designed to be slow to implement in software [and faster in custom hardware]: 64-bit values with around 32 (random) bits set. I/we can try to speed up the recursion in synth_mult, and/or increase the size of the hash-table cache [which will help hppa64 and other targets with slow multipliers], but that's perhaps just working around the deeper issue with this PR.
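[Editor's sketch, not part of the original comment: to make the "shuffle data back to the scalar multiplier" idea concrete, here is a minimal C illustration using NEON intrinsics of what a scalarized V2DImode multiply-by-constant looks like at the source level. The helper name is hypothetical; only the intrinsics are real. Each lane costs roughly one general-register move, one scalar mul and one insert back, and the two lane chains can overlap, so even with slow inter-unit moves the latency should be well below a synth_mult sequence priced at COSTS_N_INSNS(30).]

/* Illustrative sketch only: scalarize a V2DImode multiply on AArch64
   by round-tripping each lane through the general registers.  */
#include <arm_neon.h>
#include <stdint.h>

static inline uint64x2_t
mul_v2di_scalar (uint64x2_t v, uint64_t c)
{
  /* Move each 64-bit lane to a general register, multiply there, and
     move the results back (umov/mul/ins per lane), instead of a long
     shift-and-add sequence on the SIMD side.  */
  uint64_t lo = vgetq_lane_u64 (v, 0) * c;
  uint64_t hi = vgetq_lane_u64 (v, 1) * c;
  uint64x2_t res = vdupq_n_u64 (lo);
  return vsetq_lane_u64 (hi, res, 1);
}

[A mulv2di3 define_insn_and_split along the lines suggested above would, presumably, expand to essentially this umov/mul/ins sequence after reload.]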