Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> I'm still a bit sceptical about treating the high-part cost as lower.
>> ISTM that the subreg cases are the ones that are truly “free” and any others
>> should have a normal cost.  So if CSE handled the subreg case itself (to model
>> how the rtx would actually be generated) then aarch64 code would have to
>> do less work.  I imagine that will be true for other targets as well.
>
> I guess the main problem is that CSE lacks context because it's not until after
> combine that the high part becomes truly "free" when pushed into a high
> operation.

Yeah.  And the aarch64 code is just being asked to cost the operation
it's given, which could for example come from an existing
aarch64_simd_mov_from_<mode>high.  I think we should try to ensure that
an aarch64_simd_mov_from_<mode>high followed by some arithmetic on the
result is more expensive than the fused operation (when fusing is
possible).

An analogy might be: if the cost code is given:

  (add (reg X) (reg Y))

then, at some later point, the (reg X) might be replaced with a
multiplication, in which case we'd have a MADD operation and the
addition is effectively free.  Something similar would happen if
(reg X) became a shift by a small amount on newer cores, although I
guess then you could argue either that the cost of the add disappears or
that the cost of the shift disappears.

But we shouldn't count ADD as free on the basis that it could be
combined with a multiplication or shift in future.  We have to cost what
we're given.  I think the same thing applies to the high part.

Here we're trying to prevent cse1 from replacing a DUP (lane) with
a MOVI by saying that the DUP is strictly cheaper than the MOVI.
I don't think that's really true though, and the cost tables in the
patch say that DUP is more expensive (rather than less expensive)
than MOVI.

Also, if I've understood correctly, it looks like we'd be relying on
the vget_high of a constant remaining unfolded until RTL cse1.
I think it's likely in future that we'd try to fold vget_high
at the gimple level instead, since that could expose more optimisations
of a different kind.  The gimple optimisers would then fold
vget_high(constant) in a similar way to how cse1 does now.

So perhaps we should continue to allow the vget_high(constant) to be
folded in cse1 and come up with some way of coping with the folded form.

Thanks,
Richard