While looking at a profile of gcc, I noticed that one thing fairly high up the list was a loop iterating over all the registers in a REG, apparently due to the delay in computing the index for hard_regno_nregs and then loading the value (which would often be an L1 cache miss).
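
For readers who don't have the code in their head, here is a rough, self-contained sketch of the kind of loop in question. It is not the gcc source: mock_reg, visit_regs, mock_init_table and the table dimensions are hypothetical stand-ins, and only the hard_regno_nregs name and its (regno, mode) indexing are meant to mirror the real table.

```c
#include <assert.h>

/* Hypothetical sizes, chosen only to keep the sketch small.  */
enum { NUM_REGS = 16, NUM_MODES = 4 };

/* Mock of the hard_regno_nregs table: the number of hard registers
   needed to hold a value of mode MODE starting at register REGNO.  */
static unsigned char hard_regno_nregs[NUM_REGS][NUM_MODES];

/* Hypothetical stand-in for a REG rtx; the real layout differs.  */
struct mock_reg
{
  unsigned int regno;
  unsigned int mode;
};

/* Fill the mock table: everything needs one register, except that
   mode 1 starting at register 3 needs two (an arbitrary example).  */
static void
mock_init_table (void)
{
  for (unsigned int r = 0; r < NUM_REGS; ++r)
    for (unsigned int m = 0; m < NUM_MODES; ++m)
      hard_regno_nregs[r][m] = 1;
  hard_regno_nregs[3][1] = 2;
}

/* Record every hard register covered by REG in VISITED and return
   how many there were.  */
static unsigned int
visit_regs (const struct mock_reg *reg, unsigned int *visited)
{
  assert (reg->regno < NUM_REGS && reg->mode < NUM_MODES);

  /* The loop bound depends on a two-dimensional index computation
     followed by a table load, both of which must complete before
     the loop's exit condition can be evaluated.  */
  unsigned int end = reg->regno + hard_regno_nregs[reg->regno][reg->mode];
  unsigned int n = 0;
  for (unsigned int i = reg->regno; i < end; ++i)
    visited[n++] = i;
  return n;
}
```

In this sketch the body itself is trivial, so the index computation and the table load are the only real work done per REG.
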

When we were adding CONST_WIDE_INT, the general opinion seemed to be that we should lay out rtxes for LP64 hosts rather than try to have two alternative layouts, one optimised for ILP32 and one for LP64. We therefore unconditionally filled the 32-bit hole (on LP64) between the rtx header and the main union with extra data. That area is already used by REGs to store ORIGINAL_REGNO, but on LP64 hosts there's another hole in the REGNO field itself.

This series takes that idea a step further and uses the hole to store the number of registers in a REG. This still leaves 24 redundant bits that could be used for other things in future. That's actually enough to store a SUBREG of a REG (8 bits for the inner mode, 16 for the offset), but having a single rtx for that would probably cause too many problems.

The series sped up an --enable-checking=release gcc by just over 0.5% for various tests on my box. Not a big saving, but hopefully the patches also count as a clean-up.

As a follow-on, I'd like to add a FOR_EACH_* macro that iterates over all the registers in a REG. These loops always execute at least once, and rarely more than once, and it would be good to model that in the iterator so that all use sites benefit.

Each patch in the series was individually bootstrapped & regression-tested on x86_64-linux-gnu.

Thanks,
Richard