Jacob Navia via Gcc <gcc@gcc.gnu.org> writes:

> We have 2 loads, and 1 operation + a store. 4 instructions compared to
> 46 operations for the « gcc way » (16 loads of a byte, 14 x 2 OR
> operations and 8 shifts to split the result and 8 stores of a byte
> each.

The sample code seems to have a couple of errors; I fixed it up and put
it on godbolt:

https://godbolt.org/z/obbr7K7dx

Let me know if the fixups were wrong.

The issue should probably be reported on Bugzilla as a
missed-optimization bug.

/Benny
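
P.S. For anyone reading along without opening the link, here is a rough
sketch of the kind of comparison under discussion. It is only my guess at
the shape of the code, not the sample from the original mail and not the
godbolt paste: the names or_bytes, or_word and or_variants_agree are
hypothetical, and the 8-byte buffer size is inferred from the load and
store counts in the quote.

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time version: OR each of the 8 bytes of src into dst.
 * Hypothetical stand-in for the code the compiler is given. */
static void or_bytes(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; i < 8; i++)
        dst[i] |= src[i];
}

/* Word-at-a-time version: the "2 loads, 1 operation, 1 store" form.
 * memcpy is used for the loads and the store so that alignment and
 * aliasing stay well defined. */
static void or_word(unsigned char *dst, const unsigned char *src)
{
    uint64_t a, b;
    memcpy(&a, dst, sizeof a);
    memcpy(&b, src, sizeof b);
    a |= b;
    memcpy(dst, &a, sizeof a);
}

/* Returns 1 if both versions leave the same bytes in the buffer. */
static int or_variants_agree(void)
{
    unsigned char x[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    unsigned char y[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    const unsigned char s[8] = {1, 1, 1, 1, 1, 1, 1, 1};

    or_bytes(x, s);
    or_word(y, s);
    return memcmp(x, y, sizeof x) == 0;
}
```

OR works on each bit independently, so the two functions give the same
result whatever the byte order. Whether a given compiler version turns
or_bytes into the or_word form is the thing to check in the generated
assembly.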