https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498
--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Two further cases:

unsigned
f10 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

unsigned
f11 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

On f11 GCC also generates efficient code; on f10 it emits a useless &.
I guess the f10 case would be improved by adding a
*<rotate_insn><mode>3_mask_1 define_insn_and_split (and similarly the
inefficient/nonportable f1 code would be slightly improved).

Looking at LLVM, f1/f3/f5 are worse in LLVM than in GCC, and in all those
cases it uses branching instead of cmov; f7/f8/f9/f10/f11 all generate
efficient code though, i.e. the same as GCC in the case of f8 and f11.
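For readers unfamiliar with the idiom, here is a minimal standalone sketch
(not part of the bug report; the name rotl32 and the test values are purely
illustrative) of the same portable rotate pattern that f10/f11 exercise.
The "-y & 31" mask keeps the right-shift count in the 0..31 range even for
y == 0 (where a plain "32 - y" shift count would be undefined behavior),
which is what allows the compiler to recognize the whole expression as a
single rotate instruction on targets that have one:

#include <assert.h>

static unsigned
rotl32 (unsigned x, unsigned char y)
{
  /* Reduce the rotate count to 0..31, as in f10.  */
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  /* -y & 31 yields 32 - y for y in 1..31, and 0 for y == 0,
     so both shift counts stay in range.  */
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

int
main (void)
{
  assert (rotl32 (0x12345678u, 0) == 0x12345678u);  /* y == 0 edge case */
  assert (rotl32 (0x80000001u, 1) == 0x00000003u);  /* top bit wraps around */
  assert (rotl32 (0x12345678u, 4) == 0x23456781u);
  return 0;
}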