Using 'gcc -Os -fomit-frame-pointer -march=core2 -mtune=core2' for

unsigned short mul_high_c(unsigned short a, unsigned short b)
{
  return (unsigned)(a * b) >> 16;
}
unsigned short mul_high_asm(unsigned short a, unsigned short b)
{
  unsigned short res;
  asm("mulw %w2" : "=d"(res), "+a"(a) : "rm"(b));
  return res;
}

I get:

_mul_high_c:
        subl    $12, %esp
        movzwl  20(%esp), %eax
        movzwl  16(%esp), %edx
        addl    $12, %esp
        imull   %edx, %eax
        shrl    $16, %eax
        ret

_mul_high_asm:
        subl    $12, %esp
        movl    16(%esp), %eax
        mulw    20(%esp)
        addl    $12, %esp
        movl    %edx, %eax
        ret

mulw puts its 32-bit result in dx:ax, and dx holds (dx:ax) >> 16, so the explicit shift is avoided. Ignoring the odd Darwin stack-adjustment code, the mulw version is somewhat shorter and avoids a movzwl. I'm not sure what the performance difference is; mulw is listed in Agner's tables as fairly low latency, but it requires a length-changing prefix when the operand is in memory. This type of operation is useful in fixed-point math, such as embedded audio codecs or arithmetic coders.

-- 
Summary: x86 -Os could use mulw for (uint16 * uint16)>>16
Product: gcc
Version: 4.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: astrange at ithinksw dot com
GCC build triplet: i?86-*-*
GCC host triplet: i?86-*-*
GCC target triplet: i?86-*-*

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39329