Using 'gcc -Os -fomit-frame-pointer -march=core2 -mtune=core2' for

unsigned short mul_high_c(unsigned short a, unsigned short b)
{
    return (unsigned)(a * b) >> 16;
}

unsigned short mul_high_asm(unsigned short a, unsigned short b)
{
    unsigned short res;
    asm("mulw %w2" : "=d"(res),"+a"(a) : "rm"(b));
    return res;
}

I get

_mul_high_c:
        subl    $12, %esp
        movzwl  20(%esp), %eax
        movzwl  16(%esp), %edx
        addl    $12, %esp
        imull   %edx, %eax
        shrl    $16, %eax
        ret
_mul_high_asm:
        subl    $12, %esp
        movl    16(%esp), %eax
        mulw 20(%esp)
        addl    $12, %esp
        movl    %edx, %eax
        ret

mulw leaves the 32-bit product in dx:ax, so dx already holds the product
shifted right by 16 and no separate shift is needed.
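
For anyone who wants to check that claim without inline asm, here is a rough
portable C sketch. It is not part of the test case above, and the names
mul_high_ref, mulw_model and count_mismatches are made up for illustration.
It models the dx:ax split in plain C and compares dx against the shifted
32-bit product over a sample of inputs. The operand is widened before the
multiply, because a * b on two unsigned shorts is evaluated as int and can
overflow for large inputs.

```c
#include <stdint.h>

/* Reference result: widen first so the multiply is done in unsigned
   32-bit arithmetic, then keep the upper 16 bits of the product. */
static uint16_t mul_high_ref(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Hypothetical model of the register pair that mulw writes:
   dx gets the high half of the product, ax gets the low half. */
struct dx_ax {
    uint16_t dx;
    uint16_t ax;
};

static struct dx_ax mulw_model(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * b;
    struct dx_ax r;
    r.dx = (uint16_t)(product >> 16);
    r.ax = (uint16_t)(product & 0xFFFFu);
    return r;
}

/* Walk a sample of the input space and count cases where dx is not the
   shifted product, or where dx:ax does not recombine into the product. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a <= 0xFFFFu; a += 257) {
        for (uint32_t b = 0; b <= 0xFFFFu; b += 251) {
            struct dx_ax r = mulw_model((uint16_t)a, (uint16_t)b);
            uint32_t rebuilt = ((uint32_t)r.dx << 16) | r.ax;
            if (r.dx != mul_high_ref((uint16_t)a, (uint16_t)b))
                bad++;
            if (rebuilt != a * b)
                bad++;
        }
    }
    return bad;
}
```

Calling count_mismatches() from a small main and checking that it returns 0
is enough to exercise it; swapping mulw_model for the inline asm version on
an x86 build would turn it into a check of the real instruction.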

Ignoring the weird Darwin stack adjustment code, the version with mulw is
somewhat shorter and avoids a movzwl. I'm not sure what the performance
difference is; mulw is listed in Agner's tables as fairly low latency, but
it requires a length-changing prefix for memory.

This type of operation is useful in fixed-point math, such as embedded audio
codecs or arithmetic coders.
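
As an illustration of the kind of code I have in mind (a sketch only, with
hypothetical names scale_q16 and apply_gain, not taken from any particular
codec): scaling an unsigned 16-bit sample by a gain stored as a 0.16
fixed-point fraction is exactly the (uint16 * uint16)>>16 pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply a 16-bit value by a gain in [0, 1) stored as a 0.16
   fixed-point fraction (0x8000 means 0.5). The result is the high
   half of the 32-bit product. */
static uint16_t scale_q16(uint16_t value, uint16_t gain_q16)
{
    return (uint16_t)(((uint32_t)value * gain_q16) >> 16);
}

/* Apply the same gain to a whole buffer in place. */
static void apply_gain(uint16_t *samples, size_t count, uint16_t gain_q16)
{
    for (size_t i = 0; i < count; i++)
        samples[i] = scale_q16(samples[i], gain_q16);
}
```

With a gain of 0x8000, scale_q16(40000, 0x8000) gives 20000. An arithmetic
coder splitting a 16-bit range by a 16-bit probability has the same shape.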


-- 
           Summary: x86 -Os could use mulw for (uint16 * uint16)>>16
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: astrange at ithinksw dot com
 GCC build triplet: i?86-*-*
  GCC host triplet: i?86-*-*
GCC target triplet: i?86-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39329
