From: Richard Henderson <r...@redhat.com>
Date: Wed, 12 Oct 2011 17:49:19 -0700

> There's a code sample 7-1 that illustrates a 16x16 multiply:
> 
>       fmul8sux16 %f0, %f1, %f2
>       fmul8ulx16 %f0, %f1, %f3
>       fpadd16    %f2, %f3, %f4

Be wary of code examples that don't even assemble: these instructions
take double-precision operands, so even-numbered float registers are
required here, and %f1 and %f3 are not valid.

fmul8sux16 basically does, for each 16-bit element:

        src1 = (rs1 >> 8) & 0xff;
        src2 = rs2 & 0xffff;

        product = src1 * src2;

        scaled = (product & 0x00ffff00) >> 8;
        if (product & 0x80)
                scaled++;

        rd = scaled & 0xffff;

fmul8ulx16 does the same except the assignment to src1 is:

        src1 = rs1 & 0xff;

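So I think this "16 x 16 multiply" operation isn't the kind you think
it is.  Each instruction multiplies only one 8-bit half of rs1 by the
16-bit rs2 and keeps the rounded upper 16 bits of that 24-bit product,
so the sum of the two is not the low 16 bits of a full 16x16 product.
It's therefore not appropriate to use this in the compiler for vector
multiplies.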
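
To make that concrete, here is a minimal C sketch of one 16-bit lane,
transcribed from the pseudo-code above.  It is only a model of that
pseudo-code, not of the hardware: signedness is ignored just as it is
above, and the names scale_product, model_fmul8sux16, model_fmul8ulx16,
model_sequence and print_comparison are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Common tail of both pseudo-code fragments: keep bits 8..23 of the
   product and round up when bit 7 is set.  (Hypothetical helper.) */
static uint16_t scale_product(uint32_t src1, uint32_t src2)
{
    assert(src1 <= 0xffu && src2 <= 0xffffu);

    uint32_t product = src1 * src2;          /* at most 24 bits wide */
    uint32_t scaled = (product & 0x00ffff00u) >> 8;
    if (product & 0x80u)
        scaled++;
    return (uint16_t)(scaled & 0xffffu);
}

/* One lane of fmul8sux16 as described above: upper byte of rs1. */
uint16_t model_fmul8sux16(uint16_t rs1, uint16_t rs2)
{
    return scale_product((rs1 >> 8) & 0xffu, rs2);
}

/* One lane of fmul8ulx16 as described above: lower byte of rs1. */
uint16_t model_fmul8ulx16(uint16_t rs1, uint16_t rs2)
{
    return scale_product(rs1 & 0xffu, rs2);
}

/* The quoted three-instruction sequence, for a single lane. */
uint16_t model_sequence(uint16_t rs1, uint16_t rs2)
{
    return (uint16_t)(model_fmul8sux16(rs1, rs2) +
                      model_fmul8ulx16(rs1, rs2));
}

/* Print the modelled result next to the low 16 bits of a plain multiply. */
void print_comparison(uint16_t a, uint16_t b)
{
    unsigned plain = (uint16_t)((uint32_t)a * b);
    printf("%u * %u: sequence = %u, plain multiply = %u\n",
           (unsigned)a, (unsigned)b, (unsigned)model_sequence(a, b), plain);
}
```

Under this model, model_sequence(3, 5) is 0 while 3 * 5 is 15: with
small operands every product sits below bit 7 and is shifted away
entirely.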

Just for shits and grins I tried it and the slp-7 testcase, as expected,
fails.  The main multiply loop in that test case is compiled to:

        sethi   %hi(.LLC6), %i3
        sethi   %hi(in2), %g1
        ldd     [%i3+%lo(.LLC6)], %f22
        sethi   %hi(.LLC7), %i4
        sethi   %hi(.LLC8), %i2
        sethi   %hi(.LLC9), %i3
        add     %fp, -256, %g2
        ldd     [%i4+%lo(.LLC7)], %f20
        or      %g1, %lo(in2), %g1  
        ldd     [%i2+%lo(.LLC8)], %f18
        mov     %fp, %i5
        ldd     [%i3+%lo(.LLC9)], %f16
        mov     %g1, %g4
        mov     %g2, %g3
.LL10:
        ldd     [%g4+8], %f14
        ldd     [%g4+16], %f12
        fmul8sux16      %f14, %f22, %f26
        ldd     [%g4+24], %f10    
        fmul8ulx16      %f14, %f22, %f24
        ldd     [%g4], %f8
        fmul8sux16      %f12, %f20, %f34
        fmul8ulx16      %f12, %f20, %f32
        fmul8sux16      %f10, %f18, %f30
        fpadd16 %f26, %f24, %f14
        fmul8ulx16      %f10, %f18, %f28
        fmul8sux16      %f8, %f16, %f26
        fmul8ulx16      %f8, %f16, %f24
        fpadd16 %f34, %f32, %f12
        std     %f14, [%g3+8]
        fpadd16 %f30, %f28, %f10
        std     %f12, [%g3+16]
        fpadd16 %f26, %f24, %f8
        std     %f10, [%g3+24]
        std     %f8, [%g3]
        add     %g3, 32, %g3
        cmp     %g3, %i5
        bne,pt  %icc, .LL10
         add    %g4, 32, %g4

and it simply gives the wrong results.

The entire out2[] array is all zeros.
