On 06/21/10 16:13, Sebastian Pop wrote:
Hi,

I was looking at why, in the vectorized DCT kernel of FFmpeg, the insn
selection of GCC fails to produce XOP fused-multiply-add vector insns:
DOM is detecting a redundant expression that is optimized, and that
makes it impossible to detect the higher level insns in combine.

The DCT kernel looks like this:

static void
dct_unquantize_h263_inter_c (DCTELEM * block, int qscale, int nCoeffs)
{
   int i, level, qmul, qadd;

   qadd = (qscale - 1) | 1;
   qmul = qscale << 1;

   for (i = 0; i <= nCoeffs; i++)
     {
       level = block[i];

       if (level < 0)
        level = level * qmul + qadd;
       else
        level = level * qmul - qadd;

       block[i] = level;
     }
}

The expression "level * qmul" appears in both arms of the condition,
so it is treated as redundant and hoisted out of the condition:

       level = level * qmul;
       if (level < 0)
        level += qadd;
       else
        level -= qadd;

On this code GCC fails to combine the + and the - with the *, as they
both depend on the same computation.  However, if I modify the DCT
kernel to artificially remove the redundancy:

       if (level < 0)
        level = level * qmul + qadd;
       else
        level = level * qadd - qmul;

the kernel is vectorized with the expected insns:

        vpmacsdd        %xmm1, %xmm6, %xmm0, %xmm3
        vpmacsdd        %xmm5, %xmm1, %xmm0, %xmm2
        vpcomltd        %xmm4, %xmm0, %xmm0
        vpcmov  %xmm0, %xmm2, %xmm3, %xmm0

Here is the slower and larger code generated for the original DCT,
with one * and two +:

        vpmulld %xmm6, %xmm0, %xmm1
        vpcomltd        %xmm3, %xmm0, %xmm0
        vpaddd  %xmm5, %xmm1, %xmm2
        vpaddd  %xmm4, %xmm1, %xmm1
        vpcmov  %xmm0, %xmm1, %xmm2, %xmm0
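
To make the two instruction shapes easier to compare, here is a rough
scalar sketch of what one 32-bit lane computes in each case.  It is only
a model under assumptions: the helper names (lane_macs, lane_lt_mask,
lane_select, lane_fused, lane_unfused) are made up for illustration,
lane_fused shows the shape one would hope for on the original kernel
(two multiply-accumulates sharing the same multiplicands), and
wraparound on overflow is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one 32-bit lane of a multiply-accumulate,
   keeping the low 32 bits of a * b + c.  */
static int32_t lane_macs(int32_t a, int32_t b, int32_t c)
{
    return (int32_t)((uint32_t)a * (uint32_t)b + (uint32_t)c);
}

/* Hypothetical helper: all-ones mask when a < b, zero otherwise.  */
static uint32_t lane_lt_mask(int32_t a, int32_t b)
{
    return a < b ? 0xFFFFFFFFu : 0u;
}

/* Hypothetical helper: bitwise select, taking the bits of x where
   the mask is set and the bits of y elsewhere.  */
static int32_t lane_select(uint32_t mask, int32_t x, int32_t y)
{
    return (int32_t)(((uint32_t)x & mask) | ((uint32_t)y & ~mask));
}

/* Fused shape: two multiply-accumulates, one compare, one select.  */
static int32_t lane_fused(int32_t level, int32_t qmul, int32_t qadd)
{
    int32_t neg = lane_macs(level, qmul, qadd);
    int32_t pos = lane_macs(level, qmul, (int32_t)(0u - (uint32_t)qadd));
    return lane_select(lane_lt_mask(level, 0), neg, pos);
}

/* Unfused shape: one multiply, two adds, one compare, one select.  */
static int32_t lane_unfused(int32_t level, int32_t qmul, int32_t qadd)
{
    uint32_t prod = (uint32_t)level * (uint32_t)qmul;
    int32_t neg = (int32_t)(prod + (uint32_t)qadd);
    int32_t pos = (int32_t)(prod - (uint32_t)qadd);
    return lane_select(lane_lt_mask(level, 0), neg, pos);
}

/* Self-check: both shapes agree on a small range of inputs.  */
static void lane_self_check(void)
{
    for (int32_t level = -64; level <= 64; level++)
        assert(lane_fused(level, 6, 3) == lane_unfused(level, 6, 3));
}
```

Both shapes give the same value per lane; the difference is only in
how many operations are needed to get there.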

Is there a simple way to teach combine how to introduce redundancy to
generate higher level insns?

Ouch. You've got another problem in that combine doesn't combine across basic blocks.

Can you attack it in forwprop?

I'm a little surprised DOM removed the multiplication -- it's not actually a runtime redundancy; it's more like code hoisting, since on any given iteration of the loop the expression level * qmul is only evaluated once.
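
The hoisting point can be checked with a small sketch.  This is only an
illustration under assumptions: mul_count, counted_mul, dequant_source,
dequant_hoisted, count_muls and check_hoisting are hypothetical names,
and the hoisted form keeps the product in a separate temporary so the
branch still tests the original level.  Both forms execute exactly one
multiplication per element, so nothing is saved at run time.

```c
#include <assert.h>

/* Hypothetical counter: multiplications executed so far.  */
static unsigned long mul_count;

static int counted_mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* Source form: the multiplication appears once in each arm.  */
static int dequant_source(int level, int qmul, int qadd)
{
    if (level < 0)
        return counted_mul(level, qmul) + qadd;
    else
        return counted_mul(level, qmul) - qadd;
}

/* Hoisted form: the multiplication is moved above the branch.  */
static int dequant_hoisted(int level, int qmul, int qadd)
{
    int prod = counted_mul(level, qmul);

    if (level < 0)
        return prod + qadd;
    else
        return prod - qadd;
}

/* Run one form over n values and return the multiplications executed.  */
static unsigned long count_muls(int (*f)(int, int, int),
                                const int *block, int n, int *out)
{
    mul_count = 0;
    for (int i = 0; i < n; i++)
        out[i] = f(block[i], 6, 3);   /* qscale = 3: qmul = 6, qadd = 3 */
    return mul_count;
}

/* Self-check: same results, and one multiplication per element.  */
static void check_hoisting(void)
{
    static const int block[5] = { -7, -1, 0, 1, 12 };
    int a[5], b[5];

    assert(count_muls(dequant_source, block, 5, a) == 5UL);
    assert(count_muls(dequant_hoisted, block, 5, b) == 5UL);
    for (int i = 0; i < 5; i++)
        assert(a[i] == b[i]);
}
```

The static count of multiplications drops from two to one, but the
dynamic count per iteration is one either way.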
Jeff
