On 06/21/10 16:13, Sebastian Pop wrote:
Hi,
I was looking at why, in the vectorized DCT kernel of FFmpeg, the insn
selection of GCC fails to produce XOP fused-multiply-add vector insns:
DOM detects a redundant expression and optimizes it away, and that
makes it impossible for combine to detect the higher level insns.
The DCT kernel looks like this:
static void
dct_unquantize_h263_inter_c (DCTELEM * block, int qscale, int nCoeffs)
{
  int i, level, qmul, qadd;
  qadd = (qscale - 1) | 1;
  qmul = qscale << 1;
  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];
      if (level < 0)
        level = level * qmul + qadd;
      else
        level = level * qmul - qadd;
      block[i] = level;
    }
}
The expression "level * qmul" appears in both branches, so DOM treats
it as redundant and moves it out of the condition:
  level = level * qmul;
  if (level < 0)
    level += qadd;
  else
    level -= qadd;
On this code GCC fails to combine the + and the - with *, as they both
depend on the same computation. However, if I modify the DCT
kernel to artificially remove the redundancy:
  if (level < 0)
    level = level * qmul + qadd;
  else
    level = level * qadd - qmul;
the kernel is vectorized with the expected insns:
vpmacsdd %xmm1, %xmm6, %xmm0, %xmm3
vpmacsdd %xmm5, %xmm1, %xmm0, %xmm2
vpcomltd %xmm4, %xmm0, %xmm0
vpcmov %xmm0, %xmm2, %xmm3, %xmm0
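To make the operand roles in that sequence explicit, here is a scalar, one-lane C sketch of what I understand the three XOP insns to compute. The helper names (macsdd, comltd, cmov, modified_kernel_lane, check_lane) are made up for illustration, and the mapping of values to operands is my reading of the AT&T operand order, not something stated in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit lane of vpmacsdd: multiply, then add, keeping the low
   32 bits.  Unsigned arithmetic avoids signed-overflow UB here.  */
static int32_t
macsdd (int32_t a, int32_t b, int32_t c)
{
  return (int32_t) ((uint32_t) a * (uint32_t) b + (uint32_t) c);
}

/* One lane of vpcomltd: all ones if a < b (signed), else zero.  */
static int32_t
comltd (int32_t a, int32_t b)
{
  return a < b ? -1 : 0;
}

/* One lane of vpcmov: bitwise select, taking if_set where the mask
   bits are 1 and if_clear where they are 0.  */
static int32_t
cmov (int32_t if_set, int32_t if_clear, int32_t mask)
{
  return (if_set & mask) | (if_clear & ~mask);
}

/* Scalar model of the four-insn sequence for the modified kernel.  */
static int32_t
modified_kernel_lane (int32_t level, int32_t qmul, int32_t qadd)
{
  int32_t neg = macsdd (level, qmul, qadd);   /* level * qmul + qadd */
  int32_t pos = macsdd (level, qadd, -qmul);  /* level * qadd - qmul */
  int32_t mask = comltd (level, 0);           /* level < 0 ?  */
  return cmov (neg, pos, mask);
}

/* Compare the model against the branchy source form for one input.  */
static void
check_lane (int32_t level, int32_t qmul, int32_t qadd)
{
  int32_t expect = level < 0 ? level * qmul + qadd : level * qadd - qmul;
  assert (modified_kernel_lane (level, qmul, qadd) == expect);
}
```

Both arms are computed unconditionally and the compare result only selects between them, which is why each arm needs its own multiply for the multiply-accumulate pattern to match.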
Here is the slower and larger code generated for the original DCT,
with one * and two +:
vpmulld %xmm6, %xmm0, %xmm1
vpcomltd %xmm3, %xmm0, %xmm0
vpaddd %xmm5, %xmm1, %xmm2
vpaddd %xmm4, %xmm1, %xmm1
vpcmov %xmm0, %xmm1, %xmm2, %xmm0
Is there a simple way to teach combine how to introduce redundancy to
generate higher level insns?
Ouch. You've got another problem in that combine doesn't combine across
basic blocks.
Can you attack it in forwprop?
I'm a little surprised DOM removed the multiplication -- it's not
actually a runtime redundancy; it's more like code hoisting, since on
any given iteration of the loop the expression level * qmul is only
evaluated once.
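To illustrate that point, here is a small C sketch (not GCC code) that counts the multiplies executed by the original loop body and by a hoisted form. The counters, the function names, and the DCTELEM typedef are assumptions made for illustration. The hoisted form keeps the sign test on the original coefficient, which is my guess at what the optimized code effectively does.

```c
#include <assert.h>

typedef short DCTELEM;  /* assumption: 16-bit coefficients */

/* Original form: the multiply appears in both arms, but only one arm
   runs per iteration.  Returns the number of multiplies executed.  */
static int
unquantize_branchy (DCTELEM *block, int qscale, int nCoeffs)
{
  int i, level, muls = 0;
  int qadd = (qscale - 1) | 1;
  int qmul = qscale << 1;
  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];
      if (level < 0)
        {
          level = level * qmul + qadd;
          muls++;
        }
      else
        {
          level = level * qmul - qadd;
          muls++;
        }
      block[i] = level;
    }
  return muls;
}

/* Hoisted form: one multiply ahead of the branch.  */
static int
unquantize_hoisted (DCTELEM *block, int qscale, int nCoeffs)
{
  int i, level, prod, muls = 0;
  int qadd = (qscale - 1) | 1;
  int qmul = qscale << 1;
  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];
      prod = level * qmul;
      muls++;
      level = level < 0 ? prod + qadd : prod - qadd;
      block[i] = level;
    }
  return muls;
}

/* Run both forms on copies of the same data; return 1 if the outputs
   and the multiply counts match, 0 otherwise.  */
static int
same_results_and_counts (const DCTELEM *in, int qscale, int nCoeffs)
{
  DCTELEM a[64], b[64];
  int i, na, nb;
  assert (nCoeffs >= 0 && nCoeffs < 64);
  for (i = 0; i <= nCoeffs; i++)
    a[i] = b[i] = in[i];
  na = unquantize_branchy (a, qscale, nCoeffs);
  nb = unquantize_hoisted (b, qscale, nCoeffs);
  for (i = 0; i <= nCoeffs; i++)
    if (a[i] != b[i])
      return 0;
  return na == nb;
}
```

Both forms execute nCoeffs + 1 multiplies, so the hoisting changes code size rather than the dynamic multiply count.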
Jeff