Hi, I've recently been working on vectorization in GCC. I'm stuck on a
small problem and would like to ask for advice.
For example, take the following code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    int size = 1000;
    int *foo = malloc(sizeof(int) * size);
    int c1 = rand(), t1 = rand();

    for (int i = 0; i < size; i++) {
        if (foo[i] & c1) {
            foo[i] = t1;
        }
    }

    // prevents the loop above from being optimized away
    for (int i = 0; i < size; i++) {
        printf("%d", foo[i]);
    }
}
First of all, the if statement in the loop is converted to a
MASK_STORE by the if-conversion pass. But after the tree vectorizer
runs, the generated code still contains a branch. Part of the final
disassembly looks roughly like the listing below (decompiled with
IDA); you can see that the branch 'if ( !_ZF )' is still there, which
I expect to hurt performance.
do
{
    while ( 1 )
    {
        __asm
        {
            vpand ymm0, ymm2, ymmword ptr [rax]
            vpcmpeqd ymm0, ymm0, ymm1
            vpcmpeqd ymm0, ymm0, ymm1
            vptest ymm0, ymm0
        }
        if ( !_ZF )
            break;
        _RAX += 8;
        if ( _RAX == v9 )
            goto LABEL_5;
    }
    __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
    _RAX += 8;
}
while ( _RAX != v9 );
Why can't we just replace the vptest and the branch with other
instructions such as vpblendvb, so that the loop is faster? Or is
there a better way to do this?
Thanks
Hanke Zhang