Richard Biener <richard.guent...@gmail.com> wrote on Tue, Oct 17, 2023 at 17:26:
>
> On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi, I've recently been working on vectorization in GCC. I'm stuck on a small
> > problem and would like to ask for advice.
> >
> > For example, for the following code:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > int main() {
> >   int size = 1000;
> >   int *foo = malloc(sizeof(int) * size);
> >   int c1 = rand(), t1 = rand();
> >
> >   for (int i = 0; i < size; i++) {
> >     if (foo[i] & c1) {
> >       foo[i] = t1;
> >     }
> >   }
> >
> >   // prevents the loop above from being optimized
> >   for (int i = 0; i < size; i++) {
> >     printf("%d", foo[i]);
> >   }
> > }
> >
> > First of all, the if statement block in the loop is converted to
> > a MASK_STORE by the if-conversion pass. But after vectorization
> > the code still ends up in a branched form. Part of the final
> > disassembly looks roughly like the following (decompiled with IDA),
> > and you can see that there is still a branch 'if ( !_ZF )' in it,
> > which leads to low efficiency.
> >
> > do
> >   {
> >     while ( 1 )
> >     {
> >       __asm
> >       {
> >         vpand   ymm0, ymm2, ymmword ptr [rax]
> >         vpcmpeqd ymm0, ymm0, ymm1
> >         vpcmpeqd ymm0, ymm0, ymm1
> >         vptest  ymm0, ymm0
> >       }
> >       if ( !_ZF )
> >         break;
> >       _RAX += 8;
> >       if ( _RAX == v9 )
> >         goto LABEL_5;
> >     }
> >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> >     _RAX += 8;
> >   }
> >   while ( _RAX != v9 );
> >
> > Why can't we just replace the vptest and if statement with some other
> > instructions like vpblendvb so that it can be faster? Or is there a
> > good way to do that?
>
> The branch is added by optimize_mask_stores after vectorization because
> fully masked (disabled) masked stores can incur quite a heavy penalty on
> some architectures when fault assists (read-only pages, but also COW pages)
> are triggered.  All the microcode handling may need to be carried out
> multiple times, once for each such access to the same page.  That can cause
> a 1000x slowdown when you hit this case.  Thus every masked store
> is replaced by
>
>  if (mask != 0)
>    masked_store ();
>
> and this is an optimization (which itself has a small cost).
>
> Richard.

Yeah, I know that, and I have read the code of optimize_mask_stores().
The main problem here is that when multiple MASK_STOREs appear in
the same loop, many branches appear as well, which reduces
overall efficiency.
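
To make the shape of the guard concrete, here is a minimal scalar sketch in C of the `if (mask != 0) masked_store ();` pattern Richard describes, applied per 8-lane chunk. The names `LANES`, `guarded_masked_store` and `cond_store_guarded` are hypothetical and exist only for this illustration; the code models the control flow, not GCC's actual output, and it assumes the element count is a multiple of `LANES`.

```c
#include <stddef.h>

#define LANES 8 /* hypothetical: one 256-bit vector of 32-bit ints */

/* Models `if (mask != 0) masked_store ();` for one chunk.
   Returns 1 if the store was executed, 0 if it was skipped. */
static int guarded_masked_store(int *dst, const int *mask, int value)
{
    int any = 0;
    for (size_t l = 0; l < LANES; l++)
        any |= mask[l];
    if (!any)                /* the branch added after vectorization */
        return 0;
    for (size_t l = 0; l < LANES; l++)
        if (mask[l])
            dst[l] = value;  /* only active lanes are written */
    return 1;
}

/* Applies the loop from the original example chunk by chunk.
   Assumes n is a multiple of LANES. Returns the number of chunks
   for which the masked store was actually executed. */
static size_t cond_store_guarded(int *foo, size_t n, int c1, int t1)
{
    size_t stored = 0;
    for (size_t i = 0; i < n; i += LANES) {
        int mask[LANES];
        for (size_t l = 0; l < LANES; l++)
            mask[l] = (foo[i + l] & c1) != 0;
        stored += guarded_masked_store(&foo[i], mask, t1);
    }
    return stored;
}
```

With one conditional store per iteration there is one such test per chunk; a loop body containing several conditional stores under different masks would repeat the test for each of them.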

My original idea was to ask why we can't replace MASK_STORE with more
efficient SIMD instructions, because icc does much better in this
case. I have since given that up, because gcc's vectorization analysis
is not as strong as icc's and I don't have the ability to
modify this part of the code myself.
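
For reference, the blend idea can be written by hand at the source level. The following is a minimal sketch in C, with the hypothetical name `cond_store_blend`, that turns the conditional store into an unconditional load, select and store. It is only valid under the assumption that every element of `foo` is writable and not accessed concurrently by another thread, because it writes back elements the original loop never touched. That is also why a compiler cannot, in general, make this change on its own. Whether a given compiler turns this loop into a vector blend depends on the target and options.

```c
#include <stddef.h>

/* Branch-free select-and-store: every element is loaded and written back.
   Assumes foo[0..n) is writable and not shared with other threads. */
static void cond_store_blend(int *foo, size_t n, int c1, int t1)
{
    for (size_t i = 0; i < n; i++) {
        int old = foo[i];
        int m = -((old & c1) != 0);     /* all-ones if the condition holds, else 0 */
        foo[i] = (t1 & m) | (old & ~m); /* pick t1 or the old value per element */
    }
}
```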

Thanks very much for your reply.

>
> >
> > Thanks
> > Hanke Zhang
