How to make parallelizing loops and vectorization work at the same time?

2023-09-15 Thread Hanke Zhang via Gcc
Hi I'm trying to accelerate my program with -ftree-vectorize and
-ftree-parallelize-loops.

Here are my test results using the different options (based on
gcc10.3.0 on i9-12900KF):
gcc-10 test.c -O3 -flto
> time: 29000 ms
gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> time: 17000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> time: 5000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> time: 5000 ms

I found that these two options do not work at the same time. If I use
the `-ftree-vectorize` option alone, it brings a big efficiency gain
compared to doing nothing; likewise, `-ftree-parallelize-loops` alone
also brings a big gain. But if I use both options, vectorization fails:
I can't get the benefits of vectorization, only the benefits of
parallelizing loops.

I know that the reason may be that after parallelizing the loop,
vectorization cannot be performed, but is there any way I can reap the
benefits of both optimizations?

Here is my example program, adapted from the 462.libquantum in speccpu2006:

```
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

#define MAX_UNSIGNED unsigned long long

struct quantum_reg_node_struct {
float _Complex *amplitude; /* alpha_j */
MAX_UNSIGNED *state;   /* j */
};

typedef struct quantum_reg_node_struct quantum_reg_node;

struct quantum_reg_struct {
int width; /* number of qubits in the qureg */
int size;  /* number of non-zero vectors */
int hashw; /* width of the hash array */
quantum_reg_node *node;
int *hash;
};

typedef struct quantum_reg_struct quantum_reg;

void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
for (int i = 0; i < reg->size; i++) {
if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2))  {
reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
}
}
}
}

int get_random() {
return rand() % 64;
}

void init(quantum_reg *reg) {
reg->size = 2097152;
for (int i = 0; i < reg->size; i++)  {
reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED)
* reg->size);
reg->node->amplitude = (float _Complex *)malloc(sizeof(float
_Complex) * reg->size);
if (i >= 1) break;
}
for (int i = 0; i < reg->size; i++)  {
reg->node->amplitude[i] = 0;
reg->node->state[i] = 0;
}
}

int main() {
quantum_reg reg;
init(&reg);
for (int i = 0; i < 65000; i++) {
quantum_toffoli(get_random(), get_random(), get_random(), &reg);
}
}
```
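
For comparison, a minimal sketch (an assumption on my part, not something the options above do automatically) of requesting both transforms explicitly with OpenMP, reusing the quantum_reg and MAX_UNSIGNED definitions above and compiling with -O3 -mavx2 -fopenmp:

```
/* Sketch only: the same Toffoli loop with the parallelization and
   vectorization requested explicitly through OpenMP.  Assumes the
   quantum_reg struct and MAX_UNSIGNED macro defined above. */
void quantum_toffoli_omp(int control1, int control2, int target,
                         quantum_reg *reg) {
  MAX_UNSIGNED m1 = (MAX_UNSIGNED)1 << control1;
  MAX_UNSIGNED m2 = (MAX_UNSIGNED)1 << control2;
  MAX_UNSIGNED mt = (MAX_UNSIGNED)1 << target;
  #pragma omp parallel for simd
  for (int i = 0; i < reg->size; i++)
    if ((reg->node->state[i] & m1) && (reg->node->state[i] & m2))
      reg->node->state[i] ^= mt;   /* each iteration is independent */
}
```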

Thanks so much.


Re: How to make parallelizing loops and vectorization work at the same time?

2023-09-15 Thread Hanke Zhang via Gcc
Richard Biener wrote on Fri, Sep 15, 2023 at 19:59:

>
> On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc  wrote:
> >
> > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > -ftree-parallelize-loops.
> >
> > Here are my test results using the different options (based on
> > gcc10.3.0 on i9-12900KF):
> > gcc-10 test.c -O3 -flto
> > > time: 29000 ms
> > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > time: 17000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > time: 5000 ms
> > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> > > time: 5000 ms
> >
>
> First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what 
> brings
> the first gain.  So adding -ftree-vectorize to the last command-line is not
> expected to change anything.  Instead you can use -fno-tree-vectorize on
> the second last one.  Doing that I get 111s vs 41s thus doing both helps.
>
> Note parallelization hasn't seen any development in the last years.
>
> Richard.

Hi Richard:

Thank you for your sincere reply.

I get what you mean above. But I still see the following after I add
`-fopt-info-vec`:

gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> test.c:29:5: optimized: loop vectorized using 32 byte vectors
gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> nothing happened

That means vectorization doesn't actually help here.

At the same time, I added `-fno-tree-vectorize` to the second-to-last
command. It did not bring any performance change on my computer.

So I still think only parallel loops work.

Hanke Zhang

>
> > I found that these two options do not work at the same time, that is,
> > if I use the `-ftree-vectorize` option alone, it can bring a big
> > efficiency gain compared to doing nothing; At the same time, if I use
> > the option of `-ftree-parallelize-loops` alone, it will also bring a
> > big efficiency gain. But if I use both options, vectorization fails,
> > that is, I can't get the benefits of vectorization, I can only get the
> > benefits of parallelizing loops.
> >
> > I know that the reason may be that after parallelizing the loop,
> > vectorization cannot be performed, but is there any way I can reap the
> > benefits of both optimizations?
> >
> > Here is my example program, adapted from the 462.libquantum in speccpu2006:
> >
> > ```
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <complex.h>
> >
> > #define MAX_UNSIGNED unsigned long long
> >
> > struct quantum_reg_node_struct {
> > float _Complex *amplitude; /* alpha_j */
> > MAX_UNSIGNED *state;   /* j */
> > };
> >
> > typedef struct quantum_reg_node_struct quantum_reg_node;
> >
> > struct quantum_reg_struct {
> > int width; /* number of qubits in the qureg */
> > int size;  /* number of non-zero vectors */
> > int hashw; /* width of the hash array */
> > quantum_reg_node *node;
> > int *hash;
> > };
> >
> > typedef struct quantum_reg_struct quantum_reg;
> >
> > void quantum_toffoli(int control1, int control2, int target, quantum_reg 
> > *reg) {
> > for (int i = 0; i < reg->size; i++) {
> > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2))  {
> > reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
> > }
> > }
> > }
> > }
> >
> > int get_random() {
> > return rand() % 64;
> > }
> >
> > void init(quantum_reg *reg) {
> > reg->size = 2097152;
> > for (int i = 0; i < reg->size; i++)  {
> > reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
> > reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED)
> > * reg->size);
> > reg->node->amplitude = (float _Complex *)malloc(sizeof(float
> > _Complex) * reg->size);
> > if (i >= 1) break;
> > }
> > for (int i = 0; i < reg->size; i++)  {
> > reg->node->amplitude[i] = 0;
> > reg->node->state[i] = 0;
> > }
> > }
> >
> > int main() {
> > quantum_reg reg;
> > init(&reg);
> > for (int i = 0; i < 65000; i++) {
> > quantum_toffoli(get_random(), get_random(), get_random(), &reg);
> > }
> > }
> > ```
> >
> > Thanks so much.


Re: How to make parallelizing loops and vectorization work at the same time?

2023-09-15 Thread Hanke Zhang via Gcc
I get it. It's an LTO problem. If I remove `-flto`, both work.

Thanks for your help again!

Richard Biener wrote on Fri, Sep 15, 2023 at 21:13:
>
> On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang  wrote:
> >
> > Richard Biener wrote on Fri, Sep 15, 2023 at 19:59:
> >
> > >
> > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc  
> > > wrote:
> > > >
> > > > Hi I'm trying to accelerate my program with -ftree-vectorize and
> > > > -ftree-parallelize-loops.
> > > >
> > > > Here are my test results using the different options (based on
> > > > gcc10.3.0 on i9-12900KF):
> > > > gcc-10 test.c -O3 -flto
> > > > > time: 29000 ms
> > > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> > > > > time: 17000 ms
> > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> > > > > time: 5000 ms
> > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 
> > > > -ftree-vectorize
> > > > > time: 5000 ms
> > > >
> > >
> > > First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what 
> > > brings
> > > the first gain.  So adding -ftree-vectorize to the last command-line is 
> > > not
> > > expected to change anything.  Instead you can use -fno-tree-vectorize on
> > > the second last one.  Doing that I get 111s vs 41s thus doing both helps.
> > >
> > > Note parallelization hasn't seen any development in the last years.
> > >
> > > Richard.
> >
> > Hi Richard:
> >
> > Thank you for your sincere reply.
> >
> > I get what you mean above. But I still see the following after I add
> > `-fipo-info-vec`:
> >
> > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec
> > > test.c:29:5: optimized: loop vectorized using 32 byte vectors
> > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24
> > > nothing happened
> >
> > That means the vectorization does nothing help actually.
> >
> > At the same time, I added `-fno-tree-vectorize` to the second last one
> > command. It did not bring about a performance change on my computer.
> >
> > So I still think only parallel loops work.
>
> I checked GCC 13 and do see vectorized loops when parallelizing.
>
> Richard.
>
> > Hanke Zhang
> >
> > >
> > > > I found that these two options do not work at the same time, that is,
> > > > if I use the `-ftree-vectorize` option alone, it can bring a big
> > > > efficiency gain compared to doing nothing; At the same time, if I use
> > > > the option of `-ftree-parallelize-loops` alone, it will also bring a
> > > > big efficiency gain. But if I use both options, vectorization fails,
> > > > that is, I can't get the benefits of vectorization, I can only get the
> > > > benefits of parallelizing loops.
> > > >
> > > > I know that the reason may be that after parallelizing the loop,
> > > > vectorization cannot be performed, but is there any way I can reap the
> > > > benefits of both optimizations?
> > > >
> > > > Here is my example program, adapted from the 462.libquantum in 
> > > > speccpu2006:
> > > >
> > > > ```
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <complex.h>
> > > >
> > > > #define MAX_UNSIGNED unsigned long long
> > > >
> > > > struct quantum_reg_node_struct {
> > > > float _Complex *amplitude; /* alpha_j */
> > > > MAX_UNSIGNED *state;   /* j */
> > > > };
> > > >
> > > > typedef struct quantum_reg_node_struct quantum_reg_node;
> > > >
> > > > struct quantum_reg_struct {
> > > > int width; /* number of qubits in the qureg */
> > > > int size;  /* number of non-zero vectors */
> > > > int hashw; /* width of the hash array */
> > > > quantum_reg_node *node;
> > > > int *hash;
> > > > };
> > > >
> > > > typedef struct quantum_reg_struct quantum_reg;
> > > >
> > > > void quantum_toffoli(int control1, int control2, int target, 
> > > > quantum_reg *reg) {
> > > > for (int i = 0; i < reg->size; i++) {
> > > > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
> > > > if (reg->node->state[i] & ((

How to enable DCE during late_ipa_passes?

2023-09-18 Thread Hanke Zhang via Gcc
Hi, I am currently developing a new pass in the late_ipa_passes phase,
located after pass_ipa_pta and before pass_omp_simd_clone. I hope that
after my pass runs, the code can still be vectorized, that is,
vectorized during pass_omp_simd_clone.

But that hasn't happened yet. I'm guessing it's because I added some
dead code during my pass, so I want a DCE pass to run right after it,
hoping that this will allow pass_omp_simd_clone to vectorize the code
afterwards.

But I've found that adding NEXT_PASS(pass_dce) directly after my pass
results in a runtime error, so I would like to ask if there is any
good way to get pass_dce to work in the late_ipa_passes phase?

Hanke Zhang.


Re: How to enable DCE during late_ipa_passes?

2023-09-18 Thread Hanke Zhang via Gcc
Thanks! That works!

Richard Biener wrote on Mon, Sep 18, 2023 at 15:56:
>
> On Mon, Sep 18, 2023 at 9:17 AM Hanke Zhang via Gcc  wrote:
> >
> > Hi, I am currently developing a new pass in the late_ipa_passes phase,
> > located behind the pass_ipa_pta and in front of the
> > pass_omp_simd_clone. I now hope that after the pass I developed, the
> > code can be vectorized, that is, the code that can be vectorized
> > during the pass_omp_simd_clone.
> >
> > But not yet, I'm guessing that maybe it's because I added some dead
> > code during the pass I developed, so I want to be followed by a DCE
> > pass right after my pass, hoping that this will allow
> > pass_omp_simd_clone to vectorize the code after my pass.
> >
> > But I've found that adding NEXT_PASS(pass_dce) directly after my pass
> > results in a runtime error, so I would like to ask if there is any
> > good way to get pass_dce to work in the late_ipa_passes phase?
>
> You could try doing
>
>   NEXT_PASS (pass_your_ipa_pass)
>   PUSH_INSERT_PASSES_WITHIN (pass_your_ipa_pass)
> NEXT_PASS (pass_dce);
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_omp_simd_clone)
>
>
> > Hanke Zhang.


Loop fusion in gcc

2023-09-23 Thread Hanke Zhang via Gcc
Hi, I have been very interested in loop fusion recently. I found that
both LLVM and icc have implemented this optimization. I also noticed
that gcc does not seem to implement it.

I would like to ask whether gcc has any plans to implement this
optimization? In addition, I also found that there is a function
called `fuse_loop` in gcc/gimple-loop-jam.cc. Can it achieve the
purpose of loop fusion?
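
For clarity, a small hand-written illustration of what loop fusion would do here (my own sketch, not GCC output):

```
/* Loop fusion, illustrated on two loops over the same range. */
void fused(int n, int *a, const int *b, int *c) {
  /* Before fusion:
       for (int i = 0; i < n; i++) a[i] = b[i] + 1;
       for (int i = 0; i < n; i++) c[i] = a[i] * 2;
     After fusion, a single pass over the data: */
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
  }
}
```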

Hanke Zhang.


The order of loop traversal in gcc

2023-09-24 Thread Hanke Zhang via Gcc
Hi, I have recently been working on loops in gcc, and I have some
questions about the loop traversal.

I use loops_list(cfun, LI_ONLY_INNERMOST) in my pass to traverse and
obtain the innermost loops.

I found that the traversal order can differ from the order in which
the loops actually appear in program execution order.

Sometimes it's the exact opposite, and sometimes it's sequential. I
would like to ask how to predict its behavior? And what method should
I use if I want to obtain sequential traversal?

Thanks.
Hanke Zhang.


Re: The order of loop traversal in gcc

2023-09-25 Thread Hanke Zhang via Gcc
Richard Biener wrote on Mon, Sep 25, 2023 at 13:46:
>
>
>
> > On 25.09.2023 at 04:53, Hanke Zhang via Gcc wrote:
> >
> > Hi, I have recently been working on loops in gcc, and I have some
> > questions about the loop traversal.
> >
> > I use loops_list(cfun, LI_ONLY_INNERMOST) to traverse the loops in my
> > pass to obtain the loop.
> >
> > I found that the order of traversal and the order of actual
> > instruction execution will be different.
> >
> > Sometimes it's the exact opposite, and sometimes it's sequential. I
> > would like to ask how to predict its behavior? And what method should
> > I use if I want to obtain sequential traversal?
>
> The order of loops is according to their index, if you require traversal in 
> program order you have to order them yourself, for example by sorting them 
> after their header program order.
>
> Richard
>

Thanks. Got it!

> > Thanks.
> > Hanke Zhang.


Question about merging if-else blocks

2023-09-26 Thread Hanke Zhang via Gcc
Hi, I have recently been working on merging if-else statement blocks,
and I found a rather bizarre phenomenon that I would like to ask
about.
A rough explanation: for two consecutive if-else blocks whose
conditions are exactly the same, the blocks should be merged, as in
the following program:

int a = atoi(argv[1]);
if (a) {
  printf("if 1");
} else {
  printf("else 1");
}
if (a) {
  printf("if 2");
} else {
  printf("else 2");
}

After using the -O3 -flto optimization option, it can be optimized as follows:

int a = atoi(argv[1]);
if (a) {
  printf("if 1");
  printf("if 2");
} else {
  printf("else 1");
  printf("else 2");
}

But `a` here is a local variable. If I declare `a` as a global
variable, it is not optimized as above. I would like to ask why that
is, and whether there is any solution.
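
A hedged workaround sketch, assuming the programmer knows the intervening calls really do not modify the global: read it into a local once, so the two conditions become provably identical to the compiler:

```
#include <stdio.h>

int a_global;                 /* hypothetical global standing in for 'a' */

void g(void) {
  int a = a_global;           /* single load into a local */
  if (a) printf("if 1"); else printf("else 1");
  if (a) printf("if 2"); else printf("else 2");  /* same local -> mergeable */
}
```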

Thanks.
Hanke Zhang.


Re: Question about merging if-else blocks

2023-09-26 Thread Hanke Zhang via Gcc
Thanks! I understand what you mean. Can I then assume that if the
function here is not an external function but one visible to the
compiler, and the function doesn't modify `a`, these two blocks can be
merged?

Marc Glisse wrote on Wed, Sep 27, 2023 at 12:51:
>
> On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote:
>
> > Hi, I have recently been working on merging if-else statement blocks,
> > and I found a rather bizarre phenomenon that I would like to ask
> > about.
> > A rough explanation is that for two consecutive if-else blocks, if
> > their if statements are exactly the same, they should be merged, like
> > the following program:
> >
> > int a = atoi(argv[1]);
> > if (a) {
> >  printf("if 1");
> > } else {
> >  printf("else 1");
> > }
> > if (a) {
> >  printf("if 2");
> > } else {
> >  printf("else 2");
> > }
> >
> > After using the -O3 -flto optimization option, it can be optimized as 
> > follows:
> >
> > int a = atoi(argv[1]);
> > if (a) {
> >  printf("if 1");
> >  printf("if 2");
> > } else {
> >  printf("else 1");
> >  printf("else 2");
> > }
> >
> > But `a` here is a local variable. If I declare a as a global variable,
> > it cannot be optimized as above. I would like to ask why this is? And
> > is there any solution?
>
> If 'a' is a global variable, how do you know 'printf' doesn't modify its
> value? (you could know it for printf, but it really depends on the
> function that is called)
>
> --
> Marc Glisse


Check whether the global variable has been modified in a function or some blocks

2023-09-29 Thread Hanke Zhang via Gcc
Hi, I have recently been working on issues related to changing values
of global variables. That is, I am trying to develop a gimple pass
which needs to check whether the value of a global variable is
modified in a function or in some blocks.

Some of the more tricky cases are as follows:

int type; // global variable
void foo() {
  int tmp;
  tmp = type; // store type in tmp
  type = 0;
  // do other things
  // type will be used or changed
  // ...
  type = tmp;
  return;
}

Obviously, this function does not modify the value of type overall,
but from the compiler's point of view that is not so obvious. I'd like
to ask whether there is any good way to handle this, or whether gcc
already provides some ready-made implementation I can use.


Re: Question about merging if-else blocks

2023-09-30 Thread Hanke Zhang via Gcc
Richard Biener wrote on Wed, Sep 27, 2023 at 15:30:
>
> On Wed, Sep 27, 2023 at 7:21 AM Hanke Zhang via Gcc  wrote:
> >
> > Thanks! I understand what you mean, then can I think that if the
> > function here is not an external function, but a function visible to
> > the compiler and the function doesn't modify `a`, then these two
> > blocks can be merged?
>
> Yes.  The key transform you'd see before any of the merging is
> CSE of the loads from 'a', then the rest is equivalent to the local
> variable case.
>
> Richard.

Hi, Richard

I'm still a little confused about this.

I want to change the default behavior of gcc. We know that printf
won't change the value of 'a'. I'd like to let the compiler get
this information as well. How can I do that? Or which pass should I
focus on?

By disassembling the executable generated by icc, I found that icc
does merge these two blocks for the example code below. So I think
there may be some way to make gcc do it as well.

Thanks.
Hanke Zhang.

>
> > Marc Glisse wrote on Wed, Sep 27, 2023 at 12:51:
> > >
> > > On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote:
> > >
> > > > Hi, I have recently been working on merging if-else statement blocks,
> > > > and I found a rather bizarre phenomenon that I would like to ask
> > > > about.
> > > > A rough explanation is that for two consecutive if-else blocks, if
> > > > their if statements are exactly the same, they should be merged, like
> > > > the following program:
> > > >
> > > > int a = atoi(argv[1]);
> > > > if (a) {
> > > >  printf("if 1");
> > > > } else {
> > > >  printf("else 1");
> > > > }
> > > > if (a) {
> > > >  printf("if 2");
> > > > } else {
> > > >  printf("else 2");
> > > > }
> > > >
> > > > After using the -O3 -flto optimization option, it can be optimized as 
> > > > follows:
> > > >
> > > > int a = atoi(argv[1]);
> > > > if (a) {
> > > >  printf("if 1");
> > > >  printf("if 2");
> > > > } else {
> > > >  printf("else 1");
> > > >  printf("else 2");
> > > > }
> > > >
> > > > But `a` here is a local variable. If I declare a as a global variable,
> > > > it cannot be optimized as above. I would like to ask why this is? And
> > > > is there any solution?
> > >
> > > If 'a' is a global variable, how do you know 'printf' doesn't modify its
> > > value? (you could know it for printf, but it really depends on the
> > > function that is called)
> > >
> > > --
> > > Marc Glisse


Question about function splitting

2023-10-02 Thread Hanke Zhang via Gcc
Hi, I have some questions about the strategy and behavior of function
splitting in gcc, like the following code:

int glob;
void f() {
  if (glob) {
printf("short path\n");
return;
  }
  // do lots of expensive things
  // ...
}

I hope it can be broken down like below, so that the whole function
can perhaps be inlined, which is more efficient.

int glob;
void f() {
  if (glob) {
printf("short path\n");
return;
  }
  f_part();
}

void f_part() {
  // do lots of expensive things
  // ...
}


But instead, gcc splits it as shown below, which not only brings no
benefit but may even increase the run time, because the function call
itself is relatively expensive.

int glob;
void f() {
  if (glob) {
f_part();
return;
  }
  // do lots of expensive things
  // ...
}

void f_part() {
  printf("short path\n"); // just do this
}

Are there any options I can offer to gcc to change this behavior? Or
do I need to make some changes in ipa-split.cc?
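
A hedged sketch of a manual workaround (hypothetical names, assuming the split point is known in advance): hand-split the function so the cheap early-return check stays small enough to be inlined at every call site, while the expensive tail stays out of line:

```
#include <stdio.h>

int glob;

static void f_expensive_part(void) {
  /* do lots of expensive things */
}

/* Cheap wrapper: small enough that callers can inline just the check. */
static inline void f(void) {
  if (glob) {
    printf("short path\n");
    return;
  }
  f_expensive_part();
}
```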


Re: Question about function splitting

2023-10-02 Thread Hanke Zhang via Gcc
Martin Jambor wrote on Tue, Oct 3, 2023 at 00:34:
>
> Hello,
>
> On Mon, Oct 02 2023, Hanke Zhang via Gcc wrote:
> > Hi, I have some questions about the strategy and behavior of function
> > splitting in gcc, like the following code:
> >
> > int glob;
> > void f() {
> >   if (glob) {
> > printf("short path\n");
> > return;
> >   }
> >   // do lots of expensive things
> >   // ...
> > }
> >
> > I hope it can be broken down like below, so that the whole function
> > can perhaps be inlined, which is more efficient.
> >
> > int glob;
> > void f() {
> >   if (glob) {
> > printf("short path\n");
> > return;
> >   }
> >   f_part();
> > }
> >
> > void f_part() {
> >   // do lots of expensive things
> >   // ...
> > }
> >
> >
> > But on the contrary, gcc splits it like these, which not only does not
> > bring any benefits, but may increase the time consumption, because the
> > function call itself is a more resource-intensive thing.
> >
> > int glob;
> > void f() {
> >   if (glob) {
> > f_part();
> > return;
> >   }
> >   // do lots of expensive things
> >   // ...
> > }
> >
> > void f_part() {
> >   printf("short path\n"); // just do this
> > }
> >
> > Are there any options I can offer to gcc to change this behavior? Or
> > do I need to make some changes in ipa-split.cc?
>
> I'd suggest you file a bug to Bugzilla with a specific example that is
> mis-handled, then we can have a look and discuss what and why happens
> and what can be done about it.
>
> Thanks,
>
> Martin

Hi, thanks for your reply.

I'm trying to create an account right now. And I put a copy of the
example code here in case someone is interested.

And I'm using gcc 12.3.0. When you compile the code below via 'gcc
test.c -O3 -flto -fdump-tree-fnsplit', you will see, in the gimple
dumped from fnsplit, exactly the behavior I described above.

#include <stdio.h>
#include <stdlib.h>

int opstatus;
unsigned char *objcode = 0;
unsigned long position = 0;
char *globalfile;

int test_split_write(char *file) {
  FILE *fhd;

  if (!opstatus) {
// short path here
printf("Object code generation not active! Forgot to call "
   "quantum_objcode_start?\n");
return 1;
  }

  if (!file)
file = globalfile;

  fhd = fopen(file, "w");

  if (fhd == 0)
return -1;

  fwrite(objcode, position, 1, fhd);

  fclose(fhd);

  int *arr = malloc(1000);
  for (int i = 0; i < 1000; i++) {
arr[i] = rand();
  }

  return 0;
}

// to avoid `test_split_write` inlining into main
void __attribute__((noinline)) call() { test_split_write("./txt"); }

int main() {
  opstatus = rand();
  objcode = malloc(100);
  position = 0;
  call();
  return 0;
}


Function return value can't be infered when it's not inlined

2023-10-03 Thread Hanke Zhang via Gcc
Hi, I'm a little confused about the behavior of gcc when the function
is not inlined.

Here is an example code:

#include <stdio.h>

int __attribute__((noinline)) foo() {
return 1;
}

int main() {
if (foo()) {
printf("foo() returned 1\n");
} else {
printf("foo() returned 0\n");
}
return 0;
}

After compiling this via `-O3 -flto`, the else block isn't optimized
away and still exists.

Even though it's so obvious that the function will return '1', can't
the compiler see that? Does gcc only get this information by inlining
the function? Or is that simply how gcc works?

If so, what would I need to change to let gcc get this information?

Thanks
Hanke Zhang


Re: Function return value can't be infered when it's not inlined

2023-10-04 Thread Hanke Zhang via Gcc
Richard Biener wrote on Wed, Oct 4, 2023 at 16:43:
>
> On Wed, Oct 4, 2023 at 10:37 AM Richard Biener
>  wrote:
> >
> > On Tue, Oct 3, 2023 at 6:30 PM Hanke Zhang via Gcc  wrote:
> > >
> > > Hi, I'm a little confused about the behavior of gcc when the function
> > > is not inlined.
> > >
> > > Here is an example code:
> > >
> > > int __attribute__((noinline)) foo() {
> > > return 1;
> > > }
> > >
> > > int main() {
> > > if (foo()) {
> > > printf("foo() returned 1\n");
> > > } else {
> > > printf("foo() returned 0\n");
> > > }
> > > return 0;
> > > }
> > >
> > > After compiling this via `-O3 -flto`, the else block isn't been
> > > optimized and still exists.
> > >
> > > Even it's so obvious that the function will return '1', can't the
> > > compiler see that? Does gcc only get this information by inlining the
> > > function? Or is that what the gcc does?
> > >
> > > If so, how to make a change to let gcc get this information then?
> >
> > I think IPA-CP would be doing this but the issue is that historically
> > 'noinline' also disabled other IPA transforms and we've kept that
> > for backward compatibility even when we introduced the separate 'noipa'
> > attribute.

Thanks. My initial example is a function that takes variable
arguments. It cannot be inlined because of the va_arg handling, and
then its return value cannot be predicted as above. For example, the
following function:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

int foo (int num, ...) {
va_list args;
va_start(args, num);
int a1 = va_arg(args, int);
int a2 = va_arg(args, int);
printf("a1 = %d, a2 = %d\n", a1, a2);
va_end(args);
return 1;
}

int main() {
if (foo(2, rand(), rand())) {
printf("foo() returned 1\n");
} else {
printf("foo() returned 0\n");
}
return 0;
}

Wouldn't such a function be treated like a 'noinline' one and miss
this optimization?

>
> Oh, and I forgot that IIRC neither IPA CP nor IPA SRA handle return
> functions in a way exposing this to optimization passes (there's no
> way to encode this in fnspec, we'd need some return value value-range
> and record that and make VRP/ranger query it on calls).
>
> Richard.
>

Thanks. So, does that mean I have to let VRP/ranger be able to query
the return value so that the else block can be optimized out?

> > Richard.
> >
> > >
> > > Thanks
> > > Hanke Zhang


Re: Question about function splitting

2023-10-04 Thread Hanke Zhang via Gcc
But when I change the code 'opstatus = rand()' to 'opstatus = rand()
%2', the probability of opstatus being 0 should be 50%, but the result
remains the same, i.e. still split at that point.

And the specific information can be found in Bugzilla, the link is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672

Richard Biener wrote on Wed, Oct 4, 2023 at 16:20:
>
> On Mon, Oct 2, 2023 at 7:15 PM Hanke Zhang via Gcc  wrote:
> >
> > Martin Jambor  于2023年10月3日周二 00:34写道:
> > >
> > > Hello,
> > >
> > > On Mon, Oct 02 2023, Hanke Zhang via Gcc wrote:
> > > > Hi, I have some questions about the strategy and behavior of function
> > > > splitting in gcc, like the following code:
> > > >
> > > > int glob;
> > > > void f() {
> > > >   if (glob) {
> > > > printf("short path\n");
> > > > return;
> > > >   }
> > > >   // do lots of expensive things
> > > >   // ...
> > > > }
> > > >
> > > > I hope it can be broken down like below, so that the whole function
> > > > can perhaps be inlined, which is more efficient.
> > > >
> > > > int glob;
> > > > void f() {
> > > >   if (glob) {
> > > > printf("short path\n");
> > > > return;
> > > >   }
> > > >   f_part();
> > > > }
> > > >
> > > > void f_part() {
> > > >   // do lots of expensive things
> > > >   // ...
> > > > }
> > > >
> > > >
> > > > But on the contrary, gcc splits it like these, which not only does not
> > > > bring any benefits, but may increase the time consumption, because the
> > > > function call itself is a more resource-intensive thing.
> > > >
> > > > int glob;
> > > > void f() {
> > > >   if (glob) {
> > > > f_part();
> > > > return;
> > > >   }
> > > >   // do lots of expensive things
> > > >   // ...
> > > > }
> > > >
> > > > void f_part() {
> > > >   printf("short path\n"); // just do this
> > > > }
> > > >
> > > > Are there any options I can offer to gcc to change this behavior? Or
> > > > do I need to make some changes in ipa-split.cc?
> > >
> > > I'd suggest you file a bug to Bugzilla with a specific example that is
> > > mis-handled, then we can have a look and discuss what and why happens
> > > and what can be done about it.
> > >
> > > Thanks,
> > >
> > > Martin
> >
> > Hi, thanks for your reply.
> >
> > I'm trying to create an account right now. And I put a copy of the
> > example code here in case someone is interested.
> >
> > And I'm using gcc 12.3.0. When you complie the code below via 'gcc
> > test.c -O3 -flto -fdump-tree-fnsplit', you will find a phenomenon that
> > is consistent with what I described above in the gimple which is
> > dumped from fnsplit.
>
> I think fnsplit currently splits out _cold_ code, I suppose !opstatus
> is predicted to be false most of the time.
>
> It looks like your intent is to inline this very early check as
>
>   if (!opstatus) { test_split_write_1 (..); } else { test_split_write_2 (..); 
> }
>
> to possibly elide that test?  I would guess that IPA-CP is supposed to
> do this but eventually refuses to create a clone for this case since
> it would be large.
>
> Unfortunately function splitting doesn't run during IPA transforms,
> but maybe IPA-CP can be teached how to avoid the expensive clone
> by performing what IPA split does in the case a check in the entry
> block which splits control flow can be optimized?
>
> Richard.
>
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > int opstatus;
> > unsigned char *objcode = 0;
> > unsigned long position = 0;
> > char *globalfile;
> >
> > int test_split_write(char *file) {
> >   FILE *fhd;
> >
> >   if (!opstatus) {
> > // short path here
> > printf("Object code generation not active! Forgot to call "
> >"quantum_objcode_start?\n");
> > return 1;
> >   }
> >
> >   if (!file)
> > file = globalfile;
> >
> >   fhd = fopen(file, "w");
> >
> >   if (fhd == 0)
> > return -1;
> >
> >   fwrite(objcode, position, 1, fhd);
> >
> >   fclose(fhd);
> >
> >   int *arr = malloc(1000);
> >   for (int i = 0; i < 1000; i++) {
> > arr[i] = rand();
> >   }
> >
> >   return 0;
> > }
> >
> > // to avoid `test_split_write` inlining into main
> > void __attribute__((noinline)) call() { test_split_write("./txt"); }
> >
> > int main() {
> >   opstatus = rand();
> >   objcode = malloc(100);
> >   position = 0;
> >   call();
> >   return 0;
> > }


why not optimize static local variables

2023-10-07 Thread Hanke Zhang via Gcc
Hi, I've recently been working on static local variables in C. I would
like to ask some questions about them.

For example, for the following program,

void foo() {
  static int x = 0;
  x++;
}

int main() {
  foo();
}

After optimization with the -O3 -flto option, the entire program will
look something like this:

int main() {
  x_foo++;
}

The question I want to ask is: why not optimize the 'x_foo++' line
out entirely? Its scope is limited to foo, and it is never read
anywhere in the program. Is it because it is stored in the global data
area? Or are there other safety concerns?

Thanks
Hanke Zhang


the elimination of if blocks in GCC during if-conversion and vectorization

2023-10-12 Thread Hanke Zhang via Gcc
Hi, I'm currently working on vectorization in GCC. I'm stuck on a
small problem and would like to ask for advice.

For example, for the following code:

int main() {
  int size = 1000;
  int *foo = malloc(sizeof(int) * size);
  int c1 = rand(), t1 = rand();

  for (int i = 0; i < size; i++) {
if (foo[i] & c1) {
  foo[i] = t1;
}
  }

  // prevents the loop above from being optimized
  for (int i = 0; i < size; i++) {
printf("%d", foo[i]);
  }
}

First of all, the if block in the loop is converted to a MASK_STORE
by the if-conversion pass. But after the tree vectorizer runs, it
still ends up in a branched form. Part of the final disassembly looks
roughly like the code below (recovered with IDA), and you can see that
there is still a branch 'if ( !_ZF )' in it, which leads to low
efficiency.

do
  {
while ( 1 )
{
  __asm
  {
vpand   ymm0, ymm2, ymmword ptr [rax]
vpcmpeqd ymm0, ymm0, ymm1
vpcmpeqd ymm0, ymm0, ymm1
vptest  ymm0, ymm0
  }
  if ( !_ZF )
break;
  _RAX += 8;
  if ( _RAX == v9 )
goto LABEL_5;
}
__asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
_RAX += 8;
  }
  while ( _RAX != v9 );

Why can't we just replace the vptest and if statement with some other
instructions like vpblendvb so that it can be faster? Or is there a
good way to do that?
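
A hedged source-level variant (my sketch, relying on the array being fully writable, as it is here after malloc): writing the store unconditionally as a select usually lets the vectorizer emit a blend instead of a masked store, so no per-iteration mask test is needed:

```
/* Sketch only: unconditional select form of the loop above. */
void apply_mask(int *foo, int size, int c1, int t1) {
  for (int i = 0; i < size; i++)
    foo[i] = (foo[i] & c1) ? t1 : foo[i];  /* always stores -> blend, no branch */
}
```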

Thanks
Hanke Zhang


Question about gimple code during optimizing if-conversion

2023-10-13 Thread Hanke Zhang via Gcc
Hi, I've recently been working on improving if-conversion for my own
purposes. I ran into a problem.

I'm trying to handle the case where a conditional block has only an if
statement and no else statement; the source C code looks like this:

int* foo; // assume this has been initialized
int c = rand(), t = rand(), size = 1000;
for (int i = 0; i < size; i++) {
  if (foo[i] & (1 << c)) foo[i] ^= (1 << t);
}

Part of its corresponding gimple is optimized like this before if-conversion:

  :
  # i_71 = PHI 
  # ivtmp_9 = PHI 
  _5 = (long unsigned int) i_71;
  _6 = _5 * 4;
  _7 = foo_23 + _6;
  _8 = *_7;
  shifttmp_75 = _8 & shifttmp_76;
  if (shifttmp_75 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 531502205]:
  goto ; [100.00%]

   [local count: 531502204]:
  _12 = _8 ^ _11;
  *_7 = _12;

   [local count: 1063004409]:
  i_39 = i_71 + 1;
  ivtmp_73 = ivtmp_9 - 1;
  if (ivtmp_73 != 0)
goto ; [99.00%]
  else
goto ; [1.00%]

I want to add some statements to gimple to make it like adding an else
block to the source code.

// What I expected:
int* foo; // assume this has been initialized
int c = rand(), t = rand(), size = 1000;
for (int i = 0; i < size; i++) {
  if (foo[i] & (1 << c)) foo[i] ^= (1 << t);
+  else foo[i] = foo[i];  // I want to add a statment here !
}

And of course I can't change the source code for real, so I can only
add a pass in front of if-conversion to modify the gimple.

For the example above, I know that I have to add them in the block
'', but what confuses me is that, due to my limited experience, I
don't know what kind of statement would be legal to add.

I tried adding something like the lines below, but it caused a compile
error. So I'm here for help. What kind of statements should I add
here?

 [local count: 531502205]:
+ *_7 = *_7
 goto ; [100.00%]

Finally, the reason I'm doing this is to avoid MASK_STORE generation,
because it may add an if branch in the final assembly, which I don't
want. After such a modification, if-conversion should turn the
statement into a ternary expression, which would reduce the number of
branches after final vectorization and produce more efficient code.

Or, if there is a better way to get rid of the MASK_STORE, please tell
me about that. :)

Thanks
Hanke Zhang


Check whether a function is a pure function

2023-10-17 Thread Hanke Zhang via Gcc
Hi, I'm trying to write a pass to erase some useless functions, or to
put it another way, to detect whether a function is pure or not. (Of
course I know some passes can do this cleanup after inlining.)

Here is the problem I got, If a function satisfy the following points,
can it be considered safe to delete?

1. return value is ignored or has no return
2. params are pure 32-bit integer constants, or there are no params
3. No use of global variable
4. No function call, assembly code, goto statement
5. No cast

And in my pass, I try to do these things through these points:

1. check GIMPLE_RETURN statement and check all the function call points
2. check DECL_PARAM(fndecl) and check the input params at all the
function call points
3. check that fndecl is not in the global_var->referring list
4. check GIMPLE_CALL, GIMPLE_ASM, GIMPLE_GOTO are not present
5. check that there is no difference between TREE_TYPE(lhs) and
TREE_TYPE(rhs1) types in the GIMPLE_ASSIGN statement

I would like to ask whether there are any omissions or errors in the
prerequisites and the corresponding implementation plan listed above.
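
To make the criteria concrete, a small illustration (my own example):

```
#include <stdlib.h>

/* Meets the criteria above: result ignored at the call site, constant
   args, no globals, no calls/asm/goto, no casts -> a call such as
   dead_candidate(3) whose result is unused would be removable. */
static int dead_candidate(int n) {
  int t = n * 2 + 1;
  return t;
}

/* Violates point 4 (it contains a function call, rand()), so it must stay. */
static int not_removable(int n) {
  return n + rand();
}
```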

Thanks
Hanke Zhang


Re: the elimination of if blocks in GCC during if-conversion and vectorization

2023-10-17 Thread Hanke Zhang via Gcc
Richard Biener wrote on Tue, Oct 17, 2023 at 17:26:
>
> On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc  wrote:
> >
> > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > problem and would like to ask for advice.
> >
> > For example, for the following code:
> >
> > int main() {
> >   int size = 1000;
> >   int *foo = malloc(sizeof(int) * size);
> >   int c1 = rand(), t1 = rand();
> >
> >   for (int i = 0; i < size; i++) {
> > if (foo[i] & c1) {
> >   foo[i] = t1;
> > }
> >   }
> >
> >   // prevents the loop above from being optimized
> >   for (int i = 0; i < size; i++) {
> > printf("%d", foo[i]);
> >   }
> > }
> >
> > First of all, the if statement block in the loop will be converted to
> > a MASK_STORE through if-conversion optimization. But after
> > tree-vector, it will still become a branched form. The part of the
> > final disassembly structure probably looks like below(Using IDA to do
> > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > in it, which will lead to low efficiency.
> >
> > do
> >   {
> > while ( 1 )
> > {
> >   __asm
> >   {
> > vpand   ymm0, ymm2, ymmword ptr [rax]
> > vpcmpeqd ymm0, ymm0, ymm1
> > vpcmpeqd ymm0, ymm0, ymm1
> > vptest  ymm0, ymm0
> >   }
> >   if ( !_ZF )
> > break;
> >   _RAX += 8;
> >   if ( _RAX == v9 )
> > goto LABEL_5;
> > }
> > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > _RAX += 8;
> >   }
> >   while ( _RAX != v9 );
> >
> > Why can't we just replace the vptest and if statement with some other
> > instructions like vpblendvb so that it can be faster? Or is there a
> > good way to do that?
>
> The branch is added by optimize_mask_stores after vectorization because
> fully masked (disabled) masked stores can incur a quite heavy penalty on
> some architectures when fault assists (read-only pages, but also COW pages)
> are ran into.  All the microcode handling needs to possibly be carried out
> multiple times, for each such access to the same page.  That can cause
> a 1000x slowdown when you hit this case.  Thus every masked store
> is replaced by
>
>  if (mask != 0)
>masked_store ();
>
> and this is an optimization (which itself has a small cost).
>
> Richard.

Yeah, I know that, and I have seen the code of optimize_mask_stores().
The main problem here is that when multiple MASK_STOREs appear in the
same loop, many branches appear, reducing overall efficiency.

My original idea was to replace the MASK_STORE with more effective
SIMD instructions, since icc does much better in this case. But I gave
that up, because gcc's vectorization analysis is not as strong as
icc's here, and I don't have the ability to modify this part of the
code myself.

Thanks very much for your reply.

>
> >
> > Thanks
> > Hanke Zhang


Re: the elimination of if blocks in GCC during if-conversion and vectorization

2023-10-17 Thread Hanke Zhang via Gcc
Hi Richard,
I get it, thank you again.

I have another problem, so I'd like to ask it as well. Can a left
shift of the induction variable in a loop be folded into a closed-form
expression, like in the code below?

int ans = 0;
int width = rand() % 16;
for (int j = 0; j < width; j++)
  ans += 1 << (j + width);

into:

int width = rand() % 16;
int ans = (1 << (2 * width)) - (1 << width);
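
For reference, the closed form follows from the geometric series: the sum of 2^(j+width) for j = 0 .. width-1 equals 2^width * (2^width - 1) = 2^(2*width) - 2^width. A small self-check (my addition, assuming width stays below 16 as above so nothing overflows):

```
#include <assert.h>
#include <stdlib.h>

int check_closed_form(void) {
  int width = rand() % 16;
  int ans = 0;
  for (int j = 0; j < width; j++)
    ans += 1 << (j + width);                        /* original loop */
  assert(ans == (1 << (2 * width)) - (1 << width)); /* closed form */
  return ans;
}
```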

I came across a more complex version of this and found that gcc
doesn't seem to handle it, so I wanted to write a pass myself to
optimize it.

I have two questions here. Does GCC have such an optimization? And if
I want to implement my own, where should I put it? After
pass_iv_optimize?

Thanks
Hanke Zhang

Richard Biener wrote on Tue, Oct 17, 2023 at 20:00:
>
> On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang  wrote:
> >
> > Richard Biener  于2023年10月17日周二 17:26写道:
> > >
> > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc  
> > > wrote:
> > > >
> > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > problem and would like to ask for advice.
> > > >
> > > > For example, for the following code:
> > > >
> > > > int main() {
> > > >   int size = 1000;
> > > >   int *foo = malloc(sizeof(int) * size);
> > > >   int c1 = rand(), t1 = rand();
> > > >
> > > >   for (int i = 0; i < size; i++) {
> > > > if (foo[i] & c1) {
> > > >   foo[i] = t1;
> > > > }
> > > >   }
> > > >
> > > >   // prevents the loop above from being optimized
> > > >   for (int i = 0; i < size; i++) {
> > > > printf("%d", foo[i]);
> > > >   }
> > > > }
> > > >
> > > > First of all, the if statement block in the loop will be converted to
> > > > a MASK_STORE through if-conversion optimization. But after
> > > > tree-vector, it will still become a branched form. The part of the
> > > > final disassembly structure probably looks like below(Using IDA to do
> > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > in it, which will lead to low efficiency.
> > > >
> > > > do
> > > >   {
> > > > while ( 1 )
> > > > {
> > > >   __asm
> > > >   {
> > > > vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > vpcmpeqd ymm0, ymm0, ymm1
> > > > vpcmpeqd ymm0, ymm0, ymm1
> > > > vptest  ymm0, ymm0
> > > >   }
> > > >   if ( !_ZF )
> > > > break;
> > > >   _RAX += 8;
> > > >   if ( _RAX == v9 )
> > > > goto LABEL_5;
> > > > }
> > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > _RAX += 8;
> > > >   }
> > > >   while ( _RAX != v9 );
> > > >
> > > > Why can't we just replace the vptest and if statement with some other
> > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > good way to do that?
> > >
> > > The branch is added by optimize_mask_stores after vectorization because
> > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > some architectures when fault assists (read-only pages, but also COW 
> > > pages)
> > > are ran into.  All the microcode handling needs to possibly be carried out
> > > multiple times, for each such access to the same page.  That can cause
> > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > is replaced by
> > >
> > >  if (mask != 0)
> > >masked_store ();
> > >
> > > and this is an optimization (which itself has a small cost).
> > >
> > > Richard.
> >
> > Yeah, I know that and I have seen the code of optimize_mask_store().
> > And the main problem here is that when multiple MASK_STORE appear in
> > the same loop, many branches will appear, resulting in a decrease in
> > overall efficiency.
> >
> > And my original idea is that why can't we replace MASK_STORE with more
> > effective SIMD instructions because icc can do much better in this
> > case.
>
> ICC probably doesn't care for the case where foo[] isn't writable.  In
> fact for the case at hand we see it comes from malloc() which we
> can assume to return writable memory I guess.  That means if-conversion
> can treat the unconditional read as a way to also allow to speculate
> the write (with -fallow-store-data-races).
>
> Note there's currently no pointer analysis that tracks writability.
>
> > Then I give it up, because the ability to analyze vectorization
> > of gcc is not as good as icc and my ability does not support me
> > modifying this part of the code.
> >
> > Thanks very much for your reply.
>
> You're welcome.
>
> Richard.
>
> > >
> > > >
> > > > Thanks
> > > > Hanke Zhang


Question about the pass_fre about merging if blocks.

2023-10-20 Thread Hanke Zhang via Gcc
Hi, I'm trying to make pass_fre work better for me, but I ran into a
problem with code like the one below:

int glob;

void __attribute__((noinline))
foo() {
  // do nothing meaningful
}

int main() {
  if (glob) {
foo();
  } else {
// do something that won't change glob
  }

  if (glob) {
foo();
  } else {
// do something that won't change glob
  }
}

I'm trying to merge the two if blocks, but of course it doesn't work.
Looking at the code in tree-ssa-structalias.cc, I see it does this
check:

static bool
pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) {
  // 
  if (pt->nonlocal
  && is_global_var (decl))
return true;
 // 
}

So pt->nonlocal prevents the merge. I would like to ask whether I can
make this check less strict by providing more information about
function foo(). Obviously foo does not change the value of glob here;
the only reason the information is missing is that foo is noinline.

So if I could prove that foo won't change glob and pass this
information to this check, my goal would be achieved. Is this
possible?
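
A hedged sketch of a variant where the needed information stays within one translation unit (assuming 'glob' can be made static): with glob static and foo visibly not touching it, the second load of glob becomes redundant and the branches can be merged:

```
static int glob;               /* static: only this TU can touch it */
static int counter;

static void __attribute__((noinline)) foo(void) {
  counter++;                   /* never reads or writes glob */
}

int main(void) {
  if (glob) foo();             /* first load of glob */
  if (glob) foo();             /* second load: CSE-able, blocks mergeable */
  return counter;
}
```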

Thanks
Hanke Zhang


Re: Question about the pass_fre about merging if blocks.

2023-10-20 Thread Hanke Zhang via Gcc
Richard Biener wrote on Fri, Oct 20, 2023 at 21:33:
>
> On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc  wrote:
> >
> > Hi, I'm trying to make pass_fre work better for me. But I got a
> > problem. Like the code below:
> >
> > int glob;
> >
> > void __attribute__((oninline))
> > foo() {
> >   // do nothing meaningful
> > }
> >
> > int main() {
> >   if (glob) {
> > foo();
> >   } else {
> > // do something that won't change glob
> >   }
> >
> >   if (glob) {
> > foo();
> >   } else {
> > // do something that won't change glob
> >   }
> > }
> >
> > I'm trying to merge the two if blocks. And of course it won't work. I
> > see the code in tree-ssa-structalias.cc, it will do this check:
> >
> > static bool
> > pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) {
> >   // 
> >   if (pt->nonlocal
> >   && is_global_var (decl))
> > return true;
> >  // 
> > }
> >
> > So the pt->nonlocal will prevent the merge. I would like to ask if I
> > can make this check less rigid by providing more information about
> > function foo(). Because obviously the foo function does not change the
> > value of glob here, but it is caused by the fact that it is noinline.
> >
> > So if I can prove that foo won't change glob and pass this infomation
> > to this check, my goal was achieved. Is this possible?
>
> In case 'glob' were static IPA reference has this info and we'd already
> use it from call_may_clobber_ref_p.  There's IPA mod-ref which is
> a bit more powerful than IPA reference but I think it does not record
> this precise info.
>
> Note that the problem with non-static globals is that we do not know
> whether they get modified indirectly because they might have their
> address take in another TU and that address passed back into the TU.
> Usually using LTO helps here and we can promote the decl to hidden
> visibility.

Hi Richard:

Thanks for your reply.

Yeah, I know that. (We do not know whether they get modified
indirectly because they might have their address take in another TU
and that address passed back into the TU.)

But in my case, I think I have to let go of some security concerns.
And LTO is already enabled in my case, but that doesn't help.

So my understanding is that the information about whether foo uses
glob directly is available, but for security reasons GCC does not use
it as a basis here (pt_solution_includes_1). Right?

So if I want to change the default behavior of GCC, can I use
call_may_clobber_ref_p to check this and make the merge happen here?

Thanks
Hanke Zhang

>
> Richard.
>
> >
> > Thanks
> > Hanke Zhang


Re: Question about the pass_fre about merging if blocks.

2023-10-21 Thread Hanke Zhang via Gcc
Richard Biener wrote on Fri, Oct 20, 2023 at 23:27:
>
>
>
> > On 20.10.2023 at 16:33, Hanke Zhang wrote:
> >
> > Richard Biener  于2023年10月20日周五 21:33写道:
> >>
> >>> On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc  
> >>> wrote:
> >>>
> >>> Hi, I'm trying to make pass_fre work better for me. But I got a
> >>> problem. Like the code below:
> >>>
> >>> int glob;
> >>>
> >>> void __attribute__((oninline))
> >>> foo() {
> >>>  // do nothing meaningful
> >>> }
> >>>
> >>> int main() {
> >>>  if (glob) {
> >>>foo();
> >>>  } else {
> >>>// do something that won't change glob
> >>>  }
> >>>
> >>>  if (glob) {
> >>>foo();
> >>>  } else {
> >>>// do something that won't change glob
> >>>  }
> >>> }
> >>>
> >>> I'm trying to merge the two if blocks. And of course it won't work. I
> >>> see the code in tree-ssa-structalias.cc, it will do this check:
> >>>
> >>> static bool
> >>> pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) {
> >>>  // 
> >>>  if (pt->nonlocal
> >>>  && is_global_var (decl))
> >>>return true;
> >>> // 
> >>> }
> >>>
> >>> So the pt->nonlocal will prevent the merge. I would like to ask if I
> >>> can make this check less rigid by providing more information about
> >>> function foo(). Because obviously the foo function does not change the
> >>> value of glob here, but it is caused by the fact that it is noinline.
> >>>
> >>> So if I can prove that foo won't change glob and pass this infomation
> >>> to this check, my goal was achieved. Is this possible?
> >>
> >> In case 'glob' were static IPA reference has this info and we'd already
> >> use it from call_may_clobber_ref_p.  There's IPA mod-ref which is
> >> a bit more powerful than IPA reference but I think it does not record
> >> this precise info.
> >>
> >> Note that the problem with non-static globals is that we do not know
> >> whether they get modified indirectly because they might have their
> >> address take in another TU and that address passed back into the TU.
> >> Usually using LTO helps here and we can promote the decl to hidden
> >> visibility.
> >
> > Hi Richard:
> >
> > Thanks for your replying.
> >
> > Yeah, I know that. (We do not know whether they get modified
> > indirectly because they might have their address take in another TU
> > and that address passed back into the TU.)
> >
> > But I think in my case, I think I have to let go of some security
> > concerns. And LTO is enabled already in my case, but that doesn't
> > help.
>
> It’s not security it’s correctness.
>
> >
> > So I can understand that the information includes whether Foo uses
> > glob directly is actually used, but for security reasons, GCC does not
> > use it as a basis here (pt_solution_includes_1). Right?
>
> I don’t think this function is the correct place for the fix.  I’d instead 
> put it into …
>
> > So if I want to change the default behavior of GCC, can I use
> > call_may_clobber_ref_p to check out and
>
> … this function where it would be more targeted.
>
> As said, you are going to miscompile valid code.  Note it should be possible 
> to enhance IPA reference dependent on how foo actually looks like (like if 
> there are no pointer based accesses in it).

Hi Richard:

Thanks. I understand it now. But I got another problem while handling
this. I create new nodes from existing nodes in my own passes which
are located at all_late_ipa_passes. So when I tried to get the
ipa_reference_optimization_summary in pass_fre, I got nothing because
old nodes were replaced by the new nodes I created before. Is there a
way that I can get the ipa_reference_summary via my new nodes? The
main code for getting the ipa_reference summary is here:

written = ipa_reference_get_written_global (node);

Or is it because I'm replacing nodes in the wrong way?

Thanks
Hanke Zhang

>
> Richard.
>
> > make it merge here?
> >
> > Thanks
> > Hanke Zhang
> >
> >>
> >> Richard.
> >>
> >>>
> >>> Thanks
> >>> Hanke Zhang


Re: Question about the pass_fre about merging if blocks.

2023-10-21 Thread Hanke Zhang via Gcc
Richard Biener wrote on Sat, Oct 21, 2023 at 18:23:
>
>
>
> > On 21.10.2023 at 10:58, Hanke Zhang wrote:
> >
> > Richard Biener wrote on Fri, Oct 20, 2023 at 23:27:
> >>
> >>
> >>
> >>>> On 20.10.2023 at 16:33, Hanke Zhang wrote:
> >>>
> >>> Richard Biener wrote on Fri, Oct 20, 2023 at 21:33:
> >>>>
> >>>>> On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc  
> >>>>> wrote:
> >>>>>
> >>>>> Hi, I'm trying to make pass_fre work better for me. But I got a
> >>>>> problem. Like the code below:
> >>>>>
> >>>>> int glob;
> >>>>>
> >>>>> void __attribute__((oninline))
> >>>>> foo() {
> >>>>> // do nothing meaningful
> >>>>> }
> >>>>>
> >>>>> int main() {
> >>>>> if (glob) {
> >>>>>   foo();
> >>>>> } else {
> >>>>>   // do something that won't change glob
> >>>>> }
> >>>>>
> >>>>> if (glob) {
> >>>>>   foo();
> >>>>> } else {
> >>>>>   // do something that won't change glob
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> I'm trying to merge the two if blocks. And of course it won't work. I
> >>>>> see the code in tree-ssa-structalias.cc, it will do this check:
> >>>>>
> >>>>> static bool
> >>>>> pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) {
> >>>>> // 
> >>>>> if (pt->nonlocal
> >>>>> && is_global_var (decl))
> >>>>>   return true;
> >>>>> // 
> >>>>> }
> >>>>>
> >>>>> So the pt->nonlocal will prevent the merge. I would like to ask if I
> >>>>> can make this check less rigid by providing more information about
> >>>>> function foo(). Because obviously the foo function does not change the
> >>>>> value of glob here, but it is caused by the fact that it is noinline.
> >>>>>
> >>>>> So if I can prove that foo won't change glob and pass this infomation
> >>>>> to this check, my goal was achieved. Is this possible?
> >>>>
> >>>> In case 'glob' were static IPA reference has this info and we'd already
> >>>> use it from call_may_clobber_ref_p.  There's IPA mod-ref which is
> >>>> a bit more powerful than IPA reference but I think it does not record
> >>>> this precise info.
> >>>>
> >>>> Note that the problem with non-static globals is that we do not know
> >>>> whether they get modified indirectly because they might have their
> >>>> address take in another TU and that address passed back into the TU.
> >>>> Usually using LTO helps here and we can promote the decl to hidden
> >>>> visibility.
> >>>
> >>> Hi Richard:
> >>>
> >>> Thanks for your replying.
> >>>
> >>> Yeah, I know that. (We do not know whether they get modified
> >>> indirectly because they might have their address take in another TU
> >>> and that address passed back into the TU.)
> >>>
> >>> But I think in my case, I think I have to let go of some security
> >>> concerns. And LTO is enabled already in my case, but that doesn't
> >>> help.
> >>
> >> It’s not security it’s correctness.
> >>
> >>>
> >>> So I can understand that the information includes whether Foo uses
> >>> glob directly is actually used, but for security reasons, GCC does not
> >>> use it as a basis here (pt_solution_includes_1). Right?
> >>
> >> I don’t think this function is the correct place for the fix.  I’d instead 
> >> put it into …
> >>
> >>> So if I want to change the default behavior of GCC, can I use
> >>> call_may_clobber_ref_p to check out and
> >>
> >> … this function where it would be more targeted.
> >>
> >> As said, you are going to miscompile valid code.  Note it should be 
> >> possible to enhance IPA reference dependent on how foo actually looks like 
> >> (like if there are no pointer based accesse

Re: the elimination of if blocks in GCC during if-conversion and vectorization

2023-10-23 Thread Hanke Zhang via Gcc
Hi Richard:

Thanks for your advice. But when I tried a simpler example like the
one below, before looking at the code, GCC still does nothing.

int main() {
int width;
scanf("%d", &width);
int sum = 0;
for (int i = 0; i < width; i++) sum += i;
printf("%d\n", sum);
}

I tried -O3 and LTO, but the result is still the same. So I'd like to
ask why, or am I doing something wrong?
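
For reference, this is what a final-value replacement of that loop would ideally produce (my sketch; the guard covers non-positive width):

```
#include <stdio.h>

int main(void) {
  int width;
  if (scanf("%d", &width) != 1)
    return 1;
  /* final value of the loop: 0 + 1 + ... + (width - 1) */
  int sum = (width > 0) ? width * (width - 1) / 2 : 0;
  printf("%d\n", sum);
  return 0;
}
```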

Thanks
Hanke Zhang

Richard Biener wrote on Thu, Oct 19, 2023 at 20:00:
>
> On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang  wrote:
> >
> > Hi Richard
> > I get it, thank you again.
> >
> > And I got another problem, so I'd like to ask it by the way. Can the left
> > shift of the induction variable in a loop be optimized as a constant?
> > Like the code below:
> >
> > int ans = 0;
> > int width = rand() % 16;
> > for (int j = 0; j < width; j++)
> >   ans += 1 << (j + width);
> >
> > into:
> >
> > int width = rand() % 16;
> > ans = (1 << (2 * width)) - (1 << width);
> >
> > I came across a more complex version of that and found that gcc
> > doesn't seem to handle it, so wanted to write a pass myself to
> > optimize it.
> >
> > I got two questions here. Does GCC have such optimizations? If I want
> > to do my own optimization, where should I put it? Put it behind the
> > pass_iv_optimize?
>
> GCC has the final value replacement pass (pass_scev_cprop) doing these
> kind of transforms.  Since 'ans' does not have an affine evolution this
> case would need to be pattern matched (there are some existing pattern
> matchings in the pass).
>
> > Thanks
> > Hanke Zhang
> >
> > On Tue, Oct 17, 2023 at 20:00, Richard Biener wrote:
> > >
> > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang  wrote:
> > > >
> > > > On Tue, Oct 17, 2023 at 17:26, Richard Biener wrote:
> > > > >
> > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc  
> > > > > wrote:
> > > > > >
> > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a 
> > > > > > small
> > > > > > problem and would like to ask for advice.
> > > > > >
> > > > > > For example, for the following code:
> > > > > >
> > > > > > int main() {
> > > > > >   int size = 1000;
> > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > >   int c1 = rand(), t1 = rand();
> > > > > >
> > > > > >   for (int i = 0; i < size; i++) {
> > > > > > if (foo[i] & c1) {
> > > > > >   foo[i] = t1;
> > > > > > }
> > > > > >   }
> > > > > >
> > > > > >   // prevents the loop above from being optimized
> > > > > >   for (int i = 0; i < size; i++) {
> > > > > > printf("%d", foo[i]);
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > First of all, the if statement block in the loop will be converted 
> > > > > > to
> > > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > > tree-vector, it will still become a branched form. The part of the
> > > > > > final disassembly structure probably looks like below(Using IDA to 
> > > > > > do
> > > > > > this), and you can see that there is still such a branch 'if ( !_ZF 
> > > > > > )'
> > > > > > in it, which will lead to low efficiency.
> > > > > >
> > > > > > do
> > > > > >   {
> > > > > > while ( 1 )
> > > > > > {
> > > > > >   __asm
> > > > > >   {
> > > > > > vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > > > vpcmpeqd ymm0, ymm0, ymm1
> > > > > > vpcmpeqd ymm0, ymm0, ymm1
> > > > > > vptest  ymm0, ymm0
> > > > > >   }
> > > > > >   if ( !_ZF )
> > > > > > break;
> > > > > >   _RAX += 8;
> > > > > >   if ( _RAX == v9 )
> > > > > > goto LABEL_5;
> > > > > > }
> > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > > > _RAX += 8;

Re: the elimination of if blocks in GCC during if-conversion and vectorization

2023-10-23 Thread Hanke Zhang via Gcc
On Mon, Oct 23, 2023 at 20:32, Richard Biener wrote:
>
> On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang  wrote:
> >
> > Hi Richard:
> >
> > Thanks for your advice. But when I try a simpler example like the one
> > below before looking at the code, GCC still does nothing.
> >
> > int main() {
> > int width;
> > scanf("%d", &width);
> > int sum = 0;
> > for (int i = 0; i < width; i++) sum += i;
> > printf("%d\n", sum);
> > }
> >
> > I tried O3 and LTO, but still the same. So I'd like to ask why, or am
> > I doing something wrong?
>
> -fdump-tree-sccp-details-scev reveals
>
> (set_scalar_evolution
>   instantiated_below = 5
>   (scalar = sum_9)
>   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> )
> (chrec_apply
>   (varying_loop = 1)
>   (chrec = {0, +, {1, +, 1}_1}_1)
>   (x = (unsigned int) width.0_12 + 4294967295)
>   (res = scev_not_known))
>
> so we don't know how to apply a variable number of iterations to
> the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not
> know how to compute the final value of the reduction.
>
> For a constant, say width == 100 we do:
>
> (set_scalar_evolution
>   instantiated_below = 2
>   (scalar = sum_6)
>   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> )
> (chrec_apply
>   (varying_loop = 1)
>   (chrec = {0, +, {1, +, 1}_1}_1)
>   (x = 99)
>   (res = 4950))

Yeah, I also found this result in previous experiments. But what
confuses me is that if 'width' can't be inferred to be an INTEGER_CST,
there's nothing we can do, right?

Because in my case, the variables corresponding to 'width' are almost
all undetermined values, such as 'width = rand()'. That is, I can
hardly get this optimization in my cases, right?
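
(For reference, the final value I am hoping for here is just the
triangular number, so a hand-written replacement would look roughly
like the sketch below, assuming width is non-negative; I understand
the pass would have to derive this symbolically.)

  /* hand-written final-value replacement for the loop above,
     assuming width >= 0: 0 + 1 + ... + (width-1) == width*(width-1)/2 */
  int sum = (int) ((long long) width * (width - 1) / 2);
  printf ("%d\n", sum);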

Thanks
Hanke Zhang


>
> Richard.
>
> >
> > Thanks
> > Hanke Zhang
> >
> > On Thu, Oct 19, 2023 at 20:00, Richard Biener wrote:
> > >
> > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang  wrote:
> > > >
> > > > Hi Richard
> > > > I get it, thank you again.
> > > >
> > > > And I got another problem, so I'd like to ask it by the way. Can the left
> > > > shift of the induction variable in a loop be optimized as a constant?
> > > > Like the code below:
> > > >
> > > > int ans = 0;
> > > > int width = rand() % 16;
> > > > for (int j = 0; j < width; j++)
> > > >   ans += 1 << (j + width);
> > > >
> > > > into:
> > > >
> > > > int width = rand() % 16;
> > > > ans = (1 << (2 * width)) - (1 << width);
> > > >
> > > > I came across a more complex version of that and found that gcc
> > > > doesn't seem to handle it, so wanted to write a pass myself to
> > > > optimize it.
> > > >
> > > > I got two questions here. Does GCC have such optimizations? If I want
> > > > to do my own optimization, where should I put it? Put it behind the
> > > > pass_iv_optimize?
> > >
> > > GCC has the final value replacement pass (pass_scev_cprop) doing these
> > > kind of transforms.  Since 'ans' does not have an affine evolution this
> > > case would need to be pattern matched (there are some existing pattern
> > > matchings in the pass).
> > >
> > > > Thanks
> > > > Hanke Zhang
> > > >
> > > > On Tue, Oct 17, 2023 at 20:00, Richard Biener wrote:
> > > > >
> > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Oct 17, 2023 at 17:26, Richard Biener wrote:
> > > > > > >
> > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in 
> > > > > > > > a small
> > > > > > > > problem and would like to ask for advice.
> > > > > > > >
> > > > > > > > For example, for the following code:
> > > > > > > >
> > > > > > > > int main() {
> > > > > > > >   int size = 1000;
> > > > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > > > >   int c1 = rand(), t1 = rand();
> > > > > > >

Question on replacing .MASK_STORE with ternary expressions

2023-10-25 Thread Hanke Zhang via Gcc
Hi, I've got a GIMPLE-related question.

I'm trying to replace the .MASK_STORE with a ternary expression in
pass_ifcvt like below:

.MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr;

But when I do this, I'm stuck. The addr here is an SSA_NAME, of
course. When I try to get the value that 'addr' points to through an
SSA_NAME as lhs, as in the code below, GCC reports an error.

// get MASK_STORE
gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
tree addr = gimple_call_arg (stmt, 0);
tree mask = gimple_call_arg (stmt, 2);
tree rhs = gimple_call_arg (stmt, 3);

// deref_ssa = *addr
tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
gimple *stmt1 = gimple_build_assign(deref_ssa, deref);

// ssa1 = mask ? rhs : deref_ssa
tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa);
tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
gimple *stmt2 = gimple_build_assign(ssa1, cond);

// *addr = ssa1
tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1);

// insert stmts
gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT);

// remove the origin MASK_STORE
gsi_remove (&gsi, true);

The error:

_588 = *_589;
_587 = _136 ? _272 : _588;
*_589 = _587;
unhandled expression in get_expr_operands():
 
unit-size 
align:32 warn_if_not_align:0 symtab:0 alias-set 8
canonical-type 0x7f25f57a12a0 precision:32
pointer_to_this >

arg:0 
public unsigned DI
size 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set 20
structural-equality>

def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>;
version:589
ptr-info 0x7f25f5716f18>>

Is there a way to fix this?

Thanks
Hanke Zhang


Re: Question on replacing .MASK_STORE with ternary expressions

2023-10-25 Thread Hanke Zhang via Gcc
Hi Richard:

I think I need to get the value that 'addr' points to; is MEM_REF really the right thing to use?

And I try to replace INDIRECT_REF with MEM_REF like this:

tree deref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr);
// ...
tree indirect_ref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr);

And the error is different; I think that means the type doesn't fit.

0x7afca4 build1(tree_code, tree_node*, tree_node*)
../../gcc/tree.cc:4968
0xe8292b replace_mask_store()
../../gcc/tree-if-conv.cc:3486
0xe8a03d execute
../../gcc/tree-if-conv.cc:3563
0xe8a03d execute
../../gcc/tree-if-conv.cc:3549

On Wed, Oct 25, 2023 at 23:40, Richard Biener wrote:
>
>
>
> > Am 25.10.2023 um 17:25 schrieb Hanke Zhang via Gcc :
> >
> > Hi, I got a gimple relative question.
> >
> > I'm trying to replace the .MASK_STORE with a ternary expression in
> > pass_ifcvt like below:
> >
> > .MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr;
> >
> > But when I do this, I'm stucked. The addr here is a SSA_NAME of
> > course. When I try to get the value that 'addr' points to through a
> > SSA_NAME as lhs like the code below, GCC reports an error.
> >
> > // get MASK_STORE
> > gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> > tree addr = gimple_call_arg (stmt, 0);
> > tree mask = gimple_call_arg (stmt, 2);
> > tree rhs = gimple_call_arg (stmt, 3);
> >
> > // deref_ssa = *addr
> > tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
> > tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
> > gimple *stmt1 = gimple_build_assign(deref_ssa, deref);
> >
> > // ssa1 = mask ? deref_ssa : rhs
> > tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa);
> > tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
> > gimple *stmt2 = gimple_build_assign(ssa1, cond);
> >
> > // *addr = ssa1
> > tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
> > gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1);
> >
> > // insert stmts
> > gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
> > gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
> > gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT);
> >
> > // remove the origin MASK_STORE
> > gsi_remove (&gsi, true);
> >
> > The error:
> >
> > _588 = *_589;
> > _587 = _136 ? _272 : _588;
> > *_589 = _587;
> > unhandled expression in get_expr_operands():
> >  >type  >size 
> >unit-size 
> >align:32 warn_if_not_align:0 symtab:0 alias-set 8
> > canonical-type 0x7f25f57a12a0 precision:32
> >pointer_to_this >
> >
> >arg:0  >type  > float>
> >public unsigned DI
> >size 
> >unit-size 
> >align:64 warn_if_not_align:0 symtab:0 alias-set 20
> > structural-equality>
> >
> >def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>;
> >version:589
> >ptr-info 0x7f25f5716f18>>
> >
> > Is there a way to fix this?
>
> You need to use a MEM_REF , not an INDIRECT_REF
>
> > Thanks
> > Hanke Zhang


Re: Question on replacing .MASK_STORE with ternary expressions

2023-10-25 Thread Hanke Zhang via Gcc
Hi RIchard:

Sorry about the previous mail. I made a mistake: I forgot that a
MEM_REF requires two operands (so build2 is needed).

After changing to the code below, everything is fine.

tree deref = build2 (MEM_REF, TREE_TYPE (deref_ssa), addr,
build_int_cst (TREE_TYPE (addr), 0));
// ...
tree indirect_ref = build2 (MEM_REF, TREE_TYPE (deref_ssa), addr,
build_int_cst (TREE_TYPE (addr), 0));

Thanks so much for your advice! That helps me a lot! :)

Thanks
Hanke Zhang

On Thu, Oct 26, 2023 at 00:09, Hanke Zhang wrote:

>
> Hi Richard:
>
> I think I need to get the value that 'addr' points to, is MEM_REF really 
> right?
>
> And I try to replace INDIRECT_REF with MEM_REF like this:
>
> tree deref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr);
> // ...
> tree indirect_ref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr);
>
> And the error is different, I think that means the type is not fit.
>
> 0x7afca4 build1(tree_code, tree_node*, tree_node*)
> ../../gcc/tree.cc:4968
> 0xe8292b replace_mask_store()
> ../../gcc/tree-if-conv.cc:3486
> 0xe8a03d execute
> ../../gcc/tree-if-conv.cc:3563
> 0xe8a03d execute
> ../../gcc/tree-if-conv.cc:3549
>
> On Wed, Oct 25, 2023 at 23:40, Richard Biener wrote:
> >
> >
> >
> > > Am 25.10.2023 um 17:25 schrieb Hanke Zhang via Gcc :
> > >
> > > Hi, I got a gimple relative question.
> > >
> > > I'm trying to replace the .MASK_STORE with a ternary expression in
> > > pass_ifcvt like below:
> > >
> > > .MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr;
> > >
> > > But when I do this, I'm stucked. The addr here is a SSA_NAME of
> > > course. When I try to get the value that 'addr' points to through a
> > > SSA_NAME as lhs like the code below, GCC reports an error.
> > >
> > > // get MASK_STORE
> > > gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> > > tree addr = gimple_call_arg (stmt, 0);
> > > tree mask = gimple_call_arg (stmt, 2);
> > > tree rhs = gimple_call_arg (stmt, 3);
> > >
> > > // deref_ssa = *addr
> > > tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
> > > tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
> > > gimple *stmt1 = gimple_build_assign(deref_ssa, deref);
> > >
> > > // ssa1 = mask ? deref_ssa : rhs
> > > tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa);
> > > tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL);
> > > gimple *stmt2 = gimple_build_assign(ssa1, cond);
> > >
> > > // *addr = ssa1
> > > tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr);
> > > gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1);
> > >
> > > // insert stmts
> > > gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
> > > gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
> > > gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT);
> > >
> > > // remove the origin MASK_STORE
> > > gsi_remove (&gsi, true);
> > >
> > > The error:
> > >
> > > _588 = *_589;
> > > _587 = _136 ? _272 : _588;
> > > *_589 = _587;
> > > unhandled expression in get_expr_operands():
> > >  > >type  > >size 
> > >unit-size 
> > >align:32 warn_if_not_align:0 symtab:0 alias-set 8
> > > canonical-type 0x7f25f57a12a0 precision:32
> > >pointer_to_this >
> > >
> > >arg:0  > >type  > > float>
> > >public unsigned DI
> > >size 
> > >unit-size 
> > >align:64 warn_if_not_align:0 symtab:0 alias-set 20
> > > structural-equality>
> > >
> > >def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>;
> > >version:589
> > >ptr-info 0x7f25f5716f18>>
> > >
> > > Is there a way to fix this?
> >
> > You need to use a MEM_REF , not an INDIRECT_REF
> >
> > > Thanks
> > > Hanke Zhang


Question about function Split with va_args

2023-11-01 Thread Hanke Zhang via Gcc
Hi

I've been working on function splitting recently, and I've noticed that
functions with va_args (variable arguments) won't be split. Why is
that? I tried to understand the comments in the source code, but I
still don't get the specific reason.

At the same time, if I do want to split functions with va_args when it
is safe to do so, how is safety defined here? In other words, how can I
tell whether a function with va_args can be split? Or is it simply not
possible?
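
To make the question concrete, here is the kind of function I have in
mind (my understanding is that va_start/va_arg must run in the frame of
the function that declares the '...', which is presumably why the
splitter refuses, but please correct me if that is wrong):

#include <stdarg.h>

int sum_all (int n, ...)
{
  va_list ap;
  int s = 0;

  va_start (ap, n);          /* needs the register save area of sum_all */
  for (int i = 0; i < n; i++)
    s += va_arg (ap, int);   /* the part I would hope could be split out */
  va_end (ap);
  return s;
}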

Thanks
Hanke Zhang


Question about changing the default behavior of ipa-inline

2023-11-03 Thread Hanke Zhang via Gcc
Hi, I recently ran into an inline-related issue and would like to ask
about it. This is about ipa-inline.

I'd like my function to be inlined, but it gets rejected in
'want_inline_small_function_p', more specifically in the
conditional statement 'growth_positive_p (callee, e, growth)'. The
offline size of the function is small and the growth is much
bigger. But I still hope it will be inlined.

Besides, the reason for its failure in the report file was
CIF_MAX_INLINE_INSNS_AUTO_LIMIT. But I don't seem to be able to fix
that by modifying the value of that parameter.

Is there another way to solve it? Or can I only change the source code
to make it happen?
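
(By "changing the source code" I mean something like forcing the callee
with always_inline, as in the sketch below; hot_helper is just a
placeholder name. I would prefer a flag or --param, if one exists, so
the sources stay untouched.)

/* sketch of the source-level workaround I am trying to avoid */
static inline __attribute__ ((always_inline))
int hot_helper (int x)
{
  return x * 2 + 1;   /* placeholder body; the real callee is larger */
}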

Thanks
Hanke Zhang


Question about vectorization optimization during RTL-PASS

2023-11-12 Thread Hanke Zhang via Gcc
Hi, I've been working on vectorization-related optimization lately.
GCC seems to have a missed-optimization issue here. I would like to
ask whether it can be solved.

For example, for the following program using AVX2:

#include 
// reg->node2[i].state is an unsigned long long variable
// reg->size is an integer variable that represents the iterations

for (int i = 0; i < reg->size; i+=4) {
  /* original code:
  unsigned long long state = reg->node2[i].state;
  if (state & (1LLU << j + 1 | 1LLU << width + j))
  state ^= (1LLU << j);
  state ^= (1LLU  << width + j);
  */
  __m256i state = _mm256_loadu_si256((__m256i *)((char*)(reg->node2) +
i * sizeof(unsigned long long)));

  __m256i mask1 = _mm256_set1_epi64x(1LLU << j + 1 | 1LLU << width + j);
  // cmp
  __m256i tmp1 = _mm256_and_si256(state, mask1);
  __m256i cmp1 = _mm256_cmpeq_epi64(tmp1, mask1);
  // xor
  __m256i xor_param = _mm256_set1_epi64x(1LLU << j);
  __m256i tmp2 = _mm256_and_si256(xor_param, cmp1);
  __m256i xor_result = _mm256_xor_si256(state, tmp2);
  // xor
  __m256i xor_param2 = _mm256_set1_epi64x(1LLU << width + j);
  __m256i xor_res2 = _mm256_xor_si256(xor_result, xor_param2);

  _mm256_storeu_si256((__m256i *)((char*)(reg->node2) + i *
sizeof(unsigned long long)), xor_res2);
}

My expectation is to generate assembly code like this:

vpxor   ymm6, ymm2, ymmword ptr [r9+r15*8]
vpand   ymm4, ymm1, ymm6
vpcmpeqq ymm5, ymm4, ymm1
vpand   ymm7, ymm3, ymm5
vpxor   ymm8, ymm6, ymm7
vmovdqu ymmword ptr [r9+r15*8], ymm8

But the actual generated assembly code looks like this:

vpand   ymm0, ymm2, ymmword ptr [rsi+rax*8]
vpxor   ymm1, ymm4, ymmword ptr [rsi+rax*8]
vpcmpeqq ymm0, ymm0, ymm2
vpand   ymm0, ymm0, ymm5
vpxor   ymm0, ymm0, ymm1
vmovdqu ymmword ptr [rsi+rax*8], ymm0

That is, GCC has hoisted the second XOR operation and at the same
time introduced an additional memory operand (ymmword ptr
[rsi+rax*8]), which I think may lead to a decrease in efficiency; I
also found that this instruction accounts for a large share of the
samples when I profile with perf.

At the same time, from the dump files I found that these
transformations are performed in the RTL passes, and they don't seem
to be easy to change. Is there a good way to get GCC to generate the
assembly code I want? Is it possible to modify my own source files or
the GCC source code to get that?
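
(One source-level rewrite I am experimenting with is to fold the two
XORs into one, using x ^ a ^ b == x ^ (a ^ b), so that only a single
XOR touches the loaded vector; a sketch with the same intrinsics and
variables as above:)

  // xor, combined: state ^ (masked (1<<j) ^ (1 << (width + j)))
  __m256i tmp2     = _mm256_and_si256(xor_param, cmp1);
  __m256i combined = _mm256_xor_si256(tmp2, xor_param2);
  __m256i xor_res2 = _mm256_xor_si256(state, combined);

  _mm256_storeu_si256((__m256i *)((char*)(reg->node2) + i *
sizeof(unsigned long long)), xor_res2);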


Question about creating an outermost loop

2023-11-20 Thread Hanke Zhang via Gcc
Hi, I've been working on loop tiling recently and want to add this
optimization to GCC, but I have encountered some problems and would
like to ask for help.

For the code below as an example:

for (int i = 0; i < 12; i++) {
  for (int j = 0; j < arr.length; j++) { // arr.length may be huge
// do something with arr[j]
  }
}

I want to create an outermost loop that wraps around the two loops of
the inner layer, and at the same time change the loop variables of the
innermost loop. The final result is as follows:

for (int k = 0; k < 8192; k++) {
  for (int i = 0; i < 12; i++) {
for (int j = 0; j < arr.length / 8192; j++) {
  // do something with arr[k * (arr.length / 8192) + j]
}
  }
}
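
(More precisely, since arr.length may not be divisible by 8192, the
shape I am after also needs a remainder loop; a source-level sketch,
writing arr_length for arr.length and assuming the iterations are
independent so that the reordering is legal:)

int tile = arr_length / 8192;

for (int k = 0; k < 8192; k++)
  for (int i = 0; i < 12; i++)
    for (int j = 0; j < tile; j++)
      ; // do something with arr[k * tile + j]

/* remainder: the tail elements not covered by the 8192 tiles */
for (int i = 0; i < 12; i++)
  for (int j = 8192 * tile; j < arr_length; j++)
    ; // do something with arr[j]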

But I don't know how to do this properly. I'm stuck on the virtual
operands and PHIs.

Is there any existing optimization in GCC that I can refer to?

Thanks.
Hanke Zhang.


Question about creating a global variable during an IPA PASS.

2023-12-13 Thread Hanke Zhang via Gcc
Hi, I'm trying to create a global variable in my own PASS, which is
located among the LATE_IPA_PASSES. (I'm using GCC 10.3.0.)

And after creating it, I added the attributes like the following.

// 1. create the var
tree new_name = get_identifier (xx);
tree new_type = build_pointer_type (xx);
tree new_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type);
add_attributes (new_var);

static void
add_attributes (tree var)
{
DECL_ARTIFICIAL (var) = 1;
DECL_EXTERNAL (var) = 0;
TREE_STATIC (var) = 1;
TREE_PUBLIC (var) = 1;
TREE_USED (var) = 1;
DECL_CONTEXT (var) = NULL_TREE;
TREE_THIS_VOLATILE (var) = 0;
TREE_ADDRESSABLE (var) = 0;
TREE_READONLY (var) = 0;
if (is_global_var (var))
  set_decl_tls_model (var, TLS_MODEL_NONE);
}

But when I try to compile some example files with -flto, errors occur.

/usr/bin/ld: xxx.ltrans0.ltrans.o: in function `xxx':
xxx.c: undefined reference to `glob_var'
xxx.c: undefined reference to `glob_var'
xxx.c: undefined reference to `glob_var'

Here `glob_var' is the global varaiable created in my PASS.

I would like to ask, am I using some attributes incorrectly?

Thanks
Hanke Zhang


Re: Question about creating a global variable during an IPA PASS.

2023-12-20 Thread Hanke Zhang via Gcc
Hi Thomas!

Thanks for your reply. That's exactly what I was missing. When I add
varpool_node::finalize_decl() to my code, everything works fine!
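
For the archives, the working sequence on my side now looks roughly
like this (GCC 10; the name and type are placeholders):

tree new_name = get_identifier ("glob_var");
tree new_type = build_pointer_type (integer_type_node);
tree new_var  = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type);
add_attributes (new_var);               /* same helper as in my first mail */
varpool_node::finalize_decl (new_var);  /* this was the missing step */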

On Sat, Dec 16, 2023 at 01:15, Thomas Schwinge wrote:
>
> Hi Hanke!
>
> On 2023-12-13T17:04:57+0800, Hanke Zhang via Gcc  wrote:
> > Hi, I'm trying to create a global variable in my own PASS which
> > located at the LATE_IPA_PASSES. (I'm using GCC 10.3.0.)
>
> I can't comment on IPA aspects, or whether something was different on
> oldish GCC 10 (why using that one, by the way?), and I've not actually
> verified what you're doing here:
>
> > And after creating it, I added the attributes like the following.
> >
> > // 1. create the var
> > tree new_name = get_identifier (xx);
> > tree new_type = build_pointer_type (xx);
> > tree new_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type);
> > add_attributes (new_var);
> >
> > static void
> > add_attributes (tree var)
> > {
> > DECL_ARTIFICIAL (var) = 1;
> > DECL_EXTERNAL (var) = 0;
> > TREE_STATIC (var) = 1;
> > TREE_PUBLIC (var) = 1;
> > TREE_USED (var) = 1;
> > DECL_CONTEXT (var) = NULL_TREE;
> > TREE_THIS_VOLATILE (var) = 0;
> > TREE_ADDRESSABLE (var) = 0;
> > TREE_READONLY (var) = 0;
> > if (is_global_var (var))
> >   set_decl_tls_model (var, TLS_MODEL_NONE);
> > }
> >
> > But when I try to compile some example files with -flto, error occurs.
> >
> > /usr/bin/ld: xxx.ltrans0.ltrans.o: in function `xxx':
> > xxx.c: undefined reference to `glob_var'
> > xxx.c: undefined reference to `glob_var'
> > xxx.c: undefined reference to `glob_var'
> >
> > Here `glob_var' is the global varaiable created in my PASS.
> >
> > I would like to ask, am I using some attributes incorrectly?
>
> ..., but are you maybe simply missing to
> 'varpool_node::add (var);' or 'varpool_node::finalize_decl (var);' or
> something like that?  See other uses of those, and description in
> 'gcc/cgraph.h', 'struct [...] varpool_node':
>
>   /* Add the variable DECL to the varpool.
>  Unlike finalize_decl function is intended to be used
>  by middle end and allows insertion of new variable at arbitrary point
>  of compilation.  */
>   static void add (tree decl);
>
>   /* Mark DECL as finalized.  By finalizing the declaration, frontend 
> instruct
>  the middle end to output the variable to asm file, if needed or 
> externally
>  visible.  */
>   static void finalize_decl (tree decl);
>
> If that's not it, we'll have to look in more detail.
>
>
> Regards
>  Thomas


How to compress the size of a field in a structure?

2024-01-12 Thread Hanke Zhang via Gcc
Hi, I'm attempting to compress the size of a field in a structure for
memory-friendly purposes. I created an IPA pass to achieve this, but I
ran into some issues as follows:

// original
struct Foo {
  long a1;
  int a2;
};

// modified
struct Foo_update {
  int a1;
  int a2;
};

For the example structure Foo, I use `TREE_TYPE (field) =
integer_type_node` to compress the type of a1 from `long` to `int`.

But I don't know how to update the corresponding SSA variables,
because there are a huge number of them. Is there any way to do it quickly?
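
What I have so far is a blunt per-function walk like the sketch below,
which is why I am worried about the amount of rewriting:

unsigned i;
tree name;
FOR_EACH_SSA_NAME (i, name, cfun)
  {
    gimple *def = SSA_NAME_DEF_STMT (name);
    /* If DEF loads or stores Foo::a1, retype NAME and rewrite DEF and
       every statement in its use chain accordingly ...  */
    (void) def;
  }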

Thanks
Hanke Zhang


Re: How to compress the size of a field in a structure?

2024-01-13 Thread Hanke Zhang via Gcc
Hi Iain,

Thanks for your thoughtful comments and concerns. The primary goal of
compressing the structure fields is to achieve memory efficiency. I
understand your points regarding potential issues with modifying the
size and types of the structure. I guess rewriting all the
corresponding statements may help with it. If there are some more
complex situations, we will give up this optimization.

But I'm more concerned about how to accomplish this.

Thanks
Hanke Zhang

On Sat, Jan 13, 2024 at 17:15, Iain Sandoe wrote:
>
>
>
> > On 13 Jan 2024, at 07:45, Hanke Zhang via Gcc  wrote:
> >
> > Hi, I'm attempting to compress the size of a field in a structure for
> > memory-friendly purposes. I created an IPA pass to achieve this, but I
> > ran into some issues as follows:
> >
> > // original
> > struct Foo {
> >  long a1;
> >  int a2;
> > };
> >
> > // modified
> > struct Foo_update {
> >  int a1;
> >  int a2;
> > };
> >
> > For the example structure Foo, I use `TREE_TYPE (field) =
> > integer_type_node` to compress the type of a1 from `long` to `int`.
>
> Hmm.  I am curious about under what conditions this is expected to work.
>
> Altering the size of the structure (or the types it contains) would break 
> guarantees
> expected by at least c-family languages   - like sizeof, alignof
> what about arrays of the type, or embedding the type in larger aggregates?
>
> Iain
>
> > But I don't know how to update its corresponding SSA variables,
> > because the number of them is huge. Is there any way to do it quickly?
> >
> > Thanks
> > Hanke Zhang
>


How to get the default values for parameters?

2024-04-03 Thread Hanke Zhang via Gcc
Hi,
I'm trying to get the default values for parameters of some functions
in my GIMPLE pass. The example code is here:

enum {
  edefault1 = 1,
  edefault2 = 2,
  edefault3 = 3
};

void func(int p0, int p1 = edefault1, int p2 = edefault2, int p3 = edefault3) {
  // do other things
}

I want to get the value of `p1` here. But I didn't find a way to get it.

And I noticed that the macro `DECL_INITIAL` cannot be applied to a
PARM_DECL. And the comments say that the default values for parameters
are encoded in the type of the function. But I can't find them there.
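
My reading of that comment is that the defaults should sit in the
TREE_PURPOSE slots of TYPE_ARG_TYPES, so I tried something like the
sketch below (fndecl being the FUNCTION_DECL of func), but I am not
sure the information even survives into the middle end:

tree fntype = TREE_TYPE (fndecl);
for (tree p = TYPE_ARG_TYPES (fntype); p; p = TREE_CHAIN (p))
  if (tree defarg = TREE_PURPOSE (p))
    debug_tree (defarg);   /* hoping to see edefault1 etc. here */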

If you know how to get the value, I'll be so grateful! :)

Thanks
Hanke Zhang


Question about constructing vector types in GIMPLE pass

2024-04-08 Thread Hanke Zhang via Gcc
Hi,
I've been working on strengthening auto-vectorization on Intel CPUs
recently. I'm trying to do it in a GIMPLE pass, and I noticed that some
vector types in the GIMPLE code are confusing to me.
is here:

_1 = MEM[(const __m256i_u * {ref-all})_2];

I wondered how I could construct or get the type `(const __m256i_u *
{ref-all})` in the GIMPLE pass.

If you have any ideas that can help me. I'll be so grateful! :)

Thanks
Hanke Zhang


Re: Question about constructing vector types in GIMPLE pass

2024-04-08 Thread Hanke Zhang via Gcc
Hi Marc,

Thanks for your reply.

I want to create a new type similar to this one `(const __m256i_u *
{ref-all})` indeed. And I try to create it via these calls:

tree type = build_vector_type_for_mode (intDI_type_node, V4DImode);
tree type_p = build_pointer_type_for_mode(type, VOIDmode, true);

But when I print `type_p`, it shows `vector(4) long int *
{ref-all}`. So I'm confused about whether they are the same type, or
whether one can be converted to the other.
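
(My current plan, combining your suggestions, is roughly the sketch
below; I am assuming build_aligned_type with BITS_PER_UNIT is what
models the unaligned __m256i_u variant, but please correct me if the
resulting types still would not match:)

tree vtype   = build_vector_type_for_mode (intDI_type_node, V4DImode);
tree vtype_u = build_aligned_type (vtype, BITS_PER_UNIT);  /* unaligned */
tree ptype   = build_pointer_type_for_mode (vtype_u, VOIDmode,
                                            true /* can alias anything */);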

And I'm stuck on another problem: I want to call
`__builtin_ia32_pmovmskb256` in the GIMPLE pass, but I found that this
builtin is defined in `config/i386/i386-builtins.h`, and when I try
to include this header file, errors occur during
compilation. If you know any way to solve this problem, I would be
very grateful. :)

Thanks
Hanke Zhang

On Tue, Apr 9, 2024 at 03:01, Marc Glisse wrote:
>
> On Mon, 8 Apr 2024, Hanke Zhang via Gcc wrote:
>
> > Hi,
> > I've been working on strengthening auto-vectorization on intel CPUs
> > recently. I tried to do it in the GIMPLE pass. And I noticed that some
> > vector types in the GIMPLE code are confusing to me. The example code
> > is here:
> >
> > _1 = MEM[(const __m256i_u * {ref-all})_2];
> >
> > I wondered how I could construct or get the type `(const __m256i_u *
> > {ref-all})` in the GIMPLE pass.
> >
> > If you have any ideas that can help me. I'll be so grateful! :)
>
> I am not sure what you are asking exactly. If you already have access to
> such a MEM_REF, then the doc tells you where to look for this type:
>
> "The first operand is the pointer being dereferenced; it will always have
> pointer or reference type.  The second operand is a pointer constant
> serving as constant offset applied to the pointer being dereferenced
> with its type specifying the type to be used for type-based alias
> analysis.
> The type of the node specifies the alignment of the access."
>
> If you want to create a new type similar to this one, you can build it
> with various tools:
>
> build_vector_type or build_vector_type_for_mode
> build_pointer_type_for_mode(*, VOIDmode, true) to build a pointer that can 
> alias anything
> build_qualified_type to add const (probably useless)
> build_aligned_type to specify that it is unaligned
>
> --
> Marc Glisse


How to format code in GCC style?

2024-04-29 Thread Hanke Zhang via Gcc
Hi
I'm trying to format my code in GCC style.

But I don't know how to do it properly. I tried `indent` and
`check_GNU_style.sh` in the `contrib` directory, but neither of them seems
to work well.

So I wondered is there any official tools to do that?

Thanks
Hanke Zhang


Re: How to format code in GCC style?

2024-04-30 Thread Hanke Zhang via Gcc
Hi Filip,

Thanks for your reply. But I'm not very familiar with Vim, actually, so
when I tried the options listed there, nothing happened. (VS Code is the
IDE I'm using for development.)

And I notice that there is a `vimrc` in the `contrib` directory, I
tried it too, but sadly, it didn't work for me.

Maybe I configured vim in some wrong way.

Thanks
Hanke Zhang

On Tue, Apr 30, 2024 at 18:00, Filip Kastl wrote:
>
> On Tue 2024-04-30 13:52:16, Hanke Zhang via Gcc wrote:
> > Hi
> > I'm trying to format my code in GCC style.
> >
> > But I don't know how to finish it. I try `indent` and
> > `check_GNU_style.sh` in the `contrib` directory. But none of them seem
> > to work well.
> >
> > So I wondered is there any official tools to do that?
> >
> > Thanks
> > Hanke Zhang
>
> Hi Hanke,
>
> Not sure if this will help you but this is what I use: While writing code in
> GCC style, I use the vim options listed here:
>
> https://gcc.gnu.org/wiki/FormattingCodeForGCC
>
> Then, before I submit a patch I run it through check_GNU_style.sh.  This works
> fine for me.
>
> Cheers,
> Filip Kastl


Question about optimizing function pointers for direct function calls

2024-05-23 Thread Hanke Zhang via Gcc
Hi,
I got a question about optimizing function pointers for direct
function calls in C.

Consider the following scenario: one of the fields of a structure is a
function pointer, and all its assignments come from the same function.
Can all its uses be replaced by direct calls to that function, so that
later passes can do more optimizations?

Here is the example:

int add(int a, int b) { return a + b; }
int sub(int a, int b) { return a - b; }

struct Foo {
int (*add)(int, int);
};
int main()
{
struct Foo *foo = malloc(sizeof(struct Foo) * 5);

for (int i = 0; i < 5; i++) {
foo[i].add = add;
}

int sum = 0;
for (int i = 0; i < 5; i++) {
sum += foo[i].add(1, 2);
}

return 0;
}

Can I replace the code above to the code below?

int add(int a, int b) { return a + b; }
int sub(int a, int b) { return a - b; }

struct Foo {
int (*add)(int, int);
};
int main()
{
struct Foo *foo = malloc(sizeof(struct Foo) * 5);

for (int i = 0; i < 5; i++) {
foo[i].add = add;
}

int sum = 0;
for (int i = 0; i < 5; i++) {
sum += add(1,2);
}

return 0;
}

My idea is to determine whether all assignments to the field store the
same function, and if so, perform the replacement.

Of course this is not a reasonable optimization in general; I just want
to know if there are security issues in doing so, and whether it is
possible to do it in the IPA stage.

Thanks
Hanke Zhang


Question about the SLP vectorizer failed to perform automatic vectorization in one case

2024-05-25 Thread Hanke Zhang via Gcc
Hi,
I'm trying to study the automatic vectorization optimizations in GCC,
but I found one case that the SLP vectorizer fails to handle.

Here is the sample code (also a simplified version of a function
from the 625/525.x264 source code in SPEC CPU 2017)

void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
for (int x = 0; x < 4; x++)
  diff[x + y * 4] = pix1[x] - pix2[x];
pix1 += 16;
pix2 += 32;
  }
}

When I compile with `-O3 -mavx2/-msse4.2`, the SLP vectorizer fails to
vectorize it, and I get the following messages when adding
`-fopt-info-vec-all`. (The inner loop gets unrolled.)

:6:21: optimized: loop vectorized using 8 byte vectors
:6:21: optimized:  loop versioned for vectorization because of
possible aliasing
:5:6: note: vectorized 1 loops in function.
:5:6: note: * Analysis failed with vector mode V8SI
:5:6: note: * The result for vector mode V32QI would be the same
:5:6: note: * Re-trying analysis with vector mode V16QI
:5:6: note: * Analysis failed with vector mode V16QI
:5:6: note: * Re-trying analysis with vector mode V8QI
:5:6: note: * Analysis failed with vector mode V8QI
:5:6: note: * Re-trying analysis with vector mode V4QI
:5:6: note: * Analysis failed with vector mode V4QI

If I manually use the type declaration provided by `immintrin.h` to
rewrite the code, the code is as follows (which I hope the SLP
vectorizer to be able to do)

void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
__v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
__v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
__v4hi diff_v = pix1_v - pix2_v;
*(long long *)(diff + y * 4) = (long long)diff_v;
pix1 += 16;
pix2 += 32;
  }
}

What I want to know is why the SLP vectorizer can't vectorize the code
here, and what changes I need to make to the SLP vectorizer or the
source code if I want it to do so.

Thanks
Hanke Zhang


Re: Question about the SLP vectorizer failed to perform automatic vectorization in one case

2024-05-27 Thread Hanke Zhang via Gcc
Hi Biener,

Thanks for your help!

I have already open a bugreport here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252.

Thanks
Hanke Zhang

On Mon, May 27, 2024 at 21:14, Richard Biener wrote:
>
> On Sat, May 25, 2024 at 3:08 PM Hanke Zhang via Gcc  wrote:
> >
> > Hi,
> > I'm trying to studing the automatic vectorization optimization in GCC,
> > but I found one case that SLP vectorizer failed to do such things.
> >
> > Here is the sample code: (also a simplification version of a function
> > from the 625/525.x264 source code in SPEC CPU 2017)
> >
> > void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
> >   for (int y = 0; y < 4; y++) {
> > for (int x = 0; x < 4; x++)
> >   diff[x + y * 4] = pix1[x] - pix2[x];
> > pix1 += 16;
> > pix2 += 32;
>
> The issue is these increments, with only four uint8_t elements accessed
> we still want to fill up a vectors worth of them.
>
> In the end we succeed with v4hi / v8qi but also peel for gaps even though
> we handle the half-load case fine.
>
> >   }
> > }
> >
> > When I compiled with `-O3 -mavx2/-msse4.2`, SLP vectorizer failed to
> > vectorize it, and I got the following message when adding
> > `-fopt-info-vec-all`. (The inner loop will be unrolled)
> >
> > :6:21: optimized: loop vectorized using 8 byte vectors
> > :6:21: optimized:  loop versioned for vectorization because of
> > possible aliasing
> > :5:6: note: vectorized 1 loops in function.
>
> ^^^
>
> so you do see the vectorization as outlined above.
>
> > :5:6: note: * Analysis failed with vector mode V8SI
> > :5:6: note: * The result for vector mode V32QI would be the same
> > :5:6: note: * Re-trying analysis with vector mode V16QI
> > :5:6: note: * Analysis failed with vector mode V16QI
> > :5:6: note: * Re-trying analysis with vector mode V8QI
> > :5:6: note: * Analysis failed with vector mode V8QI
> > :5:6: note: * Re-trying analysis with vector mode V4QI
> > :5:6: note: * Analysis failed with vector mode V4QI
> >
> > If I manually use the type declaration provided by `immintrin.h` to
> > rewrite the code, the code is as follows (which I hope the SLP
> > vectorizer to be able to do)
> >
> > void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
> >   for (int y = 0; y < 4; y++) {
> > __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
> > __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
> > __v4hi diff_v = pix1_v - pix2_v;
> > *(long long *)(diff + y * 4) = (long long)diff_v;
>
> We kind-of do it this way, just
>
> __v8qi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3], 0, 0, 0, 0};
> ...
>
> and then unpack __v8qi low to v4hi.
>
> And unfortunately the last two outer iterations are scalar because of the
> gap issue.  There's some PRs about this, I did start to work on improving 
> this,
> I'm not sure this exact case is covered so can you open a new bugreport?
>
> > pix1 += 16;
> > pix2 += 32;
> >   }
> > }
> >
> > What I want to know is why SLP vectorizer can't vectorize the code
> > here, and what changes do I need to make to SLP vectorizer or the
> > source code if I want it to do so?
> >
> > Thanks
> > Hanke Zhang


Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h

2024-05-30 Thread Hanke Zhang via Gcc
Hi,
I've recently been trying to hand-write code to trigger automatic
vectorization optimizations in GCC on Intel x86 machines (without
using the interfaces in immintrin.h), but I'm running into a problem
where I can't seem to get the concise `vpmovzxbd` or similar
instructions.

My requirement is to convert 8 `uint8_t` elements to `int32_t` type
and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
immintrin.h, the code is as follows:

int immintrin () {
int size = 1, offset = 3;
uint8_t* a = malloc(sizeof(char) * size);

__v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));

for (int i = 0; i < 8; i++) {
printf("%d\n", b[i]);
}
}

After compiling with -mavx2 -O3, you can get concise and efficient
instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)

But if I do not use this interface and instead use a for-loop or the
`__builtin_convertvector` interface provided by GCC, I cannot achieve
the above effect. The code is as follows:

typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
int forloop () {
int size = 1, offset = 3;
uint8_t* a = malloc(sizeof(char) * size);

v8qiu av = *(v8qiu *)(a + offset);
__v8si b = {};
for (int i = 0; i < 8; i++) {
b[i] = (a + offset)[i];
}

for (int i = 0; i < 8; i++) {
printf("%d\n", b[i]);
}
}

int builtin_cvt () {
int size = 1, offset = 3;
uint8_t* a = malloc(sizeof(char) * size);

v8qiu av = *(v8qiu *)(a + offset);
__v8si b = __builtin_convertvector(av, __v8si);

for (int i = 0; i < 8; i++) {
printf("%d\n", b[i]);
}
}

The instructions generated by both functions are redundant and
complex, and are quite difficult to read compared to calling
`_mm256_cvtepu8_epi32` directly. (You can see it here as well:
https://godbolt.org/z/8ojzdav47)

What I want to ask is: How should I write the source code to get
assembly instructions similar to directly calling
_mm256_cvtepu8_epi32?

Or would it be easier if I modified the GIMPLE directly? But it seems
that there is no relevant expression or interface directly
corresponding to `vpmovzxbd` in GIMPLE.

Thanks
Hanke Zhang


How to represent a fallthrough condition (with no else) in "Match and Simplify"?

2024-06-11 Thread Hanke Zhang via Gcc
Hi,

I've been studying "Match and Simplify" recently, and I have this sample code:

int main() {
  int n = 1000;
  int *a = malloc (sizeof(int) * n);
  int *b = malloc (sizeof(int) * n);
  int *c = malloc (sizeof(int) * n);
  for (int i = 0; i < n; i++) {
if (a[i] & b[i]) {
  a[i] ^= c[i];
}
  }
}

But this code cannot be vectorized very well. I hope it can become like this:

int main() {
  int n = 1000;
  int *a = malloc (sizeof(int) * n);
  int *b = malloc (sizeof(int) * n);
  int *c = malloc (sizeof(int) * n);
  for (int i = 0; i < n; i++) {
int cond = ((a[i] & b[i]) != 0);
unsigned int mask = cond ? -1 : 0;
a[i] ^= (c[i] & mask);
  }
}


This can finally result in concise and efficient vectorized
instructions. But I want to know whether this can be achieved through
"Match and Simplify". When I tried to write the pattern, I
found that the conditional statement here could not be matched
well, as there is no else block.

Or is this not possible with "Match and Simplify"? Is it possible to
implement it in if-conversion?

Thanks
Hanke Zhang


Re: Question about optimizing function pointers for direct function calls

2024-06-12 Thread Hanke Zhang via Gcc
On Fri, May 24, 2024 at 14:39, Richard Biener wrote:
>
> On Fri, May 24, 2024 at 5:53 AM Hanke Zhang via Gcc  wrote:
> >
> > Hi,
> > I got a question about optimizing function pointers for direct
> > function calls in C.
> >
> > Consider the following scenario: one of the fields of a structure is a
> > function pointer, and all its assignments come from the same function.
> > Can all its uses be replaced by direct calls to this function? So the
> > later passes can do more optimizations.
> >
> > Here is the example:
> >
> > int add(int a, int b) { return a + b; }
> > int sub(int a, int b) { return a - b; }
> >
> > struct Foo {
> > int (*add)(int, int);
> > };
> > int main()
> > {
> > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5);
> >
> > for (int i = 0; i < 5; i++) {
> > foo[i].add = add;
> > }
> >
> > int sum = 0;
> > for (int i = 0; i < 5; i++) {
> > sum += foo[i].add(1, 2);
> > }
> >
> > return 0;
> > }
> >
> > Can I replace the code above to the code below?
> >
> > int add(int a, int b) { return a + b; }
> > int sub(int a, int b) { return a - b; }
> >
> > struct Foo {
> > int (*add)(int, int);
> > };
> > int main()
> > {
> > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5);
> >
> > for (int i = 0; i < 5; i++) {
> > foo[i].add = add;
> > }
> >
> > int sum = 0;
> > for (int i = 0; i < 5; i++) {
> > sum += add(1,2);
> > }
> >
> > return 0;
> > }
> >
> > My idea is to determine whether the assignment of the field is the
> > same function, and if so, perform the replacement.
>
> If it's as simple as above then sure, even CSE should do it.  If you
> can prove otherwise the memory location with the function pointer
> always has the same value you are obviously fine.  If you just
> do not see any other store via 'add's FIELD_DECL then no, that
> isn't good enough.  Every void * store you do not know where it
> goes might go to that slot.
>
> > Of course this is not a reasonable optimization, I just want to know
> > if there are security issues in doing so, and if I want to do it in
> > the IPA stage, is it possible?
>
> For the more general case you can do what we do for speculative
> devirtualization - replace the code with
>
>   sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2);

Hi Richard,

I'm trying to do what you suggested. (sum += foo[i].add == add ? add
(1,2) : foo[i].add(1,2);)

I created a new IPA pass before IPA-INLINE, and I made the changes to
the GIMPLE in the "function_transform" stage. But my newly created
"foo[i].add(1,2)" call seems to fail to be inlined, and it was optimized
out in the subsequent FRE. I would like to ask if there is any way to
mark my newly created cgraph_edge as inlinable in the
"function_transform" stage.

Here is part of the code creating the call_stmt in true basic_block in
my IPA-PASS:

// ...
// part of the code have been omitted
auto_vec<tree> params;
for (int i = 0; i < gimple_call_num_args (call_stmt); i++) {
  params.safe_push (gimple_call_arg (call_stmt, i));
}
gimple* tbb_call = gimple_build_call_vec (fndecl, params); /// = fn()
tree tbb_ssa;
if (ret_val_flag) {
  tbb_ssa = make_ssa_name (TREE_TYPE (gimple_call_lhs(call_stmt)), NULL);
  gimple_call_set_lhs (tbb_call, tbb_ssa); /// _2 = fn()
}
gsi = gsi_start_bb (tbb);
gsi_insert_before (&gsi, tbb_call, GSI_SAME_STMT);
cgraph_edge* tbb_callee = node->create_edge (cgraph_node::get_create
(fndecl), (gcall*)tbb_call, tbb_call->bb->count, true);
// what should I do to this cgraph_edge to mark it to be inlined
// ...

Or should I not do these things in the function_transform stage?
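
(An alternative I am considering, instead of rewriting the body in
function_transform, is to mark the indirect edge speculative during the
IPA analysis stage and let the existing machinery expand and inline it,
roughly the way ipa-profile does; an untested sketch:)

cgraph_edge *ie = node->get_edge (call_stmt);   /* the indirect call */
cgraph_node *target = cgraph_node::get_create (fndecl);
ie->make_speculative (target, ie->count.apply_scale (8, 10));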

Thanks
Hanke Zhang


>
> that way we can inline the direct call and hopefully the branch will be
> well predicted.
>
> In some SPEC there is IIRC the situation where such speculative
> devirtualization candidates can be found solely based on function
> signature.  With LTO/IPA you'd basically collect candidate targets
> for each indirect call (possibly address-taken function definitions
> with correct signature) and if there's just a single one you can
> choose that as speculative devirt target.
>
> Speculative devirt as we have now of course works with profile
> data to identify the most probable candidate.
>
> Richard.
>
> >
> > Thanks
> > Hanke Zhang


Re: Question about optimizing function pointers for direct function calls

2024-06-12 Thread Hanke Zhang via Gcc
Hi Richard,

Thanks for your reply. I'm looking forward to hearing from Honza!

But I still have a small question about the IPA passes. If I want to
make such a conversion (sum += foo[i].add == add ? add (1,2) :
foo[i].add (1,2);), I definitely need to change the function's body.
But the article "Using summary information in IPA
passes"(https://gcc.gnu.org/onlinedocs/gccint/link-time-optimization/using-summary-
Information-in-ipa-passes.html) mentioned that if IPA-Pass needs to
change the function body, it needs to be done in the transform stage,
but the execution stage of IPA-inline has ended before the transform
stage. So we still can’t mark the newly created function
calls/cgraph_edge as inlineable. Is that correct?

Or did I understand something wrong?

Thanks
Hanke Zhang

On Wed, Jun 12, 2024 at 18:49, Richard Biener wrote:
>
> On Wed, Jun 12, 2024 at 11:57 AM Hanke Zhang  wrote:
> >
> > On Fri, May 24, 2024 at 14:39, Richard Biener wrote:
> > >
> > > On Fri, May 24, 2024 at 5:53 AM Hanke Zhang via Gcc  
> > > wrote:
> > > >
> > > > Hi,
> > > > I got a question about optimizing function pointers for direct
> > > > function calls in C.
> > > >
> > > > Consider the following scenario: one of the fields of a structure is a
> > > > function pointer, and all its assignments come from the same function.
> > > > Can all its uses be replaced by direct calls to this function? So the
> > > > later passes can do more optimizations.
> > > >
> > > > Here is the example:
> > > >
> > > > int add(int a, int b) { return a + b; }
> > > > int sub(int a, int b) { return a - b; }
> > > >
> > > > struct Foo {
> > > > int (*add)(int, int);
> > > > };
> > > > int main()
> > > > {
> > > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5);
> > > >
> > > > for (int i = 0; i < 5; i++) {
> > > > foo[i].add = add;
> > > > }
> > > >
> > > > int sum = 0;
> > > > for (int i = 0; i < 5; i++) {
> > > > sum += foo[i].add(1, 2);
> > > > }
> > > >
> > > > return 0;
> > > > }
> > > >
> > > > Can I replace the code above to the code below?
> > > >
> > > > int add(int a, int b) { return a + b; }
> > > > int sub(int a, int b) { return a - b; }
> > > >
> > > > struct Foo {
> > > > int (*add)(int, int);
> > > > };
> > > > int main()
> > > > {
> > > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5);
> > > >
> > > > for (int i = 0; i < 5; i++) {
> > > > foo[i].add = add;
> > > > }
> > > >
> > > > int sum = 0;
> > > > for (int i = 0; i < 5; i++) {
> > > > sum += add(1,2);
> > > > }
> > > >
> > > > return 0;
> > > > }
> > > >
> > > > My idea is to determine whether the assignment of the field is the
> > > > same function, and if so, perform the replacement.
> > >
> > > If it's as simple as above then sure, even CSE should do it.  If you
> > > can prove otherwise the memory location with the function pointer
> > > always has the same value you are obviously fine.  If you just
> > > do not see any other store via 'add's FIELD_DECL then no, that
> > > isn't good enough.  Every void * store you do not know where it
> > > goes might go to that slot.
> > >
> > > > Of course this is not a reasonable optimization, I just want to know
> > > > if there are security issues in doing so, and if I want to do it in
> > > > the IPA stage, is it possible?
> > >
> > > For the more general case you can do what we do for speculative
> > > devirtualization - replace the code with
> > >
> > >   sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2);
> >
> > Hi Richard,
> >
> > I'm trying to do what you suggested. (sum += foo[i].add == add ? add
> > (1,2) : foo[i].add(1,2);)
> >
> > I created a new IPA-Pass before IPA-INLINE and I made the changes on
> > the GIMPLE in the "function_transform" stage. But my newly created
> > "foo[i].add(1,2)" seems to fail to be inlined. And it was optimized
> > out in the subsequent FRE. I would like to ask if there is any w

Question about enabling additional flags in Ofast

2024-07-13 Thread Hanke Zhang via Gcc
Hi,

I'm attempting to enable more flags in Ofast, but I've encountered some issues.

For instance, if I want to add -flto-partition=one to Ofast, here is
the modification I made to opts.cc

/* -Ofast adds optimizations to -O3.  */
{ OPT_LEVELS_FAST, OPT_ffast_math, NULL, 1 },
{ OPT_LEVELS_FAST, OPT_fallow_store_data_races, NULL, 1 },
{ OPT_LEVELS_FAST, OPT_fsemantic_interposition, NULL, 0 },
++ { OPT_LEVELS_FAST, OPT_flto_partition_, NULL, LTO_PARTITION_ONE },

However, when I run the tests (make -k check), I encounter some
problems. One of the failed test files is gcc.dg/pr89342.c:

/* PR other/89342 */
/* { dg-do compile } */
/* { dg-options "-O0" } */

__attribute__((optimize("Ofast")))
void foo (void)
{
  __attribute__((optimize("no-inline")))
  void bar (void) {}
  bar ();
}

The error message is as follows:

../gcc/testsuite/gcc.dg/pr89342.c:7:1: internal compiler error:
‘global_options’ are modified in local context
7 | {
  | ^
0x9e8cf4 handle_optimize_attribute
../../gcc/c-family/c-attribs.cc:5568
0x8e0902 decl_attributes(tree_node**, tree_node*, int, tree_node*)
../../gcc/attribs.cc:872
0x8fe39c start_function(c_declspecs*, c_declarator*, tree_node*)
../../gcc/c/c-decl.cc:9519
0x962336 c_parser_declaration_or_fndef
../../gcc/c/c-parser.cc:2445
0x96b803 c_parser_external_declaration
../../gcc/c/c-parser.cc:1779
0x96c263 c_parser_translation_unit
../../gcc/c/c-parser.cc:1652
0x96c263 c_parse_file()
../../gcc/c/c-parser.cc:23372
0x9cf4a5 c_common_parse_file()
../../gcc/c-family/c-opts.cc:1240

Upon debugging, I found that the error occurs in the
cl_optimization_compare(gcc_options *ptr1, gcc_options *ptr2) function
in options-save.cc. Specifically, the discrepancy is here:

ptr1->x_flag_lto_partition: 2, ptr2->x_flag_lto_partition: 1

This discrepancy leads to the error. Could you advise on how to
resolve this issue?

(Note: My ultimate goal is not to add the -flto-partition=one flag.
This is just an example to illustrate the problem.)

Thank you for your assistance.

Thanks
Hanke Zhang


Re: Question about optimizing dynamic_cast call in C++ at GIMPLE Level

2025-01-18 Thread Hanke Zhang via Gcc
Hi Florian,

Thank you very much for your reply.

This is not related to ELF. What I am trying to do is simply obtain
the 'Derived' type node from the parameter that is retrieved via the
'__dynamic_cast' call.

I believe this should be a relatively simple coding problem. However,
I am not very familiar with GCC source code at present, and that is
the reason why I am asking for help here.

Thanks,
Hanke Zhang

On Sat, Jan 18, 2025 at 20:50, Florian Weimer wrote:
>
> * Hanke Zhang via Gcc:
>
> > I have recently been delving into optimizing dynamic_cast calls in C++
> > within the context of GIMPLE code. In particular, for scenarios
> > involving single inheritance, I'm aiming to convert the following
> > code:
> >
> > _1 = __dynamic_cast (obj_1(D), &_ZTI7Base, &_ZTI10Derived, 0);
> > if (_1!= 0B)
> >
> > into a more direct and efficient form, such as:
> >
> > _2 = MEM[obj_1(D)];
> > _3 = MEM[_2 + -8B];
> > if (_3 == &_ZTI10Derived)
> >
> > However, during this process, I encountered an issue. I attempted to
> > obtain the type node of the Derived class through the _ZTI10Derived,
> > which I believe should be a VAR_DECL. Regrettably, my efforts were
> > unsuccessful.
>
> Is this for ELF?  I think our ELF ABIs require that a full string
> comparison is performed if the pointers do not match.  Otherwise, types
> with different vtables will suddenly be considered distinct under RTTI.
> The symbols for vtables can be hidden if “local: *;” is used in a linker
> script, for example.  Currently, as far as I understand it, this does
> not have a direct impact on RTTI.
>
> Thanks,
> Florian
>


Question about optimizing dynamic_cast call in C++ at GIMPLE Level

2025-01-18 Thread Hanke Zhang via Gcc
Hi,

I have recently been delving into optimizing dynamic_cast calls in C++
within the context of GIMPLE code. In particular, for scenarios
involving single inheritance, I'm aiming to convert the following
code:

_1 = __dynamic_cast (obj_1(D), &_ZTI7Base, &_ZTI10Derived, 0);
if (_1!= 0B)

into a more direct and efficient form, such as:

_2 = MEM[obj_1(D)];
_3 = MEM[_2 + -8B];
if (_3 == &_ZTI10Derived)
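
(At the C++ source level, the guarded fast path I have in mind
corresponds roughly to the sketch below, assuming <typeinfo> is
included: an exact-most-derived-type check, keeping the ordinary
dynamic_cast as the slow path for objects of classes further derived
from Derived.)

Derived *d = (typeid (*obj) == typeid (Derived))
               ? static_cast<Derived *> (obj)
               : dynamic_cast<Derived *> (obj);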

However, during this process, I encountered an issue. I attempted to
obtain the type node of the Derived class through the _ZTI10Derived,
which I believe should be a VAR_DECL. Regrettably, my efforts were
unsuccessful.

I'm reaching out to inquire if there's a viable method to achieve
this. Specifically, is there a way to retrieve the type node of the
original class from the third parameter of the __dynamic_cast call?
Additionally, if there are any other aspects I should be aware of
during this optimization process, I would greatly appreciate your
insights.

Thank you very much for your time and assistance.

Best regards,
Hanke Zhang