How to make parallelizing loops and vectorization work at the same time?
Hi, I'm trying to accelerate my program with -ftree-vectorize and -ftree-parallelize-loops.

Here are my test results using different options (gcc 10.3.0 on an i9-12900KF):

gcc-10 test.c -O3 -flto
> time: 29000 ms
gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize
> time: 17000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24
> time: 5000 ms
gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize
> time: 5000 ms

I found that these two options do not work at the same time: if I use the `-ftree-vectorize` option alone, it brings a big efficiency gain compared to doing nothing, and if I use `-ftree-parallelize-loops` alone, it also brings a big efficiency gain. But if I use both options, vectorization fails; I only get the benefit of parallelizing loops, not the benefit of vectorization.

I suspect the reason is that once the loop is parallelized it can no longer be vectorized, but is there any way I can reap the benefits of both optimizations?

Here is my example program, adapted from 462.libquantum in SPEC CPU2006:

```
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

#define MAX_UNSIGNED unsigned long long

struct quantum_reg_node_struct {
  float _Complex *amplitude; /* alpha_j */
  MAX_UNSIGNED *state;       /* j */
};

typedef struct quantum_reg_node_struct quantum_reg_node;

struct quantum_reg_struct {
  int width;  /* number of qubits in the qureg */
  int size;   /* number of non-zero vectors */
  int hashw;  /* width of the hash array */
  quantum_reg_node *node;
  int *hash;
};

typedef struct quantum_reg_struct quantum_reg;

void quantum_toffoli(int control1, int control2, int target, quantum_reg *reg) {
  for (int i = 0; i < reg->size; i++) {
    if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) {
      if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) {
        reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target);
      }
    }
  }
}

int get_random() {
  return rand() % 64;
}

void init(quantum_reg *reg) {
  reg->size = 2097152;
  for (int i = 0; i < reg->size; i++) {
    reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node));
    reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) * reg->size);
    reg->node->amplitude = (float _Complex *)malloc(sizeof(float _Complex) * reg->size);
    if (i >= 1)
      break;
  }
  for (int i = 0; i < reg->size; i++) {
    reg->node->amplitude[i] = 0;
    reg->node->state[i] = 0;
  }
}

int main() {
  quantum_reg reg;
  init(&reg);
  for (int i = 0; i < 65000; i++) {
    quantum_toffoli(get_random(), get_random(), get_random(), &reg);
  }
}
```

Thanks so much.
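One way to get both thread-level and SIMD parallelism without relying on the autoparallelizer is to annotate the hot loop with OpenMP explicitly. The sketch below is not from the thread; it assumes the data layout of quantum_toffoli above, the function name quantum_toffoli_omp is made up, and it must be built with -fopenmp:

```
/* Hypothetical OpenMP variant of quantum_toffoli; build with e.g.
   gcc -O3 -mavx2 -fopenmp test.c.  This is a sketch, not code from
   the thread. */
void quantum_toffoli_omp(int control1, int control2, int target, quantum_reg *reg) {
  MAX_UNSIGNED *state = reg->node->state;
  MAX_UNSIGNED c1 = (MAX_UNSIGNED)1 << control1;
  MAX_UNSIGNED c2 = (MAX_UNSIGNED)1 << control2;
  MAX_UNSIGNED t  = (MAX_UNSIGNED)1 << target;
  int n = reg->size;

  /* Threads split the iteration space; each thread's chunk is still
     eligible for vectorization through the simd part of the construct. */
  #pragma omp parallel for simd schedule(static)
  for (int i = 0; i < n; i++) {
    if ((state[i] & c1) && (state[i] & c2))
      state[i] ^= t;
  }
}
```

Keeping the parallelization decision in the source means it does not depend on -ftree-parallelize-loops at all, so it cannot conflict with the vectorizer the way the autopar pass does.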
Re: How to make parallelizing loops and vectorization work at the same time?
Richard Biener 于2023年9月15日周五 19:59写道: > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc wrote: > > > > Hi I'm trying to accelerate my program with -ftree-vectorize and > > -ftree-parallelize-loops. > > > > Here are my test results using the different options (based on > > gcc10.3.0 on i9-12900KF): > > gcc-10 test.c -O3 -flto > > > time: 29000 ms > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize > > > time: 17000 ms > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 > > > time: 5000 ms > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 -ftree-vectorize > > > time: 5000 ms > > > > First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what > brings > the first gain. So adding -ftree-vectorize to the last command-line is not > expected to change anything. Instead you can use -fno-tree-vectorize on > the second last one. Doing that I get 111s vs 41s thus doing both helps. > > Note parallelization hasn't seen any development in the last years. > > Richard. Hi Richard: Thank you for your sincere reply. I get what you mean above. But I still see the following after I add `-fipo-info-vec`: gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec > test.c:29:5: optimized: loop vectorized using 32 byte vectors gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24 > nothing happened That means the vectorization does nothing help actually. At the same time, I added `-fno-tree-vectorize` to the second last one command. It did not bring about a performance change on my computer. So I still think only parallel loops work. Hanke Zhang > > > I found that these two options do not work at the same time, that is, > > if I use the `-ftree-vectorize` option alone, it can bring a big > > efficiency gain compared to doing nothing; At the same time, if I use > > the option of `-ftree-parallelize-loops` alone, it will also bring a > > big efficiency gain. But if I use both options, vectorization fails, > > that is, I can't get the benefits of vectorization, I can only get the > > benefits of parallelizing loops. > > > > I know that the reason may be that after parallelizing the loop, > > vectorization cannot be performed, but is there any way I can reap the > > benefits of both optimizations? 
> > > > Here is my example program, adapted from the 462.libquantum in speccpu2006: > > > > ``` > > #include > > #include > > #include > > > > #define MAX_UNSIGNED unsigned long long > > > > struct quantum_reg_node_struct { > > float _Complex *amplitude; /* alpha_j */ > > MAX_UNSIGNED *state; /* j */ > > }; > > > > typedef struct quantum_reg_node_struct quantum_reg_node; > > > > struct quantum_reg_struct { > > int width; /* number of qubits in the qureg */ > > int size; /* number of non-zero vectors */ > > int hashw; /* width of the hash array */ > > quantum_reg_node *node; > > int *hash; > > }; > > > > typedef struct quantum_reg_struct quantum_reg; > > > > void quantum_toffoli(int control1, int control2, int target, quantum_reg > > *reg) { > > for (int i = 0; i < reg->size; i++) { > > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) { > > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control2)) { > > reg->node->state[i] ^= ((MAX_UNSIGNED)1 << target); > > } > > } > > } > > } > > > > int get_random() { > > return rand() % 64; > > } > > > > void init(quantum_reg *reg) { > > reg->size = 2097152; > > for (int i = 0; i < reg->size; i++) { > > reg->node = (quantum_reg_node *)malloc(sizeof(quantum_reg_node)); > > reg->node->state = (MAX_UNSIGNED *)malloc(sizeof(MAX_UNSIGNED) > > * reg->size); > > reg->node->amplitude = (float _Complex *)malloc(sizeof(float > > _Complex) * reg->size); > > if (i >= 1) break; > > } > > for (int i = 0; i < reg->size; i++) { > > reg->node->amplitude[i] = 0; > > reg->node->state[i] = 0; > > } > > } > > > > int main() { > > quantum_reg reg; > > init(®); > > for (int i = 0; i < 65000; i++) { > > quantum_toffoli(get_random(), get_random(), get_random(), ®); > > } > > } > > ``` > > > > Thanks so much.
Re: How to make parallelizing loops and vectorization work at the same time?
I get it. It's a `lto` problem. If I remove `-flto`, both work. Thanks for your help again! Richard Biener 于2023年9月15日周五 21:13写道: > > On Fri, Sep 15, 2023 at 3:09 PM Hanke Zhang wrote: > > > > Richard Biener 于2023年9月15日周五 19:59写道: > > > > > > > > On Fri, Sep 15, 2023 at 1:21 PM Hanke Zhang via Gcc > > > wrote: > > > > > > > > Hi I'm trying to accelerate my program with -ftree-vectorize and > > > > -ftree-parallelize-loops. > > > > > > > > Here are my test results using the different options (based on > > > > gcc10.3.0 on i9-12900KF): > > > > gcc-10 test.c -O3 -flto > > > > > time: 29000 ms > > > > gcc-10 test.c -O3 -flto -mavx2 -ftree-vectorize > > > > > time: 17000 ms > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 > > > > > time: 5000 ms > > > > gcc-10 test.c -O3 -flto -ftree-parallelize-loops=24 -mavx2 > > > > -ftree-vectorize > > > > > time: 5000 ms > > > > > > > > > > First of all -O3 already enables -ftree-vectorize, adding -mavx2 is what > > > brings > > > the first gain. So adding -ftree-vectorize to the last command-line is > > > not > > > expected to change anything. Instead you can use -fno-tree-vectorize on > > > the second last one. Doing that I get 111s vs 41s thus doing both helps. > > > > > > Note parallelization hasn't seen any development in the last years. > > > > > > Richard. > > > > Hi Richard: > > > > Thank you for your sincere reply. > > > > I get what you mean above. But I still see the following after I add > > `-fipo-info-vec`: > > > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec > > > test.c:29:5: optimized: loop vectorized using 32 byte vectors > > gcc-10 test.c -O3 -flto -mavx2 -fopt-info-vec -ftree-parallelize-loops=24 > > > nothing happened > > > > That means the vectorization does nothing help actually. > > > > At the same time, I added `-fno-tree-vectorize` to the second last one > > command. It did not bring about a performance change on my computer. > > > > So I still think only parallel loops work. > > I checked GCC 13 and do see vectorized loops when parallelizing. > > Richard. > > > Hanke Zhang > > > > > > > > > I found that these two options do not work at the same time, that is, > > > > if I use the `-ftree-vectorize` option alone, it can bring a big > > > > efficiency gain compared to doing nothing; At the same time, if I use > > > > the option of `-ftree-parallelize-loops` alone, it will also bring a > > > > big efficiency gain. But if I use both options, vectorization fails, > > > > that is, I can't get the benefits of vectorization, I can only get the > > > > benefits of parallelizing loops. > > > > > > > > I know that the reason may be that after parallelizing the loop, > > > > vectorization cannot be performed, but is there any way I can reap the > > > > benefits of both optimizations? 
> > > > > > > > Here is my example program, adapted from the 462.libquantum in > > > > speccpu2006: > > > > > > > > ``` > > > > #include > > > > #include > > > > #include > > > > > > > > #define MAX_UNSIGNED unsigned long long > > > > > > > > struct quantum_reg_node_struct { > > > > float _Complex *amplitude; /* alpha_j */ > > > > MAX_UNSIGNED *state; /* j */ > > > > }; > > > > > > > > typedef struct quantum_reg_node_struct quantum_reg_node; > > > > > > > > struct quantum_reg_struct { > > > > int width; /* number of qubits in the qureg */ > > > > int size; /* number of non-zero vectors */ > > > > int hashw; /* width of the hash array */ > > > > quantum_reg_node *node; > > > > int *hash; > > > > }; > > > > > > > > typedef struct quantum_reg_struct quantum_reg; > > > > > > > > void quantum_toffoli(int control1, int control2, int target, > > > > quantum_reg *reg) { > > > > for (int i = 0; i < reg->size; i++) { > > > > if (reg->node->state[i] & ((MAX_UNSIGNED)1 << control1)) { > > > > if (reg->node->state[i] & ((
How to enable DCE during late_ipa_passes?
Hi, I am currently developing a new pass in the late_ipa_passes phase, located after pass_ipa_pta and before pass_omp_simd_clone. I would like the code to be vectorizable after my pass runs, that is, vectorizable by pass_omp_simd_clone.

That is not happening yet. My guess is that my pass introduces some dead code, so I would like a DCE pass to run right after it, in the hope that this lets pass_omp_simd_clone vectorize the code my pass produces.

However, adding NEXT_PASS(pass_dce) directly after my pass results in a runtime error, so I would like to ask: is there a good way to get pass_dce to run in the late_ipa_passes phase?

Hanke Zhang.
Re: How to enable DCE during late_ipa_passes?
Thanks! That works!

Richard Biener wrote on Mon, Sep 18, 2023 at 15:56:
>
> On Mon, Sep 18, 2023 at 9:17 AM Hanke Zhang via Gcc wrote:
> >
> > Hi, I am currently developing a new pass in the late_ipa_passes phase,
> > located after pass_ipa_pta and before pass_omp_simd_clone. I would like
> > the code to be vectorizable after my pass runs, that is, vectorizable
> > by pass_omp_simd_clone.
> >
> > That is not happening yet. My guess is that my pass introduces some
> > dead code, so I would like a DCE pass to run right after it, in the
> > hope that this lets pass_omp_simd_clone vectorize the code my pass
> > produces.
> >
> > However, adding NEXT_PASS(pass_dce) directly after my pass results in
> > a runtime error, so I would like to ask: is there a good way to get
> > pass_dce to run in the late_ipa_passes phase?
>
> You could try doing
>
>   NEXT_PASS (pass_your_ipa_pass)
>   PUSH_INSERT_PASSES_WITHIN (pass_your_ipa_pass)
>     NEXT_PASS (pass_dce)
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_omp_simd_clone)
>
> > Hanke Zhang.
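For reference, the corresponding fragment of gcc/passes.def would look roughly like the sketch below. pass_my_late_ipa is a placeholder for the new pass, and the surrounding passes may differ between GCC releases, so treat this as an illustration of the macro nesting rather than an exact diff:

```
  INSERT_PASSES_AFTER (all_late_ipa_passes)
  ...
  NEXT_PASS (pass_ipa_pta)
  /* Hypothetical new pass, with a DCE sub-pass pushed inside it.  */
  NEXT_PASS (pass_my_late_ipa)
  PUSH_INSERT_PASSES_WITHIN (pass_my_late_ipa)
      NEXT_PASS (pass_dce)
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_omp_simd_clone)
  ...
  TERMINATE_PASS_LIST (all_late_ipa_passes)
```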
Loop fusion in gcc
Hi, I have been very interested in loop fusion recently. I found that both LLVM and icc have implemented this optimization, while gcc does not seem to implement it. I would like to ask whether gcc has any plans to implement it?

In addition, I also found that there is a function called `fuse_loop` in gcc/gimple-loop-jam.cc. Can it achieve the purpose of loop fusion?

Hanke Zhang.
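For readers unfamiliar with the term, loop fusion merges two adjacent loops with compatible iteration spaces into one, improving locality and reducing loop overhead. A minimal before/after sketch (the function names are made up, and restrict makes the no-alias assumption explicit, which a compiler would otherwise have to prove before fusing):

```
/* Before fusion: two passes over the same data.  */
void unfused(int n, int *restrict a, int *restrict b, int *restrict c) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + 1;
  for (int i = 0; i < n; i++)
    c[i] = a[i] * 2;
}

/* After fusion: one pass, a[i] is still hot when c[i] is computed.
   Legal here because iteration i of the second loop only depends on
   iteration i of the first loop.  */
void fused(int n, int *restrict a, int *restrict b, int *restrict c) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
  }
}
```

For what it is worth, gcc/gimple-loop-jam.cc implements unroll-and-jam, where fuse_loop fuses the inner-loop copies produced by unrolling an outer loop; that is related to, but narrower than, general loop fusion.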
The order of loop traversal in gcc
Hi, I have recently been working on loops in gcc, and I have some questions about loop traversal.

I use loops_list(cfun, LI_ONLY_INNERMOST) in my pass to traverse the loops. I found that the traversal order can differ from the order in which the loops actually execute: sometimes it is the exact opposite, and sometimes it is sequential. I would like to ask how to predict this behavior, and what method I should use if I want to traverse the loops in program order?

Thanks.

Hanke Zhang.
Re: The order of loop traversal in gcc
Richard Biener 于2023年9月25日周一 13:46写道: > > > > > Am 25.09.2023 um 04:53 schrieb Hanke Zhang via Gcc : > > > > Hi, I have recently been working on loops in gcc, and I have some > > questions about the loop traversal. > > > > I use loops_list(cfun, LI_ONLY_INNERMOST) to traverse the loops in my > > pass to obtain the loop. > > > > I found that the order of traversal and the order of actual > > instruction execution will be different. > > > > Sometimes it's the exact opposite, and sometimes it's sequential. I > > would like to ask how to predict its behavior? And what method should > > I use if I want to obtain sequential traversal? > > The order of loops is according to their index, if you require traversal in > program order you have to order them yourself, for example by sorting them > after their header program order. > > Richard > Thanks. Get it! > > Thanks. > > Hanke Zhang.
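A minimal sketch of what "order them yourself" might look like inside a GIMPLE pass: collect the innermost loops and sort them by the index of their header basic block. This is illustrative only and assumes the usual pass context (cfun, gimple and cfg headers); header->index follows block numbering rather than strict program order, so for exact source order you would rank headers by their position in a reverse post-order walk instead:

```
/* Sketch: gather innermost loops and visit them sorted by header
   block index (an approximation of program order).  */
static int
cmp_loop_by_header_index (const void *a, const void *b)
{
  const class loop *la = *(const class loop * const *) a;
  const class loop *lb = *(const class loop * const *) b;
  return la->header->index - lb->header->index;
}

static void
visit_innermost_loops_in_order (void)
{
  auto_vec<class loop *> worklist;
  for (auto loop : loops_list (cfun, LI_ONLY_INNERMOST))
    worklist.safe_push (loop);

  worklist.qsort (cmp_loop_by_header_index);

  for (class loop *loop : worklist)
    {
      /* Process LOOP here, now in (approximate) program order.  */
    }
}
```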
Question about merging if-else blocks
Hi, I have recently been working on merging if-else statement blocks, and I found a rather strange phenomenon that I would like to ask about.

Roughly speaking, for two consecutive if-else blocks whose conditions are exactly the same, I expect them to be merged, as in the following program:

int a = atoi(argv[1]);
if (a) {
  printf("if 1");
} else {
  printf("else 1");
}
if (a) {
  printf("if 2");
} else {
  printf("else 2");
}

With the -O3 -flto options it is indeed optimized into:

int a = atoi(argv[1]);
if (a) {
  printf("if 1");
  printf("if 2");
} else {
  printf("else 1");
  printf("else 2");
}

But `a` here is a local variable. If I declare `a` as a global variable, it is not optimized like this. I would like to ask why this is, and whether there is any solution.

Thanks.

Hanke Zhang.
Re: Question about merging if-else blocks
Thanks! I understand what you mean, then can I think that if the function here is not an external function, but a function visible to the compiler and the function doesn't modify `a`, then these two blocks can be merged? Marc Glisse 于2023年9月27日周三 12:51写道: > > On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote: > > > Hi, I have recently been working on merging if-else statement blocks, > > and I found a rather bizarre phenomenon that I would like to ask > > about. > > A rough explanation is that for two consecutive if-else blocks, if > > their if statements are exactly the same, they should be merged, like > > the following program: > > > > int a = atoi(argv[1]); > > if (a) { > > printf("if 1"); > > } else { > > printf("else 1"); > > } > > if (a) { > > printf("if 2"); > > } else { > > printf("else 2"); > > } > > > > After using the -O3 -flto optimization option, it can be optimized as > > follows: > > > > int a = atoi(argv[1]); > > if (a) { > > printf("if 1"); > > printf("if 2"); > > } else { > > printf("else 1"); > > printf("else 2"); > > } > > > > But `a` here is a local variable. If I declare a as a global variable, > > it cannot be optimized as above. I would like to ask why this is? And > > is there any solution? > > If 'a' is a global variable, how do you know 'printf' doesn't modify its > value? (you could know it for printf, but it really depends on the > function that is called) > > -- > Marc Glisse
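A source-level workaround that sidesteps the aliasing question entirely is to load the global into a local once and branch on the local copy: calls in the branches cannot change the local, so the compiler can merge the two tests. Note this deliberately asserts that the second test should not observe any change the calls might make to the global, which is a semantic choice by the programmer. A minimal sketch (assuming `a` is now the global in question):

```
#include <stdio.h>

int a;   /* now a global, as in the question */

void two_tests(void) {
  int t = a;            /* load the global once into a local */
  if (t) {
    printf("if 1");
  } else {
    printf("else 1");
  }
  if (t) {              /* branches on the same local value, so the two
                           tests can be merged regardless of what the
                           printf calls might do to the global */
    printf("if 2");
  } else {
    printf("else 2");
  }
}
```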
Check whether the global variable has been modified in a function or some blocks
Hi, I have recently been working on issues related to changing values of global variables. I am trying to develop a gimple pass that needs to check whether the value of a global variable is modified in a function or in some blocks.

One of the trickier cases looks like this:

int type; // global variable

void foo() {
  int tmp;
  tmp = type; // save type in tmp
  type = 0;
  // do other things
  // type will be used or changed
  // ...
  type = tmp; // restore it
  return;
}

Taken as a whole, this function does not change the final value of type (it saves it and restores it), but from the compiler's point of view that is not so obvious. I would like to ask whether there is a good way to handle this, or whether gcc already provides a ready-made implementation to use.
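GCC already computes related summaries: local-pure-const and the IPA reference/modref machinery record which statics a function reads or writes, though they track "may write" rather than "saves and restores the original value"; detecting the save/restore pattern above would additionally need value tracking, e.g. comparing the stored value with the initial load. Below is a very rough sketch of a per-function scan for direct stores to a given global decl; it assumes the usual GCC pass headers, and it deliberately ignores calls and indirect stores, which is exactly where the hard part lies:

```
/* Sketch: does FN contain a direct store to the global VAR_DECL DECL?
   Stores through pointers and stores done by callees are not handled
   here; those need points-to / IPA reference information.  */
static bool
function_directly_writes_decl (function *fn, tree decl)
{
  basic_block bb;
  FOR_EACH_BB_FN (bb, fn)
    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
         gsi_next (&gsi))
      {
        gimple *stmt = gsi_stmt (gsi);
        if (is_gimple_assign (stmt)
            && gimple_assign_lhs (stmt) == decl)
          return true;
      }
  return false;
}
```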
Re: Question about merging if-else blocks
Richard Biener 于2023年9月27日周三 15:30写道: > > On Wed, Sep 27, 2023 at 7:21 AM Hanke Zhang via Gcc wrote: > > > > Thanks! I understand what you mean, then can I think that if the > > function here is not an external function, but a function visible to > > the compiler and the function doesn't modify `a`, then these two > > blocks can be merged? > > Yes. The key transform you'd see before any of the merging is > CSE of the loads from 'a', then the rest is equivalent to the local > variable case. > > Richard. Hi, Richard I'm still a little confused about this. I want to change the default behavior of gcc. We know that printf won't change the value of 'a'. I'd like to let the compiler to get this information as well. How can I do that? Or which pass should I focus on? By disassembling the exe file generated by icc, I found that icc will merge these two blocks with the example code below. So I think there maybe some ways to make it. Thanks. Hanke Zhang. > > > Marc Glisse 于2023年9月27日周三 12:51写道: > > > > > > On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote: > > > > > > > Hi, I have recently been working on merging if-else statement blocks, > > > > and I found a rather bizarre phenomenon that I would like to ask > > > > about. > > > > A rough explanation is that for two consecutive if-else blocks, if > > > > their if statements are exactly the same, they should be merged, like > > > > the following program: > > > > > > > > int a = atoi(argv[1]); > > > > if (a) { > > > > printf("if 1"); > > > > } else { > > > > printf("else 1"); > > > > } > > > > if (a) { > > > > printf("if 2"); > > > > } else { > > > > printf("else 2"); > > > > } > > > > > > > > After using the -O3 -flto optimization option, it can be optimized as > > > > follows: > > > > > > > > int a = atoi(argv[1]); > > > > if (a) { > > > > printf("if 1"); > > > > printf("if 2"); > > > > } else { > > > > printf("else 1"); > > > > printf("else 2"); > > > > } > > > > > > > > But `a` here is a local variable. If I declare a as a global variable, > > > > it cannot be optimized as above. I would like to ask why this is? And > > > > is there any solution? > > > > > > If 'a' is a global variable, how do you know 'printf' doesn't modify its > > > value? (you could know it for printf, but it really depends on the > > > function that is called) > > > > > > -- > > > Marc Glisse
Question about function splitting
Hi, I have some questions about the strategy and behavior of function splitting in gcc, for code like the following:

int glob;
void f() {
  if (glob) {
    printf("short path\n");
    return;
  }
  // do lots of expensive things
  // ...
}

I hope it can be split as below, so that the remaining small function can be inlined, which would be more efficient:

int glob;
void f() {
  if (glob) {
    printf("short path\n");
    return;
  }
  f_part();
}

void f_part() {
  // do lots of expensive things
  // ...
}

But instead gcc splits it like this, which not only brings no benefit but may even add cost, because the extra function call itself is not free:

int glob;
void f() {
  if (glob) {
    f_part();
    return;
  }
  // do lots of expensive things
  // ...
}

void f_part() {
  printf("short path\n"); // just do this
}

Are there any options I can pass to gcc to change this behavior, or do I need to make changes in ipa-split.cc?
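Until the pass behaves the way you want, one manual workaround is to do the outlining yourself: keep only the cheap early-exit test in the function you want inlined and move the expensive part out of line, optionally adding a branch-probability hint. Nothing in the thread prescribes this exact rewrite; the names below are made up for illustration:

```
#include <stdio.h>

int glob;

/* Expensive part kept out of line on purpose so that f() stays tiny.  */
static void __attribute__((noinline)) f_slow_path(void) {
  /* do lots of expensive things */
}

/* Small enough that callers can inline it; the hint tells the compiler
   the short path is the common one (drop it if that is not true).  */
static inline void f(void) {
  if (__builtin_expect(glob != 0, 1)) {
    printf("short path\n");
    return;
  }
  f_slow_path();
}
```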
Re: Question about function splitting
Martin Jambor 于2023年10月3日周二 00:34写道: > > Hello, > > On Mon, Oct 02 2023, Hanke Zhang via Gcc wrote: > > Hi, I have some questions about the strategy and behavior of function > > splitting in gcc, like the following code: > > > > int glob; > > void f() { > > if (glob) { > > printf("short path\n"); > > return; > > } > > // do lots of expensive things > > // ... > > } > > > > I hope it can be broken down like below, so that the whole function > > can perhaps be inlined, which is more efficient. > > > > int glob; > > void f() { > > if (glob) { > > printf("short path\n"); > > return; > > } > > f_part(); > > } > > > > void f_part() { > > // do lots of expensive things > > // ... > > } > > > > > > But on the contrary, gcc splits it like these, which not only does not > > bring any benefits, but may increase the time consumption, because the > > function call itself is a more resource-intensive thing. > > > > int glob; > > void f() { > > if (glob) { > > f_part(); > > return; > > } > > // do lots of expensive things > > // ... > > } > > > > void f_part() { > > printf("short path\n"); // just do this > > } > > > > Are there any options I can offer to gcc to change this behavior? Or > > do I need to make some changes in ipa-split.cc? > > I'd suggest you file a bug to Bugzilla with a specific example that is > mis-handled, then we can have a look and discuss what and why happens > and what can be done about it. > > Thanks, > > Martin Hi, thanks for your reply. I'm trying to create an account right now. And I put a copy of the example code here in case someone is interested. And I'm using gcc 12.3.0. When you complie the code below via 'gcc test.c -O3 -flto -fdump-tree-fnsplit', you will find a phenomenon that is consistent with what I described above in the gimple which is dumped from fnsplit. #include #include int opstatus; unsigned char *objcode = 0; unsigned long position = 0; char *globalfile; int test_split_write(char *file) { FILE *fhd; if (!opstatus) { // short path here printf("Object code generation not active! Forgot to call " "quantum_objcode_start?\n"); return 1; } if (!file) file = globalfile; fhd = fopen(file, "w"); if (fhd == 0) return -1; fwrite(objcode, position, 1, fhd); fclose(fhd); int *arr = malloc(1000); for (int i = 0; i < 1000; i++) { arr[i] = rand(); } return 0; } // to avoid `test_split_write` inlining into main void __attribute__((noinline)) call() { test_split_write("./txt"); } int main() { opstatus = rand(); objcode = malloc(100); position = 0; call(); return 0; }
Function return value can't be inferred when it's not inlined
Hi, I'm a little confused about the behavior of gcc when a function is not inlined. Here is an example:

int __attribute__((noinline)) foo() {
  return 1;
}

int main() {
  if (foo()) {
    printf("foo() returned 1\n");
  } else {
    printf("foo() returned 0\n");
  }
  return 0;
}

After compiling this with `-O3 -flto`, the else block is not optimized away and still exists. Even though it is obvious that the function always returns 1, the compiler does not see that. Does gcc only get this information by inlining the function, or is that simply how gcc works? If so, what change would let gcc get this information?

Thanks
Hanke Zhang
Re: Function return value can't be inferred when it's not inlined
Richard Biener 于2023年10月4日周三 16:43写道: > > On Wed, Oct 4, 2023 at 10:37 AM Richard Biener > wrote: > > > > On Tue, Oct 3, 2023 at 6:30 PM Hanke Zhang via Gcc wrote: > > > > > > Hi, I'm a little confused about the behavior of gcc when the function > > > is not inlined. > > > > > > Here is an example code: > > > > > > int __attribute__((noinline)) foo() { > > > return 1; > > > } > > > > > > int main() { > > > if (foo()) { > > > printf("foo() returned 1\n"); > > > } else { > > > printf("foo() returned 0\n"); > > > } > > > return 0; > > > } > > > > > > After compiling this via `-O3 -flto`, the else block isn't been > > > optimized and still exists. > > > > > > Even it's so obvious that the function will return '1', can't the > > > compiler see that? Does gcc only get this information by inlining the > > > function? Or is that what the gcc does? > > > > > > If so, how to make a change to let gcc get this information then? > > > > I think IPA-CP would be doing this but the issue is that historically > > 'noinline' also disabled other IPA transforms and we've kept that > > for backward compatibility even when we introduced the separate 'noipa' > > attribute. Thanks. The initial example is a function that uses va_args as parameters. It cannot be inlined because of va_args, and then its return value cannot be predicted like above. For example, the following function: int foo (int num, ...) { va_list args; va_start(args, num); int a1 = va_arg(args, int); int a2 = va_arg(args, int); printf("a1 = %d, a2 = %d\n", a1, a2); va_end(args); return 1; } int main() { if (foo(2, rand(), rand())) { printf("foo() returned 1\n"); } else { printf("foo() returned 0\n"); } return 0; } Wouldn't such a function be optimized like 'noinline'? > > Oh, and I forgot that IIRC neither IPA CP nor IPA SRA handle return > functions in a way exposing this to optimization passes (there's no > way to encode this in fnspec, we'd need some return value value-range > and record that and make VRP/ranger query it on calls). > > Richard. > Thanks. So, does that mean I have to let VRP/ranger be able to query the return value so that the else block can be optimized out? > > Richard. > > > > > > > > Thanks > > > Hanke Zhang
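When the obstacle is va_arg (as in the example above, which can never be inlined), one source-level workaround is to make the constant result visible at every call site, for example by producing it from a macro while keeping the variadic body out of line. This changes the API of foo (it can no longer be taken by address), so treat it as a sketch of the idea, with hypothetical names:

```
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Heavy variadic body, kept out of line; it no longer returns anything.  */
void foo_impl(int num, ...) {
  va_list args;
  va_start(args, num);
  int a1 = va_arg(args, int);
  int a2 = va_arg(args, int);
  printf("a1 = %d, a2 = %d\n", a1, a2);
  va_end(args);
}

/* The constant result is produced at the call site, so every caller
   sees a literal 1 and can fold branches on it.  */
#define foo(num, ...) (foo_impl((num), __VA_ARGS__), 1)

int main(void) {
  if (foo(2, rand(), rand())) {   /* condition folds to true */
    printf("foo() returned 1\n");
  } else {
    printf("foo() returned 0\n"); /* dead, can be removed */
  }
  return 0;
}
```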
Re: Question about function splitting
But when I change the code 'opstatus = rand()' to 'opstatus = rand() %2', the probability of opstatus being 0 should be 50%, but the result remains the same, i.e. still split at that point. And the specific information can be found in Bugzilla, the link is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111672 Richard Biener 于2023年10月4日周三 16:20写道: > > On Mon, Oct 2, 2023 at 7:15 PM Hanke Zhang via Gcc wrote: > > > > Martin Jambor 于2023年10月3日周二 00:34写道: > > > > > > Hello, > > > > > > On Mon, Oct 02 2023, Hanke Zhang via Gcc wrote: > > > > Hi, I have some questions about the strategy and behavior of function > > > > splitting in gcc, like the following code: > > > > > > > > int glob; > > > > void f() { > > > > if (glob) { > > > > printf("short path\n"); > > > > return; > > > > } > > > > // do lots of expensive things > > > > // ... > > > > } > > > > > > > > I hope it can be broken down like below, so that the whole function > > > > can perhaps be inlined, which is more efficient. > > > > > > > > int glob; > > > > void f() { > > > > if (glob) { > > > > printf("short path\n"); > > > > return; > > > > } > > > > f_part(); > > > > } > > > > > > > > void f_part() { > > > > // do lots of expensive things > > > > // ... > > > > } > > > > > > > > > > > > But on the contrary, gcc splits it like these, which not only does not > > > > bring any benefits, but may increase the time consumption, because the > > > > function call itself is a more resource-intensive thing. > > > > > > > > int glob; > > > > void f() { > > > > if (glob) { > > > > f_part(); > > > > return; > > > > } > > > > // do lots of expensive things > > > > // ... > > > > } > > > > > > > > void f_part() { > > > > printf("short path\n"); // just do this > > > > } > > > > > > > > Are there any options I can offer to gcc to change this behavior? Or > > > > do I need to make some changes in ipa-split.cc? > > > > > > I'd suggest you file a bug to Bugzilla with a specific example that is > > > mis-handled, then we can have a look and discuss what and why happens > > > and what can be done about it. > > > > > > Thanks, > > > > > > Martin > > > > Hi, thanks for your reply. > > > > I'm trying to create an account right now. And I put a copy of the > > example code here in case someone is interested. > > > > And I'm using gcc 12.3.0. When you complie the code below via 'gcc > > test.c -O3 -flto -fdump-tree-fnsplit', you will find a phenomenon that > > is consistent with what I described above in the gimple which is > > dumped from fnsplit. > > I think fnsplit currently splits out _cold_ code, I suppose !opstatus > is predicted to be false most of the time. > > It looks like your intent is to inline this very early check as > > if (!opstatus) { test_split_write_1 (..); } else { test_split_write_2 (..); > } > > to possibly elide that test? I would guess that IPA-CP is supposed to > do this but eventually refuses to create a clone for this case since > it would be large. > > Unfortunately function splitting doesn't run during IPA transforms, > but maybe IPA-CP can be teached how to avoid the expensive clone > by performing what IPA split does in the case a check in the entry > block which splits control flow can be optimized? > > Richard. > > > #include > > #include > > > > int opstatus; > > unsigned char *objcode = 0; > > unsigned long position = 0; > > char *globalfile; > > > > int test_split_write(char *file) { > > FILE *fhd; > > > > if (!opstatus) { > > // short path here > > printf("Object code generation not active! 
Forgot to call " > >"quantum_objcode_start?\n"); > > return 1; > > } > > > > if (!file) > > file = globalfile; > > > > fhd = fopen(file, "w"); > > > > if (fhd == 0) > > return -1; > > > > fwrite(objcode, position, 1, fhd); > > > > fclose(fhd); > > > > int *arr = malloc(1000); > > for (int i = 0; i < 1000; i++) { > > arr[i] = rand(); > > } > > > > return 0; > > } > > > > // to avoid `test_split_write` inlining into main > > void __attribute__((noinline)) call() { test_split_write("./txt"); } > > > > int main() { > > opstatus = rand(); > > objcode = malloc(100); > > position = 0; > > call(); > > return 0; > > }
Why not optimize static local variables?
Hi, I've recently been working on static local variables in C, and I would like to ask a question about them. For example, for the following program:

void foo() {
  static int x = 0;
  x++;
}

int main() {
  foo();
}

After optimization with the -O3 -flto options, the entire program looks something like this:

int main() {
  x_foo++;
}

The question I want to ask is: why is the 'x_foo++' line not optimized out? Its scope is limited to foo, and its value is never read anywhere in the program. Is it because it is stored in the global data area, or are there other correctness concerns?

Thanks
Hanke Zhang
the elimination of if blocks in GCC during if-conversion and vectorization
Hi, I'm recently working on vectorization in GCC. I'm stuck on a small problem and would like to ask for advice.

For example, for the following code:

int main() {
  int size = 1000;
  int *foo = malloc(sizeof(int) * size);
  int c1 = rand(), t1 = rand();

  for (int i = 0; i < size; i++) {
    if (foo[i] & c1) {
      foo[i] = t1;
    }
  }

  // prevents the loop above from being optimized away
  for (int i = 0; i < size; i++) {
    printf("%d", foo[i]);
  }
}

First, the if block in the loop is converted to a MASK_STORE by if-conversion. But after vectorization it still ends up in a branched form. Part of the final disassembly (decompiled with IDA) looks like the code below, and you can see that there is still a branch 'if ( !_ZF )' in it, which leads to low efficiency.

do
{
  while ( 1 )
  {
    __asm
    {
      vpand    ymm0, ymm2, ymmword ptr [rax]
      vpcmpeqd ymm0, ymm0, ymm1
      vpcmpeqd ymm0, ymm0, ymm1
      vptest   ymm0, ymm0
    }
    if ( !_ZF )
      break;
    _RAX += 8;
    if ( _RAX == v9 )
      goto LABEL_5;
  }
  __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
  _RAX += 8;
}
while ( _RAX != v9 );

Why can't we just replace the vptest and the branch with some other instruction such as vpblendvb so that it runs faster? Or is there a good way to do that?

Thanks
Hanke Zhang
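As the replies below explain, the branch is added on purpose to protect the masked store. If in your own code the array is known to be writable (as with the malloc'ed buffer above), one way to avoid MASK_STORE entirely is to make the store unconditional at the source level, so the vectorizer can emit a plain blend plus store. A sketch of that rewrite, wrapped in a helper for illustration:

```
/* Branch-free version of the loop in the question: every element is
   stored back unconditionally, either with its new value or its old
   one, so if-conversion does not need a masked store.  Only valid
   when foo[] is always writable and data races on it do not matter.  */
void flip_matching(int *foo, int size, int c1, int t1) {
  for (int i = 0; i < size; i++) {
    int v = foo[i];
    foo[i] = (v & c1) ? t1 : v;
  }
}
```

This is essentially the speculation that the -fallow-store-data-races discussion later in the thread is about, done by hand in the source.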
Question about gimple code when optimizing if-conversion
Hi, I'm working on optimizing if-conversion for my own purposes recently, and I have hit a problem.

I am trying to handle the case where a conditional block has only an if statement and no else statement. The source C code looks like this:

int* foo; // assume this has been initialized
int c = rand(), t = rand(), size = 1000;
for (int i = 0; i < size; i++) {
  if (foo[i] & (1 << c))
    foo[i] ^= (1 << t);
}

Part of the corresponding gimple just before if-conversion (basic-block numbers and PHI arguments are not shown in this paste):

  # i_71 = PHI
  # ivtmp_9 = PHI
  _5 = (long unsigned int) i_71;
  _6 = _5 * 4;
  _7 = foo_23 + _6;
  _8 = *_7;
  shifttmp_75 = _8 & shifttmp_76;
  if (shifttmp_75 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 531502205]:
  goto ; [100.00%]

  [local count: 531502204]:
  _12 = _8 ^ _11;
  *_7 = _12;

  [local count: 1063004409]:
  i_39 = i_71 + 1;
  ivtmp_73 = ivtmp_9 - 1;
  if (ivtmp_73 != 0)
    goto ; [99.00%]
  else
    goto ; [1.00%]

I want to add some statements to the gimple so that it looks as if the source code had an else block:

// What I expected:
int* foo; // assume this has been initialized
int c = rand(), t = rand(), size = 1000;
for (int i = 0; i < size; i++) {
  if (foo[i] & (1 << c))
    foo[i] ^= (1 << t);
+ else foo[i] = foo[i]; // I want to add a statement here!
}

Of course I can't really change the source code, so I can only add a pass in front of if-conversion to modify the gimple. For the example above, I know I have to add the statement in the empty fall-through block, but what confuses me, due to my limited experience, is what kind of statement I can legally add. I tried adding something like the line below, but that caused a compile error, so I'm asking for help: what kind of statement should I add here?

  [local count: 531502205]:
+ *_7 = *_7
  goto ; [100.00%]

Finally, the reason I'm doing this is to avoid generating MASK_STORE, because it may add an if branch in the final assembly, which I would like to avoid. After such a modification, if-conversion should turn the statement into the form of a ternary expression, which reduces branches after vectorization and produces more efficient code. Or, if there is a better way to get rid of MASK_STORE, please tell me about that. :)

Thanks
Hanke Zhang
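What the question is effectively asking for is a self-store of the already-loaded value on the empty path, i.e. a statement equivalent to `*_7 = _8;`. Below is a very rough sketch of building such a statement with the GIMPLE API; it is illustrative only (the mem-ref tree and the loaded SSA name are taken from the dump above, the function name is made up), and a real pass also has to keep virtual SSA operands up to date, for example by scheduling an SSA update:

```
/* Sketch: in the empty fall-through block, re-store the value that was
   loaded by '_8 = *_7;' so that both paths contain a store to *_7.
   MEM_REF is the *_7 tree and LOADED_VAL is the SSA name _8.  */
static void
insert_self_store (basic_block empty_bb, tree mem_ref, tree loaded_val)
{
  /* Build '*_7 = _8;'.  unshare_expr avoids sharing the memory
     reference tree between the original load and the new store.  */
  gimple *store = gimple_build_assign (unshare_expr (mem_ref), loaded_val);

  gimple_stmt_iterator gsi = gsi_after_labels (empty_bb);
  gsi_insert_before (&gsi, store, GSI_SAME_STMT);

  /* The new store needs a virtual definition; the simplest route is to
     have the pass return TODO_update_ssa so virtual operands are
     recomputed after the transformation.  */
}
```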
Check whether a function is a pure function
Hi, I'm trying to write a pass to erase some useless functions, or to put it another way, to detect whether a function is pure or not. (Of course I know some passes can do this cleanup after inlining.)

Here is my question: if a function satisfies the following points, can it be considered safe to delete?

1. the return value is ignored, or there is no return value
2. the parameters are plain 32-bit integer constants, or there are no parameters
3. no use of global variables
4. no function calls, assembly code, or goto statements
5. no casts

And in my pass I try to check these points as follows:

1. check the GIMPLE_RETURN statement and check all the call sites of the function
2. check DECL_PARAM(fndecl) and check the actual arguments at all call sites
3. check that fndecl is not in the global_var->referring list
4. check that GIMPLE_CALL, GIMPLE_ASM and GIMPLE_GOTO are not present
5. check that there is no difference between the TREE_TYPE(lhs) and TREE_TYPE(rhs1) types in GIMPLE_ASSIGN statements

I would like to ask whether there are any omissions or errors in the prerequisites and the corresponding implementation plan listed above.

Thanks
Hanke Zhang
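GCC's existing ipa-pure-const pass (and its per-function "local-pure-const" variant) already discovers pure/const functions by scanning for exactly this kind of thing: memory writes, calls, volatile accesses and so on, so it is worth comparing against it before rolling a new analysis. Below is a toy sketch of the statement-kind screening corresponding to point 4 above, assuming the usual GCC pass headers; real purity analysis also has to look at loads, stores and what callees do:

```
/* Toy screen for point 4: reject functions that contain calls, inline
   asm or goto statements.  Necessary but far from sufficient for
   purity; see ipa-pure-const for the real analysis.  */
static bool
has_disqualifying_stmt (function *fn)
{
  basic_block bb;
  FOR_EACH_BB_FN (bb, fn)
    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
         gsi_next (&gsi))
      switch (gimple_code (gsi_stmt (gsi)))
        {
        case GIMPLE_CALL:
        case GIMPLE_ASM:
        case GIMPLE_GOTO:
          return true;
        default:
          break;
        }
  return false;
}
```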
Re: the elimination of if blocks in GCC during if-conversion and vectorization
Richard Biener 于2023年10月17日周二 17:26写道: > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc wrote: > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > problem and would like to ask for advice. > > > > For example, for the following code: > > > > int main() { > > int size = 1000; > > int *foo = malloc(sizeof(int) * size); > > int c1 = rand(), t1 = rand(); > > > > for (int i = 0; i < size; i++) { > > if (foo[i] & c1) { > > foo[i] = t1; > > } > > } > > > > // prevents the loop above from being optimized > > for (int i = 0; i < size; i++) { > > printf("%d", foo[i]); > > } > > } > > > > First of all, the if statement block in the loop will be converted to > > a MASK_STORE through if-conversion optimization. But after > > tree-vector, it will still become a branched form. The part of the > > final disassembly structure probably looks like below(Using IDA to do > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > in it, which will lead to low efficiency. > > > > do > > { > > while ( 1 ) > > { > > __asm > > { > > vpand ymm0, ymm2, ymmword ptr [rax] > > vpcmpeqd ymm0, ymm0, ymm1 > > vpcmpeqd ymm0, ymm0, ymm1 > > vptest ymm0, ymm0 > > } > > if ( !_ZF ) > > break; > > _RAX += 8; > > if ( _RAX == v9 ) > > goto LABEL_5; > > } > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > _RAX += 8; > > } > > while ( _RAX != v9 ); > > > > Why can't we just replace the vptest and if statement with some other > > instructions like vpblendvb so that it can be faster? Or is there a > > good way to do that? > > The branch is added by optimize_mask_stores after vectorization because > fully masked (disabled) masked stores can incur a quite heavy penalty on > some architectures when fault assists (read-only pages, but also COW pages) > are ran into. All the microcode handling needs to possibly be carried out > multiple times, for each such access to the same page. That can cause > a 1000x slowdown when you hit this case. Thus every masked store > is replaced by > > if (mask != 0) >masked_store (); > > and this is an optimization (which itself has a small cost). > > Richard. Yeah, I know that and I have seen the code of optimize_mask_store(). And the main problem here is that when multiple MASK_STORE appear in the same loop, many branches will appear, resulting in a decrease in overall efficiency. And my original idea is that why can't we replace MASK_STORE with more effective SIMD instructions because icc can do much better in this case. Then I give it up, because the ability to analyze vectorization of gcc is not as good as icc and my ability does not support me modifying this part of the code. Thanks very much for your reply. > > > > > Thanks > > Hanke Zhang
Re: the elimination of if blocks in GCC during if-conversion and vectorization
Hi Richard I get it, thank you again. And I got another problem, so I'd like ask it by the way. Can the left shift of the induction variable in a loop be optimized as a constant? Like the code below: int ans = 0; int width = rand() % 16; for (int j = 0; j < width; j++) ans += 1 << (j + width) into: int width = rand() % 16; ans = (1 << (2 * width) - (1 << width)); I came across a more complex version of that and found that gcc doesn't seem to handle it, so wanted to write a pass myself to optimize it. I got two questions here. Does GCC have such optimizations? If I want to do my own optimization, where should I put it? Put it behind the pass_iv_optimize? Thanks Hanke Zhang Richard Biener 于2023年10月17日周二 20:00写道: > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang wrote: > > > > Richard Biener 于2023年10月17日周二 17:26写道: > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc > > > wrote: > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > problem and would like to ask for advice. > > > > > > > > For example, for the following code: > > > > > > > > int main() { > > > > int size = 1000; > > > > int *foo = malloc(sizeof(int) * size); > > > > int c1 = rand(), t1 = rand(); > > > > > > > > for (int i = 0; i < size; i++) { > > > > if (foo[i] & c1) { > > > > foo[i] = t1; > > > > } > > > > } > > > > > > > > // prevents the loop above from being optimized > > > > for (int i = 0; i < size; i++) { > > > > printf("%d", foo[i]); > > > > } > > > > } > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > a MASK_STORE through if-conversion optimization. But after > > > > tree-vector, it will still become a branched form. The part of the > > > > final disassembly structure probably looks like below(Using IDA to do > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > in it, which will lead to low efficiency. > > > > > > > > do > > > > { > > > > while ( 1 ) > > > > { > > > > __asm > > > > { > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > vptest ymm0, ymm0 > > > > } > > > > if ( !_ZF ) > > > > break; > > > > _RAX += 8; > > > > if ( _RAX == v9 ) > > > > goto LABEL_5; > > > > } > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > _RAX += 8; > > > > } > > > > while ( _RAX != v9 ); > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > good way to do that? > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > some architectures when fault assists (read-only pages, but also COW > > > pages) > > > are ran into. All the microcode handling needs to possibly be carried out > > > multiple times, for each such access to the same page. That can cause > > > a 1000x slowdown when you hit this case. Thus every masked store > > > is replaced by > > > > > > if (mask != 0) > > >masked_store (); > > > > > > and this is an optimization (which itself has a small cost). > > > > > > Richard. > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > And the main problem here is that when multiple MASK_STORE appear in > > the same loop, many branches will appear, resulting in a decrease in > > overall efficiency. 
> > > > And my original idea is that why can't we replace MASK_STORE with more > > effective SIMD instructions because icc can do much better in this > > case. > > ICC probably doesn't care for the case where foo[] isn't writable. In > fact for the case at hand we see it comes from malloc() which we > can assume to return writable memory I guess. That means if-conversion > can treat the unconditional read as a way to also allow to speculate > the write (with -fallow-store-data-races). > > Note there's currently no pointer analysis that tracks writability. > > > Then I give it up, because the ability to analyze vectorization > > of gcc is not as good as icc and my ability does not support me > > modifying this part of the code. > > > > Thanks very much for your reply. > > You're welcome. > > Richard. > > > > > > > > > > > > Thanks > > > > Hanke Zhang
Question about the pass_fre about merging if blocks.
Hi, I'm trying to make pass_fre work better for me, but I have hit a problem. Consider the code below:

int glob;

void __attribute__((noinline))
foo() {
  // do nothing meaningful
}

int main() {
  if (glob) {
    foo();
  } else {
    // do something that won't change glob
  }

  if (glob) {
    foo();
  } else {
    // do something that won't change glob
  }
}

I'm trying to merge the two if blocks, and of course that does not happen. I see the code in tree-ssa-structalias.cc that does this check:

static bool
pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) {
  // ...
  if (pt->nonlocal
      && is_global_var (decl))
    return true;
  // ...
}

So pt->nonlocal prevents the merge. I would like to ask whether I can make this check less strict by providing more information about the function foo(). Obviously foo does not change the value of glob here; the problem is only that it is noinline.

So if I can prove that foo won't change glob and pass this information to the check, my goal would be achieved. Is this possible?

Thanks
Hanke Zhang
Re: Question about the pass_fre about merging if blocks.
Richard Biener 于2023年10月20日周五 21:33写道: > > On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc wrote: > > > > Hi, I'm trying to make pass_fre work better for me. But I got a > > problem. Like the code below: > > > > int glob; > > > > void __attribute__((oninline)) > > foo() { > > // do nothing meaningful > > } > > > > int main() { > > if (glob) { > > foo(); > > } else { > > // do something that won't change glob > > } > > > > if (glob) { > > foo(); > > } else { > > // do something that won't change glob > > } > > } > > > > I'm trying to merge the two if blocks. And of course it won't work. I > > see the code in tree-ssa-structalias.cc, it will do this check: > > > > static bool > > pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) { > > // > > if (pt->nonlocal > > && is_global_var (decl)) > > return true; > > // > > } > > > > So the pt->nonlocal will prevent the merge. I would like to ask if I > > can make this check less rigid by providing more information about > > function foo(). Because obviously the foo function does not change the > > value of glob here, but it is caused by the fact that it is noinline. > > > > So if I can prove that foo won't change glob and pass this infomation > > to this check, my goal was achieved. Is this possible? > > In case 'glob' were static IPA reference has this info and we'd already > use it from call_may_clobber_ref_p. There's IPA mod-ref which is > a bit more powerful than IPA reference but I think it does not record > this precise info. > > Note that the problem with non-static globals is that we do not know > whether they get modified indirectly because they might have their > address take in another TU and that address passed back into the TU. > Usually using LTO helps here and we can promote the decl to hidden > visibility. Hi Richard: Thanks for your replying. Yeah, I know that. (We do not know whether they get modified indirectly because they might have their address take in another TU and that address passed back into the TU.) But I think in my case, I think I have to let go of some security concerns. And LTO is enabled already in my case, but that doesn't help. So I can understand that the information includes whether Foo uses glob directly is actually used, but for security reasons, GCC does not use it as a basis here (pt_solution_includes_1). Right? So if I want to change the default behavior of GCC, can I use call_may_clobber_ref_p to check out and make it merge here? Thanks Hanke Zhang > > Richard. > > > > > Thanks > > Hanke Zhang
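A quick way to see the difference Richard describes is to give the compiler a variable it can fully reason about, either by making it static or by limiting its visibility, so that IPA reference can prove foo() never touches it. A sketch of the same test with only the declarations changed (glob2 and its attribute are illustrative):

```
/* With 'static', the variable cannot be referenced from another TU, so
   IPA reference can record that foo() neither reads nor writes it, and
   FRE can reuse the first load of glob across the calls.  */
static int glob;

void __attribute__((noinline)) foo(void) {
  /* does nothing meaningful, and in particular never touches glob */
}

/* Alternative for a variable that must stay extern-visible: restrict
   its visibility so LTO can treat it much like a local definition.  */
int glob2 __attribute__((visibility("hidden")));
```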
Re: Question about the pass_fre about merging if blocks.
Richard Biener 于2023年10月20日周五 23:27写道: > > > > > Am 20.10.2023 um 16:33 schrieb Hanke Zhang : > > > > Richard Biener 于2023年10月20日周五 21:33写道: > >> > >>> On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc > >>> wrote: > >>> > >>> Hi, I'm trying to make pass_fre work better for me. But I got a > >>> problem. Like the code below: > >>> > >>> int glob; > >>> > >>> void __attribute__((oninline)) > >>> foo() { > >>> // do nothing meaningful > >>> } > >>> > >>> int main() { > >>> if (glob) { > >>>foo(); > >>> } else { > >>>// do something that won't change glob > >>> } > >>> > >>> if (glob) { > >>>foo(); > >>> } else { > >>>// do something that won't change glob > >>> } > >>> } > >>> > >>> I'm trying to merge the two if blocks. And of course it won't work. I > >>> see the code in tree-ssa-structalias.cc, it will do this check: > >>> > >>> static bool > >>> pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) { > >>> // > >>> if (pt->nonlocal > >>> && is_global_var (decl)) > >>>return true; > >>> // > >>> } > >>> > >>> So the pt->nonlocal will prevent the merge. I would like to ask if I > >>> can make this check less rigid by providing more information about > >>> function foo(). Because obviously the foo function does not change the > >>> value of glob here, but it is caused by the fact that it is noinline. > >>> > >>> So if I can prove that foo won't change glob and pass this infomation > >>> to this check, my goal was achieved. Is this possible? > >> > >> In case 'glob' were static IPA reference has this info and we'd already > >> use it from call_may_clobber_ref_p. There's IPA mod-ref which is > >> a bit more powerful than IPA reference but I think it does not record > >> this precise info. > >> > >> Note that the problem with non-static globals is that we do not know > >> whether they get modified indirectly because they might have their > >> address take in another TU and that address passed back into the TU. > >> Usually using LTO helps here and we can promote the decl to hidden > >> visibility. > > > > Hi Richard: > > > > Thanks for your replying. > > > > Yeah, I know that. (We do not know whether they get modified > > indirectly because they might have their address take in another TU > > and that address passed back into the TU.) > > > > But I think in my case, I think I have to let go of some security > > concerns. And LTO is enabled already in my case, but that doesn't > > help. > > It’s not security it’s correctness. > > > > > So I can understand that the information includes whether Foo uses > > glob directly is actually used, but for security reasons, GCC does not > > use it as a basis here (pt_solution_includes_1). Right? > > I don’t think this function is the correct place for the fix. I’d instead > put it into … > > > So if I want to change the default behavior of GCC, can I use > > call_may_clobber_ref_p to check out and > > … this function where it would be more targeted. > > As said, you are going to miscompile valid code. Note it should be possible > to enhance IPA reference dependent on how foo actually looks like (like if > there are no pointer based accesses in it). Hi Richard: Thanks. I understand it now. But I got another problem while handling this. I create new nodes from existing nodes in my own passes which are located at all_late_ipa_passes. So when I tried to get the ipa_reference_optimization_summary in pass_fre, I got nothing because old nodes were replaced by the new nodes I created before. 
Is there a way that I can get the ipa_reference_summay via my new nodes? The main code of getting ipa_reference_summary is here: written = ipa_reference_get_written_global (node); Or is it because I'm replacing nodes in the wrong way? Thanks Hanke Zhang > > Richard. > > > make it merge here? > > > > Thanks > > Hanke Zhang > > > >> > >> Richard. > >> > >>> > >>> Thanks > >>> Hanke Zhang
Re: Question about the pass_fre about merging if blocks.
Richard Biener 于2023年10月21日周六 18:23写道: > > > > > Am 21.10.2023 um 10:58 schrieb Hanke Zhang : > > > > Richard Biener 于2023年10月20日周五 23:27写道: > >> > >> > >> > >>>> Am 20.10.2023 um 16:33 schrieb Hanke Zhang : > >>> > >>> Richard Biener 于2023年10月20日周五 21:33写道: > >>>> > >>>>> On Fri, Oct 20, 2023 at 1:48 PM Hanke Zhang via Gcc > >>>>> wrote: > >>>>> > >>>>> Hi, I'm trying to make pass_fre work better for me. But I got a > >>>>> problem. Like the code below: > >>>>> > >>>>> int glob; > >>>>> > >>>>> void __attribute__((oninline)) > >>>>> foo() { > >>>>> // do nothing meaningful > >>>>> } > >>>>> > >>>>> int main() { > >>>>> if (glob) { > >>>>> foo(); > >>>>> } else { > >>>>> // do something that won't change glob > >>>>> } > >>>>> > >>>>> if (glob) { > >>>>> foo(); > >>>>> } else { > >>>>> // do something that won't change glob > >>>>> } > >>>>> } > >>>>> > >>>>> I'm trying to merge the two if blocks. And of course it won't work. I > >>>>> see the code in tree-ssa-structalias.cc, it will do this check: > >>>>> > >>>>> static bool > >>>>> pt_solution_includes_1 (struct pt_solution *pt, const_tree decl) { > >>>>> // > >>>>> if (pt->nonlocal > >>>>> && is_global_var (decl)) > >>>>> return true; > >>>>> // > >>>>> } > >>>>> > >>>>> So the pt->nonlocal will prevent the merge. I would like to ask if I > >>>>> can make this check less rigid by providing more information about > >>>>> function foo(). Because obviously the foo function does not change the > >>>>> value of glob here, but it is caused by the fact that it is noinline. > >>>>> > >>>>> So if I can prove that foo won't change glob and pass this infomation > >>>>> to this check, my goal was achieved. Is this possible? > >>>> > >>>> In case 'glob' were static IPA reference has this info and we'd already > >>>> use it from call_may_clobber_ref_p. There's IPA mod-ref which is > >>>> a bit more powerful than IPA reference but I think it does not record > >>>> this precise info. > >>>> > >>>> Note that the problem with non-static globals is that we do not know > >>>> whether they get modified indirectly because they might have their > >>>> address take in another TU and that address passed back into the TU. > >>>> Usually using LTO helps here and we can promote the decl to hidden > >>>> visibility. > >>> > >>> Hi Richard: > >>> > >>> Thanks for your replying. > >>> > >>> Yeah, I know that. (We do not know whether they get modified > >>> indirectly because they might have their address take in another TU > >>> and that address passed back into the TU.) > >>> > >>> But I think in my case, I think I have to let go of some security > >>> concerns. And LTO is enabled already in my case, but that doesn't > >>> help. > >> > >> It’s not security it’s correctness. > >> > >>> > >>> So I can understand that the information includes whether Foo uses > >>> glob directly is actually used, but for security reasons, GCC does not > >>> use it as a basis here (pt_solution_includes_1). Right? > >> > >> I don’t think this function is the correct place for the fix. I’d instead > >> put it into … > >> > >>> So if I want to change the default behavior of GCC, can I use > >>> call_may_clobber_ref_p to check out and > >> > >> … this function where it would be more targeted. > >> > >> As said, you are going to miscompile valid code. Note it should be > >> possible to enhance IPA reference dependent on how foo actually looks like > >> (like if there are no pointer based accesse
Re: the elimination of if blocks in GCC during if-conversion and vectorization
Hi Richard: Thanks for your advice. But when I try a simpler example like the one below before looking at the code, GCC still does nothing. int main() { int width; scanf("%d", &width); int sum = 0; for (int i = 0; i < width; i++) sum += i; printf("%d\n", sum); } I tried O3 and LTO, but still the same. So I'd like to ask why, or am I doing something wrong? Thanks Hanke Zhang Richard Biener 于2023年10月19日周四 20:00写道: > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang wrote: > > > > Hi Richard > > I get it, thank you again. > > > > And I got another problem, so I'd like ask it by the way. Can the left > > shift of the induction variable in a loop be optimized as a constant? > > Like the code below: > > > > int ans = 0; > > int width = rand() % 16; > > for (int j = 0; j < width; j++) > > ans += 1 << (j + width) > > > > into: > > > > int width = rand() % 16; > > ans = (1 << (2 * width) - (1 << width)); > > > > I came across a more complex version of that and found that gcc > > doesn't seem to handle it, so wanted to write a pass myself to > > optimize it. > > > > I got two questions here. Does GCC have such optimizations? If I want > > to do my own optimization, where should I put it? Put it behind the > > pass_iv_optimize? > > GCC has the final value replacement pass (pass_scev_cprop) doing these > kind of transforms. Since 'ans' does not have an affine evolution this > case would need to be pattern matched (there are some existing pattern > matchings in the pass). > > > Thanks > > Hanke Zhang > > > > Richard Biener 于2023年10月17日周二 20:00写道: > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang wrote: > > > > > > > > Richard Biener 于2023年10月17日周二 17:26写道: > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc > > > > > wrote: > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a > > > > > > small > > > > > > problem and would like to ask for advice. > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > int main() { > > > > > > int size = 1000; > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > if (foo[i] & c1) { > > > > > > foo[i] = t1; > > > > > > } > > > > > > } > > > > > > > > > > > > // prevents the loop above from being optimized > > > > > > for (int i = 0; i < size; i++) { > > > > > > printf("%d", foo[i]); > > > > > > } > > > > > > } > > > > > > > > > > > > First of all, the if statement block in the loop will be converted > > > > > > to > > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > > tree-vector, it will still become a branched form. The part of the > > > > > > final disassembly structure probably looks like below(Using IDA to > > > > > > do > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF > > > > > > )' > > > > > > in it, which will lead to low efficiency. > > > > > > > > > > > > do > > > > > > { > > > > > > while ( 1 ) > > > > > > { > > > > > > __asm > > > > > > { > > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > vptest ymm0, ymm0 > > > > > > } > > > > > > if ( !_ZF ) > > > > > > break; > > > > > > _RAX += 8; > > > > > > if ( _RAX == v9 ) > > > > > > goto LABEL_5; > > > > > > } > > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > > _RAX += 8;
Re: the elimination of if blocks in GCC during if-conversion and vectorization
Richard Biener 于2023年10月23日周一 20:32写道: > > On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang wrote: > > > > Hi Richard: > > > > Thanks for your advice. But when I try a simpler example like the one > > below before looking at the code, GCC still does nothing. > > > > int main() { > > int width; > > scanf("%d", &width); > > int sum = 0; > > for (int i = 0; i < width; i++) sum += i; > > printf("%d\n", sum); > > } > > > > I tried O3 and LTO, but still the same. So I'd like to ask why, or am > > I doing something wrong? > > -fdump-tree-sccp-details-scev reveals > > (set_scalar_evolution > instantiated_below = 5 > (scalar = sum_9) > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > ) > (chrec_apply > (varying_loop = 1) > (chrec = {0, +, {1, +, 1}_1}_1) > (x = (unsigned int) width.0_12 + 4294967295) > (res = scev_not_known)) > > so we don't know how to apply a variable number of iterations to > the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not > know how to compute the final value of the reduction. > > For a constant, say width == 100 we do: > > (set_scalar_evolution > instantiated_below = 2 > (scalar = sum_6) > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > ) > (chrec_apply > (varying_loop = 1) > (chrec = {0, +, {1, +, 1}_1}_1) > (x = 99) > (res = 4950)) Yeah, I also found this result in previous experiments. But what confuses me is that if the 'width' can't be inferred to INTEGER_CST, there's nothing we can do, right? Because in my case, the variables corresponding to 'width' are almost all undetermined values, such as 'width = rand()'. That said, I can hardly get any optimizations in my cases, right? Thanks Hanke Zhang > > Richard. > > > > > Thanks > > Hanke Zhang > > > > Richard Biener 于2023年10月19日周四 20:00写道: > > > > > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang wrote: > > > > > > > > Hi Richard > > > > I get it, thank you again. > > > > > > > > And I got another problem, so I'd like ask it by the way. Can the left > > > > shift of the induction variable in a loop be optimized as a constant? > > > > Like the code below: > > > > > > > > int ans = 0; > > > > int width = rand() % 16; > > > > for (int j = 0; j < width; j++) > > > > ans += 1 << (j + width) > > > > > > > > into: > > > > > > > > int width = rand() % 16; > > > > ans = (1 << (2 * width) - (1 << width)); > > > > > > > > I came across a more complex version of that and found that gcc > > > > doesn't seem to handle it, so wanted to write a pass myself to > > > > optimize it. > > > > > > > > I got two questions here. Does GCC have such optimizations? If I want > > > > to do my own optimization, where should I put it? Put it behind the > > > > pass_iv_optimize? > > > > > > GCC has the final value replacement pass (pass_scev_cprop) doing these > > > kind of transforms. Since 'ans' does not have an affine evolution this > > > case would need to be pattern matched (there are some existing pattern > > > matchings in the pass). > > > > > > > Thanks > > > > Hanke Zhang > > > > > > > > Richard Biener 于2023年10月17日周二 20:00写道: > > > > > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang > > > > > wrote: > > > > > > > > > > > > Richard Biener 于2023年10月17日周二 17:26写道: > > > > > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc > > > > > > > wrote: > > > > > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in > > > > > > > > a small > > > > > > > > problem and would like to ask for advice. 
> > > > > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > > > > > int main() { > > > > > > > > int size = 1000; > > > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > > > int c1 = rand(), t1 = rand(); > > > > > > >
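For the record, the replacement that a symbolic final-value substitution would have to produce for the earlier scanf-based test case is just the triangular-number formula. A hand-written sketch (ignoring the signed-overflow caveat that the original loop shares):

```c
#include <stdio.h>

int main (void)
{
  int width;
  if (scanf ("%d", &width) != 1)
    return 1;
  /* sum_{i=0}^{width-1} i = width * (width - 1) / 2,
     and 0 when the loop does not execute (width <= 0).  */
  int sum = width > 0 ? width * (width - 1) / 2 : 0;
  printf ("%d\n", sum);
  return 0;
}
```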
Question on replacing .MASK_STORE with ternary expressions
Hi, I got a gimple relative question. I'm trying to replace the .MASK_STORE with a ternary expression in pass_ifcvt like below: .MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr; But when I do this, I'm stucked. The addr here is a SSA_NAME of course. When I try to get the value that 'addr' points to through a SSA_NAME as lhs like the code below, GCC reports an error. // get MASK_STORE gimple_stmt_iterator gsi = gsi_for_stmt (stmt); tree addr = gimple_call_arg (stmt, 0); tree mask = gimple_call_arg (stmt, 2); tree rhs = gimple_call_arg (stmt, 3); // deref_ssa = *addr tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); gimple *stmt1 = gimple_build_assign(deref_ssa, deref); // ssa1 = mask ? deref_ssa : rhs tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa); tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); gimple *stmt2 = gimple_build_assign(ssa1, cond); // *addr = ssa1 tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1); // insert stmts gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT); gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT); gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT); // remove the origin MASK_STORE gsi_remove (&gsi, true); The error: _588 = *_589; _587 = _136 ? _272 : _588; *_589 = _587; unhandled expression in get_expr_operands(): unit-size align:32 warn_if_not_align:0 symtab:0 alias-set 8 canonical-type 0x7f25f57a12a0 precision:32 pointer_to_this > arg:0 public unsigned DI size unit-size align:64 warn_if_not_align:0 symtab:0 alias-set 20 structural-equality> def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>; version:589 ptr-info 0x7f25f5716f18>> Is there a way to fix this? Thanks Hanke Zhang
Re: Question on replacing .MASK_STORE with ternary expressions
Hi Richard: I think I need to get the value that 'addr' points to, is MEM_REF really right? And I try to replace INDIRECT_REF with MEM_REF like this: tree deref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr); // ... tree indirect_ref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr); And the error is different, I think that means the type is not fit. 0x7afca4 build1(tree_code, tree_node*, tree_node*) ../../gcc/tree.cc:4968 0xe8292b replace_mask_store() ../../gcc/tree-if-conv.cc:3486 0xe8a03d execute ../../gcc/tree-if-conv.cc:3563 0xe8a03d execute ../../gcc/tree-if-conv.cc:3549 Richard Biener 于2023年10月25日周三 23:40写道: > > > > > Am 25.10.2023 um 17:25 schrieb Hanke Zhang via Gcc : > > > > Hi, I got a gimple relative question. > > > > I'm trying to replace the .MASK_STORE with a ternary expression in > > pass_ifcvt like below: > > > > .MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr; > > > > But when I do this, I'm stucked. The addr here is a SSA_NAME of > > course. When I try to get the value that 'addr' points to through a > > SSA_NAME as lhs like the code below, GCC reports an error. > > > > // get MASK_STORE > > gimple_stmt_iterator gsi = gsi_for_stmt (stmt); > > tree addr = gimple_call_arg (stmt, 0); > > tree mask = gimple_call_arg (stmt, 2); > > tree rhs = gimple_call_arg (stmt, 3); > > > > // deref_ssa = *addr > > tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); > > tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); > > gimple *stmt1 = gimple_build_assign(deref_ssa, deref); > > > > // ssa1 = mask ? deref_ssa : rhs > > tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa); > > tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); > > gimple *stmt2 = gimple_build_assign(ssa1, cond); > > > > // *addr = ssa1 > > tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); > > gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1); > > > > // insert stmts > > gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT); > > gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT); > > gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT); > > > > // remove the origin MASK_STORE > > gsi_remove (&gsi, true); > > > > The error: > > > > _588 = *_589; > > _587 = _136 ? _272 : _588; > > *_589 = _587; > > unhandled expression in get_expr_operands(): > > >type >size > >unit-size > >align:32 warn_if_not_align:0 symtab:0 alias-set 8 > > canonical-type 0x7f25f57a12a0 precision:32 > >pointer_to_this > > > > >arg:0 >type > float> > >public unsigned DI > >size > >unit-size > >align:64 warn_if_not_align:0 symtab:0 alias-set 20 > > structural-equality> > > > >def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>; > >version:589 > >ptr-info 0x7f25f5716f18>> > > > > Is there a way to fix this? > > You need to use a MEM_REF , not an INDIRECT_REF > > > Thanks > > Hanke Zhang
Re: Question on replacing .MASK_STORE with ternary expressions
Hi RIchard: Sorry about the previous mail. I made a mistake that I forgot the MEM_REF requires two parameters. After changing to the code here, everything is fine. tree deref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr, build_int_cst (TREE_TYPE (addr), 0)); // ... tree indirect_ref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr, build_int_cst (TREE_TYPE (addr), 0)); Thanks so much for your advice! That helps me a lot! :) Thanks Hanke Zhang Hanke Zhang 于2023年10月26日周四 00:09写道: > > Hi Richard: > > I think I need to get the value that 'addr' points to, is MEM_REF really > right? > > And I try to replace INDIRECT_REF with MEM_REF like this: > > tree deref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr); > // ... > tree indirect_ref = build1 (MEM_REF , TREE_TYPE (deref_ssa), addr); > > And the error is different, I think that means the type is not fit. > > 0x7afca4 build1(tree_code, tree_node*, tree_node*) > ../../gcc/tree.cc:4968 > 0xe8292b replace_mask_store() > ../../gcc/tree-if-conv.cc:3486 > 0xe8a03d execute > ../../gcc/tree-if-conv.cc:3563 > 0xe8a03d execute > ../../gcc/tree-if-conv.cc:3549 > > Richard Biener 于2023年10月25日周三 23:40写道: > > > > > > > > > Am 25.10.2023 um 17:25 schrieb Hanke Zhang via Gcc : > > > > > > Hi, I got a gimple relative question. > > > > > > I'm trying to replace the .MASK_STORE with a ternary expression in > > > pass_ifcvt like below: > > > > > > .MASK_STORE(addr, align, mask, value) => *addr = mask ? value : *addr; > > > > > > But when I do this, I'm stucked. The addr here is a SSA_NAME of > > > course. When I try to get the value that 'addr' points to through a > > > SSA_NAME as lhs like the code below, GCC reports an error. > > > > > > // get MASK_STORE > > > gimple_stmt_iterator gsi = gsi_for_stmt (stmt); > > > tree addr = gimple_call_arg (stmt, 0); > > > tree mask = gimple_call_arg (stmt, 2); > > > tree rhs = gimple_call_arg (stmt, 3); > > > > > > // deref_ssa = *addr > > > tree deref_ssa = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); > > > tree deref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); > > > gimple *stmt1 = gimple_build_assign(deref_ssa, deref); > > > > > > // ssa1 = mask ? deref_ssa : rhs > > > tree cond = build3 (COND_EXPR, TREE_TYPE (rhs), mask, rhs, deref_ssa); > > > tree ssa1 = create_tmp_reg_or_ssa_name (TREE_TYPE (rhs), NULL); > > > gimple *stmt2 = gimple_build_assign(ssa1, cond); > > > > > > // *addr = ssa1 > > > tree indirect_ref = build1 (INDIRECT_REF, TREE_TYPE (deref_ssa), addr); > > > gimple *stmt3 = gimple_build_assign(indirect_ref, ssa1); > > > > > > // insert stmts > > > gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT); > > > gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT); > > > gsi_insert_before (&gsi, stmt3, GSI_SAME_STMT); > > > > > > // remove the origin MASK_STORE > > > gsi_remove (&gsi, true); > > > > > > The error: > > > > > > _588 = *_589; > > > _587 = _136 ? _272 : _588; > > > *_589 = _587; > > > unhandled expression in get_expr_operands(): > > > > >type > >size > > >unit-size > > >align:32 warn_if_not_align:0 symtab:0 alias-set 8 > > > canonical-type 0x7f25f57a12a0 precision:32 > > >pointer_to_this > > > > > > >arg:0 > >type > > float> > > >public unsigned DI > > >size > > >unit-size > > >align:64 warn_if_not_align:0 symtab:0 alias-set 20 > > > structural-equality> > > > > > >def_stmt _589 = &IMAGPART_EXPR <_452->amplitude>; > > >version:589 > > >ptr-info 0x7f25f5716f18>> > > > > > > Is there a way to fix this? > > > > You need to use a MEM_REF , not an INDIRECT_REF > > > > > Thanks > > > Hanke Zhang
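A condensed sketch of the working construction, reusing the `deref_ssa` and `addr` trees from the mails above: MEM_REF carries two operands (the dereferenced pointer and a pointer-typed constant offset, which also drives alias analysis), so the matching constructor is build2.

```c
/* Sketch only; deref_ssa and addr are the trees from the snippet above.  */
tree deref = build2 (MEM_REF, TREE_TYPE (deref_ssa), addr,
                     build_int_cst (TREE_TYPE (addr), 0));
/* ... */
tree store_ref = build2 (MEM_REF, TREE_TYPE (deref_ssa), addr,
                         build_int_cst (TREE_TYPE (addr), 0));
```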
Question about function Split with va_args
Hi, I've been working on function splitting recently, and I've noticed that functions with va_args arguments won't be split. Why is that? I tried to understand the comments in the source code, but I still don't get the specific reason. At the same time, if I do want to split functions with va_args arguments when it is safe to do so, how do I define "safe" here? In other words, how can I tell whether a function with va_args can be split, or is it never possible? Thanks Hanke Zhang

Question about the changing the default behavior of ipa-inline
Hi, I recently ran into an inline-related issue and would like to ask about it. This is about ipa-inline. I'd like my function to be inlined, but it gets rejected in the function 'want_inline_small_function_p', more specifically in the conditional statement 'growth_positive_p (callee, e, growth)'. The offline size of the function is small and the growth is much bigger, but I still want it to be inlined. Besides, the reason given for the failure in the report file was CIF_MAX_INLINE_INSNS_AUTO_LIMIT, and I don't seem to be able to fix that by modifying the value of that parameter. Is there another way to solve it, or can I only change the source code to make it happen? Thanks Hanke Zhang
Question about vectorization optimization during RTL-PASS
Hi, I've been working on vectorization-related optimization lately. GCC seems to have some optimization vulnerabilities. I would like to ask if it can be solved. For example, for the following program using AVX2: #include // reg->node2[i].state is an unsigned long long variable // reg->size is an integer variable that represents the iterations for (int i = 0; i < reg->size; i+=4) { /* original code: unsigned long long state = reg->node2[i].state; if (state & (1LLU << j + 1 | 1LLU << width + j)) state ^= (1LLU << j); state ^= (1LLU << width + j); */ __m256i state = _mm256_loadu_si256((__m256i *)((char*)(reg->node2) + i * sizeof(unsigned long long))); __m256i mask1 = _mm256_set1_epi64x(1LLU << j + 1 | 1LLU << width + j); // cmp __m256i tmp1 = _mm256_and_si256(state, mask1); __m256i cmp1 = _mm256_cmpeq_epi64(tmp1, mask1); // xor __m256i xor_param = _mm256_set1_epi64x(1LLU << j); __m256i tmp2 = _mm256_and_si256(xor_param, cmp1); __m256i xor_result = _mm256_xor_si256(state, tmp2); // xor __m256i xor_param2 = _mm256_set1_epi64x(1LLU << width + j); __m256i xor_res2 = _mm256_xor_si256(xor_result, xor_param2); _mm256_storeu_si256((__m256i *)((char*)(reg->node2) + i * sizeof(unsigned long long)), xor_res2); } My expectation is to generate assembly code like this: vpxor ymm6, ymm2, ymmword ptr [r9+r15*8] vpand ymm4, ymm1, ymm6 vpcmpeqq ymm5, ymm4, ymm1 vpand ymm7, ymm3, ymm5 vpxor ymm8, ymm6, ymm7 vmovdqu ymmword ptr [r9+r15*8], ymm8 But the actual generated assembly code looks like this: vpand ymm0, ymm2, ymmword ptr [rsi+rax*8] vpxor ymm1, ymm4, ymmword ptr [rsi+rax*8] vpcmpeqq ymm0, ymm0, ymm2 vpand ymm0, ymm0, ymm5 vpxor ymm0, ymm0, ymm1 vmovdqu ymmword ptr [rsi+rax*8], ymm0 That is, GCC has advanced the second XOR operation, and at the same time has an additional address fetch operation (ymmword ptr [rsi+rax*8]), which I think may lead to a decrease in efficiency, and I also found that this instruction accounts for a large proportion when I use perf. At the same time, I found that these operations are performed on RTL-PASS through dump-related files, and they don't seem to be easy to change. Is there a good way to get it to generate the assembly code I want? Is it possible to modify my own source files or GCC source code to get that?
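One source-level variant that might be worth trying (not verified to change the generated code): since XOR is associative, the two XOR operands can be merged before `state` is used, so the stored value depends on a single use of the loaded vector. The names are reused from the loop above:

```c
__m256i tmp2     = _mm256_and_si256 (xor_param, cmp1);
/* state ^ tmp2 ^ xor_param2 == state ^ (tmp2 ^ xor_param2) */
__m256i combined = _mm256_xor_si256 (tmp2, xor_param2);
__m256i xor_res2 = _mm256_xor_si256 (state, combined);
_mm256_storeu_si256 ((__m256i *)((char *)(reg->node2) + i * sizeof (unsigned long long)), xor_res2);
```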
Question about creating an outermost loop
Hi, I'm working on loop tiling recently and want to add this optimization to GCC, but I have encountered some problems and would like to ask for help. Take the code below as an example: for (int i = 0; i < 12; i++) { for (int j = 0; j < arr.length; j++) { // arr.length may be huge // do something with arr[j] } } I want to create an outermost loop that wraps around the two inner loops, and at the same time change the loop variable of the innermost loop. The intended result is as follows: for (int k = 0; k < 8192; k++) { for (int i = 0; i < 12; i++) { for (int j = 0; j < arr.length / 8192; j++) { // do something with arr[k * (arr.length / 8192) + j] } } } But I don't know how to do this properly; I'm stuck on the virtual operands and PHIs. Is there any existing optimization in GCC that I can refer to? Thanks. Hanke Zhang.
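At the source level, the transformation described above is classic strip-mine-and-interchange. A sketch that also copes with lengths not divisible by the tile size (valid only when the loop iterations are independent, which tiling already presupposes); `len` stands in for arr.length and `work` is a hypothetical placeholder for "do something with arr[j]":

```c
enum { TILE = 8192 };
extern void work (int elem);   /* hypothetical stand-in for the loop body */

void tiled (int *arr, int len)
{
  for (int jj = 0; jj < len; jj += TILE)      /* new outermost loop over tiles */
    for (int i = 0; i < 12; i++)              /* reuse the in-cache tile for every i */
      for (int j = jj; j < jj + TILE && j < len; j++)
        work (arr[j]);
}
```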
Question about creating a global variable during an IPA PASS.
Hi, I'm trying to create a global variable in my own PASS, which is located in LATE_IPA_PASSES. (I'm using GCC 10.3.0.) After creating it, I set its attributes as follows. // 1. create the var tree new_name = get_identifier (xx); tree new_type = build_pointer_type (xx); tree new_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type); add_attributes (new_var); static void add_attributes (tree var) { DECL_ARTIFICIAL (var) = 1; DECL_EXTERNAL (var) = 0; TREE_STATIC (var) = 1; TREE_PUBLIC (var) = 1; TREE_USED (var) = 1; DECL_CONTEXT (var) = NULL_TREE; TREE_THIS_VOLATILE (var) = 0; TREE_ADDRESSABLE (var) = 0; TREE_READONLY (var) = 0; if (is_global_var (var)) set_decl_tls_model (var, TLS_MODEL_NONE); } But when I try to compile some example files with -flto, an error occurs. /usr/bin/ld: xxx.ltrans0.ltrans.o: in function `xxx': xxx.c: undefined reference to `glob_var' xxx.c: undefined reference to `glob_var' xxx.c: undefined reference to `glob_var' Here `glob_var' is the global variable created in my PASS. I would like to ask, am I using some attributes incorrectly? Thanks Hanke Zhang
Re: Question about creating a global variable during an IPA PASS.
Hi Thomas! Thanks for your reply. That's exactly what I'm missing. When I add varpool_node::finalize_decl() to my code, everything works fine! Thomas Schwinge 于2023年12月16日周六 01:15写道: > > Hi Hanke! > > On 2023-12-13T17:04:57+0800, Hanke Zhang via Gcc wrote: > > Hi, I'm trying to create a global variable in my own PASS which > > located at the LATE_IPA_PASSES. (I'm using GCC 10.3.0.) > > I can't comment on IPA aspects, or whether something was different on > oldish GCC 10 (why using that one, by the way?), and I've not actually > verified what you're doing here: > > > And after creating it, I added the attributes like the following. > > > > // 1. create the var > > tree new_name = get_identifier (xx); > > tree new_type = build_pointer_type (xx); > > tree new_var = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type); > > add_attributes (new_var); > > > > static void > > add_attributes (tree var) > > { > > DECL_ARTIFICIAL (var) = 1; > > DECL_EXTERNAL (var) = 0; > > TREE_STATIC (var) = 1; > > TREE_PUBLIC (var) = 1; > > TREE_USED (var) = 1; > > DECL_CONTEXT (var) = NULL_TREE; > > TREE_THIS_VOLATILE (var) = 0; > > TREE_ADDRESSABLE (var) = 0; > > TREE_READONLY (var) = 0; > > if (is_global_var (var)) > > set_decl_tls_model (var, TLS_MODEL_NONE); > > } > > > > But when I try to compile some example files with -flto, error occurs. > > > > /usr/bin/ld: xxx.ltrans0.ltrans.o: in function `xxx': > > xxx.c: undefined reference to `glob_var' > > xxx.c: undefined reference to `glob_var' > > xxx.c: undefined reference to `glob_var' > > > > Here `glob_var' is the global varaiable created in my PASS. > > > > I would like to ask, am I using some attributes incorrectly? > > ..., but are you maybe simply missing to > 'varpool_node::add (var);' or 'varpool_node::finalize_decl (var);' or > something like that? See other uses of those, and description in > 'gcc/cgraph.h', 'struct [...] varpool_node': > > /* Add the variable DECL to the varpool. > Unlike finalize_decl function is intended to be used > by middle end and allows insertion of new variable at arbitrary point > of compilation. */ > static void add (tree decl); > > /* Mark DECL as finalized. By finalizing the declaration, frontend > instruct > the middle end to output the variable to asm file, if needed or > externally > visible. */ > static void finalize_decl (tree decl); > > If that's not it, we'll have to look in more detail. > > > Grüße > Thomas
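Putting the two mails together, a minimal sketch of the creation sequence with the missing registration step added; `glob_var` and the placeholder pointed-to type are stand-ins for whatever the pass actually uses, and add_attributes is the helper from the earlier mail:

```c
tree new_name = get_identifier ("glob_var");
tree new_type = build_pointer_type (integer_type_node);   /* placeholder type */
tree new_var  = build_decl (UNKNOWN_LOCATION, VAR_DECL, new_name, new_type);
add_attributes (new_var);                 /* the flag setup from the mail above */
varpool_node::finalize_decl (new_var);    /* register it so LTO/ltrans emits it */
```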
How to compress the size of a field in a structure?
Hi, I'm attempting to compress the size of a field in a structure to reduce the memory footprint. I created an IPA pass to achieve this, but I ran into the following issue: // original struct Foo { long a1; int a2; }; // modified struct Foo_update { int a1; int a2; }; For the example structure Foo, I use `TREE_TYPE (field) = integer_type_node` to compress the type of a1 from `long` to `int`. But I don't know how to update the corresponding SSA variables, because there are a huge number of them. Is there any way to do this quickly? Thanks Hanke Zhang
Re: How to compress the size of a field in a structure?
Hi lain, Thanks for your thoughtful comments and concerns. The primary goal of compressing the structure fields is to achieve memory efficiency. I understand your points regarding potential issues with modifying the size and types of the structure. I guess rewriting all the corresponding statements may help with it. If there are some more complex situations, we will give up this optimization. But I'm more concerned about how to accomplish this. Thanks Hanke Zhang Iain Sandoe 于2024年1月13日周六 17:15写道: > > > > > On 13 Jan 2024, at 07:45, Hanke Zhang via Gcc wrote: > > > > Hi, I'm attempting to compress the size of a field in a structure for > > memory-friendly purposes. I created an IPA pass to achieve this, but I > > ran into some issues as follows: > > > > // original > > struct Foo { > > long a1; > > int a2; > > }; > > > > // modified > > struct Foo_update { > > int a1; > > int a2; > > }; > > > > For the example structure Foo, I use `TREE_TYPE (field) = > > integer_type_node` to compress the type of a1 from `long` to `int`. > > Hmm. I am curious about under what conditions this is expected to work. > > Altering the size of the structure (or the types it contains) would break > guarantees > expected by at least c-family languages - like sizeof, alignof > what about arrays of the type, or embedding the type in larger aggregates? > > Iain > > > But I don't know how to update its corresponding SSA variables, > > because the number of them is huge. Is there any way to do it quickly? > > > > Thanks > > Hanke Zhang >
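To make the layout concern concrete, a small example of what changes on a typical LP64 target (the sizes shown are the common ones, not mandated by the C standard):

```c
#include <stdio.h>

struct Foo        { long a1; int a2; };   /* usually 16 bytes: 8 + 4 + padding */
struct Foo_update { int  a1; int a2; };   /* usually 8 bytes */

int main (void)
{
  /* Any sizeof, alignof, array stride, or offsetof computed against the
     original layout becomes wrong once a1 is narrowed.  */
  printf ("%zu %zu\n", sizeof (struct Foo), sizeof (struct Foo_update));
  return 0;
}
```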
How to get the default values for parameters?
Hi, I'm trying to get the default values for parameters of some functions in my GIMPLE-PASS. The example code is here: enum { edefault1 = 1, edefault2 = 2, edefault3 = 3 }; void func(int p0, int p1 = edefault1, int p2 = edefault2, int p3 = edefault3) { // do other things } I want to get the default value of `p1` here, but I haven't found a way to get it. I noticed that the macro `DECL_INITIAL` cannot be applied to a PARM_DECL, and the comments say that the default values for parameters are encoded in the type of the function, but I can't find where. If you know how to get the value, I'll be so grateful! :) Thanks Hanke Zhang
Question about constructing vector types in GIMPLE pass
Hi, I've been working on strengthening auto-vectorization on intel CPUs recently. I tried to do it in the GIMPLE pass. And I noticed that some vector types in the GIMPLE code are confusing to me. The example code is here: _1 = MEM[(const __m256i_u * {ref-all})_2]; I wondered how I could construct or get the type `(const __m256i_u * {ref-all})` in the GIMPLE pass. If you have any ideas that can help me. I'll be so grateful! :) Thanks Hanke Zhang
Re: Question about constructing vector types in GIMPLE pass
Hi Marc, Thanks for your reply. I want to create a new type similar to this one `(const __m256i_u * {ref-all})` indeed. And I try to create it via these calls: tree type = build_vector_type_for_mode (intDI_type_node, V4DImode); tree type_p = build_pointer_type_for_mode(type, VOIDmode, true); But when I print the `type_p`, it shows `vector(4) long int * {ref-all}`. So I'm confused if they are the same type or can be transferred to each other. And I'm stucked with another problem that, I want to call `__builtin_ia32_pmovmskb256` in the GIMPLE pass. But I found that this function is defined in `config/i386/i386-builtins.h`. And when I try to include this header file, the error will occur during the compilation. If you know any way to solve this problem, I would be very grateful. :) Thanks Hanke Zhang Marc Glisse 于2024年4月9日周二 03:01写道: > > On Mon, 8 Apr 2024, Hanke Zhang via Gcc wrote: > > > Hi, > > I've been working on strengthening auto-vectorization on intel CPUs > > recently. I tried to do it in the GIMPLE pass. And I noticed that some > > vector types in the GIMPLE code are confusing to me. The example code > > is here: > > > > _1 = MEM[(const __m256i_u * {ref-all})_2]; > > > > I wondered how I could construct or get the type `(const __m256i_u * > > {ref-all})` in the GIMPLE pass. > > > > If you have any ideas that can help me. I'll be so grateful! :) > > I am not sure what you are asking exactly. If you already have access to > such a MEM_REF, then the doc tells you where to look for this type: > > "The first operand is the pointer being dereferenced; it will always have > pointer or reference type. The second operand is a pointer constant > serving as constant offset applied to the pointer being dereferenced > with its type specifying the type to be used for type-based alias > analysis. > The type of the node specifies the alignment of the access." > > If you want to create a new type similar to this one, you can build it > with various tools: > > build_vector_type or build_vector_type_for_mode > build_pointer_type_for_mode(*, VOIDmode, true) to build a pointer that can > alias anything > build_qualified_type to add const (probably useless) > build_aligned_type to specify that it is unaligned > > -- > Marc Glisse
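Combining the helpers Marc lists, a sketch of a vector type with roughly the same properties as `__m256i_u` accessed through a {ref-all} pointer; the pretty-printed name may still differ from `__m256i_u`, but the properties that matter (4 DImode lanes, byte alignment, alias-anything pointer) are the ones set here:

```c
/* Sketch only, inside a pass.  */
tree vec_type  = build_vector_type_for_mode (intDI_type_node, V4DImode);
tree unal_type = build_aligned_type (vec_type, BITS_PER_UNIT);              /* byte-aligned, like __m256i_u */
tree ptr_type  = build_pointer_type_for_mode (unal_type, VOIDmode, true);   /* {ref-all}: can alias anything */
```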
How to format code in GCC style?
Hi, I'm trying to format my code in the GCC style, but I haven't managed to get it right. I tried `indent` and `check_GNU_style.sh` in the `contrib` directory, but neither of them seems to work well. So I wonder: is there an official tool to do that? Thanks Hanke Zhang
Re: How to format code in GCC style?
Hi Filip, Thanks for your reply. But I'm not so familiar with VIM actually. Thus when I try the options listed there, nothing happened. (VSCODE is the IDE I'm using for development.) And I notice that there is a `vimrc` in the `contrib` directory, I tried it too, but sadly, it didn't work for me. Maybe I configured vim in some wrong way. Thanks Hanke Zhang Filip Kastl 于2024年4月30日周二 18:00写道: > > On Tue 2024-04-30 13:52:16, Hanke Zhang via Gcc wrote: > > Hi > > I'm trying to format my code in GCC style. > > > > But I don't know how to finish it. I try `indent` and > > `check_GNU_style.sh` in the `contrib` directory. But none of them seem > > to work well. > > > > So I wondered is there any official tools to do that? > > > > Thanks > > Hanke Zhang > > Hi Hanke, > > Not sure if this will help you but this is what I use: While writing code in > GCC style, I use the vim options listed here: > > https://gcc.gnu.org/wiki/FormattingCodeForGCC > > Then, before I submit a patch I run it through check_GNU_style.sh. This works > fine for me. > > Cheers, > Filip Kastl
Question about optimizing function pointers for direct function calls
Hi, I got a question about optimizing function pointers for direct function calls in C. Consider the following scenario: one of the fields of a structure is a function pointer, and all its assignments come from the same function. Can all its uses be replaced by direct calls to that function, so that later passes can do more optimizations? Here is the example: int add(int a, int b) { return a + b; } int sub(int a, int b) { return a - b; } struct Foo { int (*add)(int, int); }; int main() { struct Foo *foo = malloc(sizeof(struct Foo) * 5); for (int i = 0; i < 5; i++) { foo[i].add = add; } int sum = 0; for (int i = 0; i < 5; i++) { sum += foo[i].add(1, 2); } return 0; } Can I replace the code above with the code below? int add(int a, int b) { return a + b; } int sub(int a, int b) { return a - b; } struct Foo { int (*add)(int, int); }; int main() { struct Foo *foo = malloc(sizeof(struct Foo) * 5); for (int i = 0; i < 5; i++) { foo[i].add = add; } int sum = 0; for (int i = 0; i < 5; i++) { sum += add(1,2); } return 0; } My idea is to determine whether every assignment to the field stores the same function, and if so, perform the replacement. Of course this is not a generally valid optimization; I just want to know whether there are correctness issues in doing so, and whether it is possible to do it in the IPA stage. Thanks Hanke Zhang
Question about the SLP vectorizer failed to perform automatic vectorization in one case
Hi, I'm trying to studing the automatic vectorization optimization in GCC, but I found one case that SLP vectorizer failed to do such things. Here is the sample code: (also a simplification version of a function from the 625/525.x264 source code in SPEC CPU 2017) void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) { for (int y = 0; y < 4; y++) { for (int x = 0; x < 4; x++) diff[x + y * 4] = pix1[x] - pix2[x]; pix1 += 16; pix2 += 32; } } When I compiled with `-O3 -mavx2/-msse4.2`, SLP vectorizer failed to vectorize it, and I got the following message when adding `-fopt-info-vec-all`. (The inner loop will be unrolled) :6:21: optimized: loop vectorized using 8 byte vectors :6:21: optimized: loop versioned for vectorization because of possible aliasing :5:6: note: vectorized 1 loops in function. :5:6: note: * Analysis failed with vector mode V8SI :5:6: note: * The result for vector mode V32QI would be the same :5:6: note: * Re-trying analysis with vector mode V16QI :5:6: note: * Analysis failed with vector mode V16QI :5:6: note: * Re-trying analysis with vector mode V8QI :5:6: note: * Analysis failed with vector mode V8QI :5:6: note: * Re-trying analysis with vector mode V4QI :5:6: note: * Analysis failed with vector mode V4QI If I manually use the type declaration provided by `immintrin.h` to rewrite the code, the code is as follows (which I hope the SLP vectorizer to be able to do) void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) { for (int y = 0; y < 4; y++) { __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]}; __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]}; __v4hi diff_v = pix1_v - pix2_v; *(long long *)(diff + y * 4) = (long long)diff_v; pix1 += 16; pix2 += 32; } } What I want to know is why SLP vectorizer can't vectorize the code here, and what changes do I need to make to SLP vectorizer or the source code if I want it to do so? Thanks Hanke Zhang
Re: Question about the SLP vectorizer failed to perform automatic vectorization in one case
Hi Biener, Thanks for your help! I have already open a bugreport here https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115252. Thanks Hanke Zhang Richard Biener 于2024年5月27日周一 21:14写道: > > On Sat, May 25, 2024 at 3:08 PM Hanke Zhang via Gcc wrote: > > > > Hi, > > I'm trying to studing the automatic vectorization optimization in GCC, > > but I found one case that SLP vectorizer failed to do such things. > > > > Here is the sample code: (also a simplification version of a function > > from the 625/525.x264 source code in SPEC CPU 2017) > > > > void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) { > > for (int y = 0; y < 4; y++) { > > for (int x = 0; x < 4; x++) > > diff[x + y * 4] = pix1[x] - pix2[x]; > > pix1 += 16; > > pix2 += 32; > > The issue is these increments, with only four uint8_t elements accessed > we still want to fill up a vectors worth of them. > > In the end we succeed with v4hi / v8qi but also peel for gaps even though > we handle the half-load case fine. > > > } > > } > > > > When I compiled with `-O3 -mavx2/-msse4.2`, SLP vectorizer failed to > > vectorize it, and I got the following message when adding > > `-fopt-info-vec-all`. (The inner loop will be unrolled) > > > > :6:21: optimized: loop vectorized using 8 byte vectors > > :6:21: optimized: loop versioned for vectorization because of > > possible aliasing > > :5:6: note: vectorized 1 loops in function. > > ^^^ > > so you do see the vectorization as outlined above. > > > :5:6: note: * Analysis failed with vector mode V8SI > > :5:6: note: * The result for vector mode V32QI would be the same > > :5:6: note: * Re-trying analysis with vector mode V16QI > > :5:6: note: * Analysis failed with vector mode V16QI > > :5:6: note: * Re-trying analysis with vector mode V8QI > > :5:6: note: * Analysis failed with vector mode V8QI > > :5:6: note: * Re-trying analysis with vector mode V4QI > > :5:6: note: * Analysis failed with vector mode V4QI > > > > If I manually use the type declaration provided by `immintrin.h` to > > rewrite the code, the code is as follows (which I hope the SLP > > vectorizer to be able to do) > > > > void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) { > > for (int y = 0; y < 4; y++) { > > __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]}; > > __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]}; > > __v4hi diff_v = pix1_v - pix2_v; > > *(long long *)(diff + y * 4) = (long long)diff_v; > > We kind-of do it this way, just > > __v8qi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3], 0, 0, 0, 0}; > ... > > and then unpack __v8qi low to v4hi. > > And unfortunately the last two outer iterations are scalar because of the > gap issue. There's some PRs about this, I did start to work on improving > this, > I'm not sure this exact case is covered so can you open a new bugreport? > > > pix1 += 16; > > pix2 += 32; > > } > > } > > > > What I want to know is why SLP vectorizer can't vectorize the code > > here, and what changes do I need to make to SLP vectorizer or the > > source code if I want it to do so? > > > > Thanks > > Hanke Zhang
Question about generating vpmovzxbd instruction without using the interfaces in immintrin.h
Hi, I've recently been trying to hand-write code to trigger automatic vectorization optimizations in GCC on Intel x86 machines (without using the interfaces in immintrin.h), but I'm running into a problem where I can't seem to get the concise `vpmovzxbd` or similar instructions. My requirement is to convert 8 `uint8_t` elements to `int32_t` type and print the output. If I use the interface (_mm256_cvtepu8_epi32) in immintrin.h, the code is as follows: int immintrin () { int size = 1, offset = 3; uint8_t* a = malloc(sizeof(char) * size); __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset)); for (int i = 0; i < 8; i++) { printf("%d\n", b[i]); } } After compiling with -mavx2 -O3, you can get concise and efficient instructions. (You can see it here: https://godbolt.org/z/8ojzdav47) But if I do not use this interface and instead use a for-loop or the `__builtin_convertvector` interface provided by GCC, I cannot achieve the above effect. The code is as follows: typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8))); int forloop () { int size = 1, offset = 3; uint8_t* a = malloc(sizeof(char) * size); v8qiu av = *(v8qiu *)(a + offset); __v8si b = {}; for (int i = 0; i < 8; i++) { b[i] = (a + offset)[i]; } for (int i = 0; i < 8; i++) { printf("%d\n", b[i]); } } int builtin_cvt () { int size = 1, offset = 3; uint8_t* a = malloc(sizeof(char) * size); v8qiu av = *(v8qiu *)(a + offset); __v8si b = __builtin_convertvector(av, __v8si); for (int i = 0; i < 8; i++) { printf("%d\n", b[i]); } } The instructions generated by both functions are redundant and complex, and are quite difficult to read compared to calling `_mm256_cvtepu8_epi32` directly. (You can see it here as well: https://godbolt.org/z/8ojzdav47) What I want to ask is: How should I write the source code to get assembly instructions similar to directly calling _mm256_cvtepu8_epi32? Or would it be easier if I modified the GIMPLE directly? But it seems that there is no relevant expression or interface directly corresponding to `vpmovzxbd` in GIMPLE. Thanks Hanke Zhang
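One more source-level variant that may be worth an experiment (I have not verified that it produces vpmovzxbd): load the eight bytes with memcpy instead of dereferencing an under-aligned vector pointer, then let __builtin_convertvector do the per-lane widening. The type names are local to this sketch:

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t v8qu  __attribute__ ((vector_size (8)));
typedef int32_t v8si_ __attribute__ ((vector_size (32)));

v8si_ load_widen (const uint8_t *p)
{
  v8qu tmp;
  memcpy (&tmp, p, sizeof tmp);                 /* unaligned-safe 8-byte load */
  return __builtin_convertvector (tmp, v8si_);  /* per-lane uint8_t -> int32_t */
}
```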
How to represent a fallthrough condition (with no else) in "Match and Simplify"?
Hi, I'm trying to study "Match and Simplify" recently, and I had this sample code: int main() { int n = 1000; int *a = malloc (sizeof(int) * n); int *b = malloc (sizeof(int) * n); int *c = malloc (sizeof(int) * n); for (int i = 0; i < n; i++) { if (a[i] & b[i]) { a[i] ^= c[i]; } } } But this code cannot be vectorized very well. I hope it can become like this: int main() { int n = 1000; int *a = malloc (sizeof(int) * n); int *b = malloc (sizeof(int) * n); int *c = malloc (sizeof(int) * n); for (int i = 0; i < n; i++) { int cond = ((a[i] & b[i]) == 1); unsigned int mask = cond ? -1 : 0; a[i] ^= (c[i] & mask); } } This can finally result in concise and efficient vectorized instructions. But I want to know if this can be achieved through "Match and Simplify"? Because when I tried to write the pattern, I found that the condtional statement here seemed not to be matched well, as there is not an else block. Or is this not possible with "Match and Simplify"? Is it possible to implement it in if-conversion? Thanks Hanke Zhang
Re: Question about optimizing function pointers for direct function calls
Richard Biener 于2024年5月24日周五 14:39写道: > > On Fri, May 24, 2024 at 5:53 AM Hanke Zhang via Gcc wrote: > > > > Hi, > > I got a question about optimizing function pointers for direct > > function calls in C. > > > > Consider the following scenario: one of the fields of a structure is a > > function pointer, and all its assignments come from the same function. > > Can all its uses be replaced by direct calls to this function? So the > > later passes can do more optimizations. > > > > Here is the example: > > > > int add(int a, int b) { return a + b; } > > int sub(int a, int b) { return a - b; } > > > > struct Foo { > > int (*add)(int, int); > > }; > > int main() > > { > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5); > > > > for (int i = 0; i < 5; i++) { > > foo[i].add = add; > > } > > > > int sum = 0; > > for (int i = 0; i < 5; i++) { > > sum += foo[i].add(1, 2); > > } > > > > return 0; > > } > > > > Can I replace the code above to the code below? > > > > int add(int a, int b) { return a + b; } > > int sub(int a, int b) { return a - b; } > > > > struct Foo { > > int (*add)(int, int); > > }; > > int main() > > { > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5); > > > > for (int i = 0; i < 5; i++) { > > foo[i].add = add; > > } > > > > int sum = 0; > > for (int i = 0; i < 5; i++) { > > sum += add(1,2); > > } > > > > return 0; > > } > > > > My idea is to determine whether the assignment of the field is the > > same function, and if so, perform the replacement. > > If it's as simple as above then sure, even CSE should do it. If you > can prove otherwise the memory location with the function pointer > always has the same value you are obviously fine. If you just > do not see any other store via 'add's FIELD_DECL then no, that > isn't good enough. Every void * store you do not know where it > goes might go to that slot. > > > Of course this is not a reasonable optimization, I just want to know > > if there are security issues in doing so, and if I want to do it in > > the IPA stage, is it possible? > > For the more general case you can do what we do for speculative > devirtualization - replace the code with > > sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2); Hi Richard, I'm trying to do what you suggested. (sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2);) I created a new IPA-Pass before IPA-INLINE and I made the changes on the GIMPLE in the "function_transform" stage. But my newly created "foo[i].add(1,2)" seems to fail to be inlined. And it was optimized out in the subsequent FRE. I would like to ask if there is any way to mark my newly created cgraph_edge as "inline" in the "function_transform" stage. Here is part of the code creating the call_stmt in true basic_block in my IPA-PASS: // ... // part of the code have been omitted auto_vec params; for (int i = 0; i < gimple_call_num_args (call_stmt); i++) { params.safe_push (gimple_call_arg (call_stmt, i)); } gimple* tbb_call = gimple_build_call_vec (fndecl, params); /// = fn() tree tbb_ssa; if (ret_val_flag) { tbb_ssa = make_ssa_name (TREE_TYPE (gimple_call_lhs(call_stmt)), NULL); gimple_call_set_lhs (tbb_call, tbb_ssa); /// _2 = fn() } gsi = gsi_start_bb (tbb); gsi_insert_before (&gsi, tbb_call, GSI_SAME_STMT); cgraph_edge* tbb_callee = node->create_edge (cgraph_node::get_create (fndecl), (gcall*)tbb_call, tbb_call->bb->count, true); // what should I do to this cgraph_edge to mark it to be inlined // ... Or should I not do these things in the function_transform stage? 
Thanks Hanke Zhang > > that way we can inline the direct call and hopefully the branch will be > well predicted. > > In some SPEC there is IIRC the situation where such speculative > devirtualization candidates can be found solely based on function > signature. With LTO/IPA you'd basically collect candidate targets > for each indirect call (possibly address-taken function definitions > with correct signature) and if there's just a single one you can > choose that as speculative devirt target. > > Speculative devirt as we have now of course works with profile > data to identify the most probable candidate. > > Richard. > > > > > Thanks > > Hanke Zhang
Re: Question about optimizing function pointers for direct function calls
Hi Richard, Thanks for your reply. I'm looking forward to hearing from Honza! But I still have a small question about the IPA passes. If I want to make such a conversion (sum += foo[i].add == add ? add (1,2) : foo[i].add (1,2);), I definitely need to change the function's body. But the article "Using summary information in IPA passes"(https://gcc.gnu.org/onlinedocs/gccint/link-time-optimization/using-summary- Information-in-ipa-passes.html) mentioned that if IPA-Pass needs to change the function body, it needs to be done in the transform stage, but the execution stage of IPA-inline has ended before the transform stage. So we still can’t mark the newly created function calls/cgraph_edge as inlineable. Is that correct? Or did I understand something wrong? Thanks Hanke Zhang Richard Biener 于2024年6月12日周三 18:49写道: > > On Wed, Jun 12, 2024 at 11:57 AM Hanke Zhang wrote: > > > > Richard Biener 于2024年5月24日周五 14:39写道: > > > > > > On Fri, May 24, 2024 at 5:53 AM Hanke Zhang via Gcc > > > wrote: > > > > > > > > Hi, > > > > I got a question about optimizing function pointers for direct > > > > function calls in C. > > > > > > > > Consider the following scenario: one of the fields of a structure is a > > > > function pointer, and all its assignments come from the same function. > > > > Can all its uses be replaced by direct calls to this function? So the > > > > later passes can do more optimizations. > > > > > > > > Here is the example: > > > > > > > > int add(int a, int b) { return a + b; } > > > > int sub(int a, int b) { return a - b; } > > > > > > > > struct Foo { > > > > int (*add)(int, int); > > > > }; > > > > int main() > > > > { > > > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5); > > > > > > > > for (int i = 0; i < 5; i++) { > > > > foo[i].add = add; > > > > } > > > > > > > > int sum = 0; > > > > for (int i = 0; i < 5; i++) { > > > > sum += foo[i].add(1, 2); > > > > } > > > > > > > > return 0; > > > > } > > > > > > > > Can I replace the code above to the code below? > > > > > > > > int add(int a, int b) { return a + b; } > > > > int sub(int a, int b) { return a - b; } > > > > > > > > struct Foo { > > > > int (*add)(int, int); > > > > }; > > > > int main() > > > > { > > > > struct Foo foo[5] = malloc(sizeof(struct Foo) * 5); > > > > > > > > for (int i = 0; i < 5; i++) { > > > > foo[i].add = add; > > > > } > > > > > > > > int sum = 0; > > > > for (int i = 0; i < 5; i++) { > > > > sum += add(1,2); > > > > } > > > > > > > > return 0; > > > > } > > > > > > > > My idea is to determine whether the assignment of the field is the > > > > same function, and if so, perform the replacement. > > > > > > If it's as simple as above then sure, even CSE should do it. If you > > > can prove otherwise the memory location with the function pointer > > > always has the same value you are obviously fine. If you just > > > do not see any other store via 'add's FIELD_DECL then no, that > > > isn't good enough. Every void * store you do not know where it > > > goes might go to that slot. > > > > > > > Of course this is not a reasonable optimization, I just want to know > > > > if there are security issues in doing so, and if I want to do it in > > > > the IPA stage, is it possible? > > > > > > For the more general case you can do what we do for speculative > > > devirtualization - replace the code with > > > > > > sum += foo[i].add == add ? add (1,2) : foo[i].add(1,2); > > > > Hi Richard, > > > > I'm trying to do what you suggested. (sum += foo[i].add == add ? 
add > > (1,2) : foo[i].add(1,2);) > > > > I created a new IPA-Pass before IPA-INLINE and I made the changes on > > the GIMPLE in the "function_transform" stage. But my newly created > > "foo[i].add(1,2)" seems to fail to be inlined. And it was optimized > > out in the subsequent FRE. I would like to ask if there is any w
Question about enabling additional flags in Ofast
Hi, I'm attempting to enable more flags in Ofast, but I've encountered some issues. For instance, if I want to add -flto-partition=one to Ofast, here is the modification I made to opts.cc /* -Ofast adds optimizations to -O3. */ { OPT_LEVELS_FAST, OPT_ffast_math, NULL, 1 }, { OPT_LEVELS_FAST, OPT_fallow_store_data_races, NULL, 1 }, { OPT_LEVELS_FAST, OPT_fsemantic_interposition, NULL, 0 }, ++ { OPT_LEVELS_FAST, OPT_flto_partition_, NULL, LTO_PARTITION_ONE }, However, when I run the tests (make -k check), I encounter some problems. One of the failed test files is gcc.dg/pr89342.c: /* PR other/89342 */ /* { dg-do compile } */ /* { dg-options "-O0" } */ __attribute__((optimize("Ofast"))) void foo (void) { __attribute__((optimize("no-inline"))) void bar (void) {} bar (); } The error message is as follows: ../gcc/testsuite/gcc.dg/pr89342.c:7:1: internal compiler error: ‘global_options’ are modified in local context 7 | { | ^ 0x9e8cf4 handle_optimize_attribute ../../gcc/c-family/c-attribs.cc:5568 0x8e0902 decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.cc:872 0x8fe39c start_function(c_declspecs*, c_declarator*, tree_node*) ../../gcc/c/c-decl.cc:9519 0x962336 c_parser_declaration_or_fndef ../../gcc/c/c-parser.cc:2445 0x96b803 c_parser_external_declaration ../../gcc/c/c-parser.cc:1779 0x96c263 c_parser_translation_unit ../../gcc/c/c-parser.cc:1652 0x96c263 c_parse_file() ../../gcc/c/c-parser.cc:23372 0x9cf4a5 c_common_parse_file() ../../gcc/c-family/c-opts.cc:1240 Upon debugging, I found that the error occurs in the cl_optimization_compare(gcc_options *ptr1, gcc_options *ptr2) function in options-save.cc. Specifically, the discrepancy is here: ptr1->x_flag_lto_partition: 2, ptr2->x_flag_lto_partition: 1 This discrepancy leads to the error. Could you advise on how to resolve this issue? (Note: My ultimate goal is not to add the -flto-partition=one flag. This is just an example to illustrate the problem.) Thank you for your assistance. Thanks Hanke Zhang
Re: Question about optimizing dynamic_cast call in C++ at GIMPLE Level
Hi Florian, Thank you very much for your reply. This is not related to ELF. What I am trying to do is simply obtain the 'Derived' type node from the parameter that is retrieved via the '__dynamic_cast' call. I believe this should be a relatively simple coding problem. However, I am not very familiar with GCC source code at present, and that is the reason why I am asking for help here. Thanks, Hanke Zhang Florian Weimer 于2025年1月18日周六 20:50写道: > > * Hanke Zhang via Gcc: > > > I have recently been delving into optimizing dynamic_cast calls in C++ > > within the context of GIMPLE code. In particular, for scenarios > > involving single inheritance, I'm aiming to convert the following > > code: > > > > _1 = __dynamic_cast (obj_1(D), &_ZTI7Base, &_ZTI10Derived, 0); > > if (_1!= 0B) > > > > into a more direct and efficient form, such as: > > > > _2 = MEM[obj_1(D)]; > > _3 = MEM[_2 + -8B]; > > if (_3 == &_ZTI10Derived) > > > > However, during this process, I encountered an issue. I attempted to > > obtain the type node of the Derived class through the _ZTI10Derived, > > which I believe should be a VAR_DECL. Regrettably, my efforts were > > unsuccessful. > > Is this for ELF? I think our ELF ABIs require that a full string > comparison is performed if the pointers do not match. Otherwise, types > with different vtables will suddenly be considered distinct under RTTI. > The symbols for vtables can be hidden if “local: *;” is used in a linker > script, for example. Currently, as far as I understand it, this does > not have a direct impact on RTTI. > > Thanks, > Florian >
Question about optimizing dynamic_cast call in C++ at GIMPLE Level
Hi, I have recently been delving into optimizing dynamic_cast calls in C++ within the context of GIMPLE code. In particular, for scenarios involving single inheritance, I'm aiming to convert the following code: _1 = __dynamic_cast (obj_1(D), &_ZTI7Base, &_ZTI10Derived, 0); if (_1!= 0B) into a more direct and efficient form, such as: _2 = MEM[obj_1(D)]; _3 = MEM[_2 + -8B]; if (_3 == &_ZTI10Derived) However, during this process, I encountered an issue. I attempted to obtain the type node of the Derived class through the _ZTI10Derived, which I believe should be a VAR_DECL. Regrettably, my efforts were unsuccessful. I'm reaching out to inquire if there's a viable method to achieve this. Specifically, is there a way to retrieve the type node of the original class from the third parameter of the __dynamic_cast call? Additionally, if there are any other aspects I should be aware of during this optimization process, I would greatly appreciate your insights. Thank you very much for your time and assistance. Best regards, Hanke Zhang
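For context, a hypothetical single-inheritance pattern of the sort the quoted GIMPLE comes from; the class names here are stand-ins (the mangled _ZTI symbols in the mail imply longer identifiers):

```cpp
struct Base    { virtual ~Base () {} };
struct Derived : Base { };

bool is_derived (Base *obj)
{
  // Lowered to a call of the form
  //   __dynamic_cast (obj, &typeinfo-for-Base, &typeinfo-for-Derived, 0)
  // followed by a null check, as in the GIMPLE quoted above.
  return dynamic_cast<Derived *> (obj) != nullptr;
}
```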