Git push account
Which account should I use to push my local patch to the git repo of gcc? I have a sourceware account that works for svn, but it doesn't work for git. I actually tried both commands below, and both failed:

  git push ssh://f...@sourceware.org/gcc/gcc.git ...
  git push ssh://f...@gcc.gnu.org/gcc/gcc.git ...

Thanks,
Feng
Re: Git push account
Thanks.

Feng

From: Andreas Schwab
Sent: Saturday, January 25, 2020 10:07 PM
To: Feng Xue OS
Cc: gcc@gcc.gnu.org
Subject: Re: Git push account

On Jan 25 2020, Feng Xue OS wrote:

> Which account should I use to push my local patch to git repo of gcc?

See <http://gcc.gnu.org/git/?p=gcc.git;a=summary>.

> I have a sourceware account that works for svn, but now it doesn't for git.
> Actually both below commands were tried, but failed.
> git push ssh://f...@sourceware.org/gcc/gcc.git ...
> git push ssh://f...@gcc.gnu.org/gcc/gcc.git ...

Replace /gcc/ by /git/.

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
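To make Andreas' fix concrete, here is a small shell sketch of the URL rewrite he describes ("user" is only a placeholder for the poster's elided sourceware account name):

```shell
# Illustration only: "user" stands in for your actual sourceware account name.
wrong="ssh://user@gcc.gnu.org/gcc/gcc.git"

# Andreas' fix: replace /gcc/ by /git/ in the repository path.
right=$(printf '%s\n' "$wrong" | sed 's|/gcc/gcc\.git|/git/gcc.git|')
printf '%s\n' "$right"

# A push command would then look like:
#   git push "$right" master
```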
Re: Git push account
As pointed out by Andreas, I used the wrong git repo URL. Now it works.

Thanks,
Feng

From: Jeff Law
Sent: Sunday, January 26, 2020 12:18 AM
To: Feng Xue OS; gcc@gcc.gnu.org
Subject: Re: Git push account

On Sat, 2020-01-25 at 12:39 +, Feng Xue OS wrote:

> Which account should I use to push my local patch to git repo of gcc?
> I have a sourceware account that works for svn, but now it doesn't for git.
> Actually both below commands were tried, but failed.
> git push ssh://f...@sourceware.org/gcc/gcc.git ...
> git push ssh://f...@gcc.gnu.org/gcc/gcc.git ...

It shouldn't matter. Under the hood sourceware.org and gcc.gnu.org are the same machine.

jeff
How to modify Bugzilla tracker
Hi,

I want to change some information on an existing Bugzilla tracker, such as the assignee, but I found no way to do this from the tracker's page. How can I get this ability?

Thanks,
Feng
Re: How to modify Bugzilla tracker
I have a sourceware.org account, but cannot correlate that account with a Bugzilla login. I just want to assign a tracker to someone.

Thanks,
Feng

From: Jonathan Wakely
Sent: Wednesday, December 11, 2019 7:07 PM
To: Feng Xue OS
Cc: gcc@gcc.gnu.org
Subject: Re: How to modify Bugzilla tracker

On Wed, 11 Dec 2019 at 02:25, Feng Xue OS wrote:
>
> Hi,
>
> I want to alter some information of an existing Bugzilla tracker, such as
> assignee, but found there is no entrance in page of the Bugzilla tracker to
> do this. How can I get it?

Usually that's restricted to GCC developers, using their u...@gcc.gnu.org login. What do you want to change?
Re: How to modify Bugzilla tracker
> On Thu, 12 Dec 2019 at 09:29, Andreas Schwab wrote:
>>
>> On Dez 12 2019, Jonathan Wakely wrote:
>>
>> > On Thu, 12 Dec 2019 at 09:19, Thomas Schwinge wrote:
>> >>
>> >> Hi!
>> >>
>> >> I'm not an admin there; CCing the Overseers.
>> >>
>> >> On 2019-12-12T01:58:46+, Feng Xue OS wrote:
>> >> > I have a sourceware.org account, but can not correlate the account with
>> >> > Bugzilla login.
>> >>
>> >> Yes: 'gcc.gnu.org' is a separate Bugzilla instance.
>> >>
>> >> > Just want to assign a tracker to someone.
>> >>
>> >> To be able to do that, as far as I know, you'll need to use (set up a
>> >> new?) account associated with your 'f...@gcc.gnu.org' email address.
>> >> Addresses '@gcc.gnu.org' automatically get special permissions.
>> >>
>> >> I see you do have an account 'f...@os.amperecomputing.com'. Maybe it's
>> >> possible to add another email address to that one (or replace it?), or if
>> >> not, I remember that I once (manually?) got two accounts merged into
>> >> 'tschwi...@gcc.gnu.org', so that's another option for you, assuming you'd
>> >> like just one account.
>> >
>> > @sourceware.org and @gcc.gnu.org accounts are also separate, and I
>> > don't see your name in the GCC MAINTAINERS file (apologies if I missed
>> > it).
>>
>> The f...@gcc.gnu.org account already exists, he just needs to use it.
>
> Ah OK, so just use the "reset password" link to get access to that account
> then.

But I did not know that I own f...@gcc.gnu.org, and I do not know how to access that mailbox, to which the "reset password" instructions would be sent. Actually, I did ask the overseers to create a gcc.gnu.org account, but was told it is not needed because I already have one @sourceware.org.

BTW: I'm not a maintainer.

Thanks,
Feng
Re: How to modify Bugzilla tracker
It works.

Thanks,
Feng

From: gcc-ow...@gcc.gnu.org on behalf of Feng Xue OS
Sent: Friday, December 13, 2019 10:13 AM
To: Jonathan Wakely; Andreas Schwab
Cc: Thomas Schwinge; gcc@gcc.gnu.org; overse...@sourceware.org
Subject: Re: How to modify Bugzilla tracker

> Ah OK, so just use the "reset password" link to get access to that account
> then.
[RFC] Add new flag to specify output constraint in match.pd
Hi,

There is a match-folding issue derived from pr94234. A piece of code like:

  int foo (int n)
  {
      int t1 = 8 * n;
      int t2 = 8 * (n - 1);

      return t1 - t2;
  }

It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C", and be folded to the constant "8". But this folding will fail if both t1 and t2 have multiple uses, as in the following code.

  int foo (int n)
  {
      int t1 = 8 * n;
      int t2 = 8 * (n - 1);

      use_fn (t1, t2);
      return t1 - t2;
  }

Given an expression with non-single-use operands, folding it will introduce duplicated computation in most situations, and is deemed to be unprofitable. But it is always beneficial if the final result is a constant or an existing SSA value.

And the rule is:

  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
         || TYPE_OVERFLOW_WRAPS (type)
         || (INTEGRAL_TYPE_P (type)
             && tree_expr_nonzero_p (@0)
             && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
        /* If @1 +- @2 is constant require a hard single-use on either
           original operand (but not on both).  */
        && (single_use (@3) || single_use (@4)))   <- control whether match or not
    (mult (plusminus @1 @2) @0)))

The current matcher only provides a way to check something before folding, but no mechanism to affect the decision after folding. If it had one, for the above case, we could let the fold go ahead when we find the result is a constant.

Like the way to describe an input operand using flags, we could also add a new flag to specify this kind of constraint on the output, namely that we expect it to be a simple gimple value.

The proposed syntax is:

  (opcode:v{ condition } )

The char "v" stands for gimple value; if another char is more descriptive, it is also fine. "condition", enclosed by { }, is an optional C-syntax condition expression. If present, only when "condition" is met will the matcher check whether the folding result is a gimple value, using gimple_simplified_result_is_gimple_val ().

Since there is no SSA concept in GENERIC, this is only for GIMPLE-match, not GENERIC-match.
With this syntax, the rule is changed to:

#Form 1:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if (!ANY_INTEGRAL_TYPE_P (type)
        || TYPE_OVERFLOW_WRAPS (type)
        || (INTEGRAL_TYPE_P (type)
            && tree_expr_nonzero_p (@0)
            && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
    (if (!single_use (@3) && !single_use (@4))
     (mult:v (plusminus @1 @2) @0)
     (mult (plusminus @1 @2) @0))))

#Form 2:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if (!ANY_INTEGRAL_TYPE_P (type)
        || TYPE_OVERFLOW_WRAPS (type)
        || (INTEGRAL_TYPE_P (type)
            && tree_expr_nonzero_p (@0)
            && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
    (mult:v{ !single_use (@3) && !single_use (@4) } (plusminus @1 @2) @0)))

This is just a proposal, and has not been implemented yet. I hope to get your comments on it.

Best Regards,
Feng
Re: [RFC] Add new flag to specify output constraint in match.pd
>> There is a match-folding issue derived from pr94234. A piece of code like:
>>
>> int foo (int n)
>> {
>>     int t1 = 8 * n;
>>     int t2 = 8 * (n - 1);
>>
>>     return t1 - t2;
>> }
>>
>> It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C",
>> and be folded to constant "8". But this folding will fail if both t1 and t2
>> have multiple uses, as the following code.
>>
>> int foo (int n)
>> {
>>     int t1 = 8 * n;
>>     int t2 = 8 * (n - 1);
>>
>>     use_fn (t1, t2);
>>     return t1 - t2;
>> }
>>
>> Given an expression with non-single-use operands, folding it will introduce
>> duplicated computation in most situations, and is deemed to be unprofitable.
>> But it is always beneficial if final result is a constant or existing SSA
>> value.
>>
>> And the rule is:
>> (simplify
>>  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>  (if ((!ANY_INTEGRAL_TYPE_P (type)
>>        || TYPE_OVERFLOW_WRAPS (type)
>>        || (INTEGRAL_TYPE_P (type)
>>            && tree_expr_nonzero_p (@0)
>>            && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
>>       /* If @1 +- @2 is constant require a hard single-use on either
>>          original operand (but not on both).  */
>>       && (single_use (@3) || single_use (@4)))   <- control whether match or not
>>   (mult (plusminus @1 @2) @0)))
>>
>> Current matcher only provides a way to check something before folding,
>> but no mechanism to affect decision after folding. If has, for the above
>> case, we can let it go when we find result is a constant.
>
> :s already has a counter-measure where it still folds if the output is at
> most one operation. So this transformation has a counter-counter-measure
> of checking single_use explicitly. And now we want a counter^3-measure...

The counter-measure is a key factor in matching cost. ":s" seems to be somewhat coarse-grained, and here we do need more control over it.

But ideally, we could decouple these counter-measures from the definitions of the match rules, and let the gimple matcher make a more reasonable match-or-not decision based on these counters. Anyway, that is another story.

>> Like the way to describe input operand using flags, we could also add
>> a new flag to specify this kind of constraint on output that we expect
>> it is a simple gimple value.
>>
>> Proposed syntax is
>>
>> (opcode:v{ condition } )
>>
>> The char "v" stands for gimple value, if more descriptive, other char is
>> preferred. "condition" enclosed by { } is an optional c-syntax condition
>> expression. If present, only when "condition" is met, matcher will check
>> whether folding result is a gimple value using
>> gimple_simplified_result_is_gimple_val ().
>>
>> Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
>> not GENERIC-match.
>>
>> With this syntax, the rule is changed to
>>
>> #Form 1:
>> (simplify
>>  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>  (if (!ANY_INTEGRAL_TYPE_P (type)
>>       || TYPE_OVERFLOW_WRAPS (type)
>>       || (INTEGRAL_TYPE_P (type)
>>           && tree_expr_nonzero_p (@0)
>>           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
>>   (if (!single_use (@3) && !single_use (@4))
>>    (mult:v (plusminus @1 @2) @0)
>>    (mult (plusminus @1 @2) @0))))
>
> That seems to match what you can do with '!' now (that's very recent).
>
>> #Form 2:
>> (simplify
>>  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>  (if (!ANY_INTEGRAL_TYPE_P (type)
>>       || TYPE_OVERFLOW_WRAPS (type)
>>       || (INTEGRAL_TYPE_P (type)
>>           && tree_expr_nonzero_p (@0)
>>           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
>>   (mult:v{ !single_use (@3) && !single_use (@4) } (plusminus @1 @2) @0)))
>
> Indeed, something more flexible than '!' would be nice, but I am not so
> sure about this version. If we are going to allow inserting code after
> resimplification and before validation, maybe we should go even further
> and let people insert arbitrary code there...

Yes. I think so, and I tried to do that. For sure, the result of resimplification must be core data referenced by such user code. But it is hard to unify the result representation for the GIMPLE and GENERIC matchers. And one more consideration is how often this generalized usage would occur.

Thanks,
Feng
Re: [RFC] Add new flag to specify output constraint in match.pd
>> >> >> There is a match-folding issue derived from pr94234. [...]
>> >> >>
>> >> >> Current matcher only provides a way to check something before folding,
>> >> >> but no mechanism to affect decision after folding. If has, for the above
>> >> >> case, we can let it go when we find result is a constant.
>> >> >
>> >> > :s already has a counter-measure where it still folds if the output is at
>> >> > most one operation. So this transformation has a counter-counter-measure
>> >> > of checking single_use explicitly. And now we want a counter^3-measure...
>> >> >
>> >> Counter-measure is key factor to matching-cost. ":s" seems to be somewhat
>> >> coarse-grained. And here we do need more control over it.
>> >>
>> >> But ideally, we could decouple these counter-measures from definitions of
>> >> match-rule, and let gimple-matcher get a more reasonable match-or-not
>> >> decision based on these counters. Anyway, it is another story.
>> >>
>> >> >> Like the way to describe input operand using flags, we could also add
>> >> >> a new flag to specify this kind of constraint on output that we expect
>> >> >> it is a simple gimple value.
>> >> >>
>> >> >> Proposed syntax is
>> >> >>
>> >> >> (opcode:v{ condition } )
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> With this syntax, the rule is changed to
>> >> >>
>> >> >> #Form 1:
>> >> >> (simplify
>> >> >>  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>> >> >>  (if (!ANY_INTEGRAL_TYPE_P (type)
>> >> >>       || TYPE_OVERFLOW_WRAPS (type)
>> >> >>       || (INTEGRAL_TYPE_P (type)
>> >> >>           && tree_expr_nonzero_p (@0)
>> >> >>           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
>> >> >>   (if (!single_use (@3) && !single_use (@4))
>> >> >>    (mult:v (plusminus @1 @2) @0)
>> >> >>    (mult (plusminus @1 @2) @0))))
>> >> >
>> >> > That seems to match what you can do with '!' now (that's very recent).
>
> It's also what :s does but a slight bit more "local". When any operand is
> marked :s and it has more than a single-use we only allow simplifications
> that do not require insertion of extra stmts. So basically the above pattern
> doesn't behave any different than if you omit your :v. Only if you'd
> place :v on an inner expression there would be a difference. Correlating
> the inner expression we'd not want to insert new expressions for with
> a specific :s (or multiple ones) would be a more natural extension of what
> :s provides.
>
> Thus, for the above case (Form 1), you do not need :v at all and :s works.

Between ":s" and ":v", there is a subtle difference. ":s" only ensures that the interior transform does not insert any new stmts, but this is not true for the final one.

Code snippet generated for (A * C) +- (B * C) -> (A +- B) * C:

  gimple_seq *lseq = seq;
  if (lseq && (!single_use (captures[0]) || !single_use (captures[3])))
    lseq = NULL;
  if (__builtin_expect (!dbg_cnt (match), 0))
    goto next_after_fail621;
  if (__builtin_expect (dump_file && (dump_flags & TDF_FOLDING), 0))
    fprintf (dump_file, "Applying patte
Re: [RFC] Add new flag to specify output constraint in match.pd
>> >> >> >> There is a match-folding issue derived from pr94234. [...]
>> >> >> >
>> >> >> > :s already has a counter-measure where it still folds if the output is
>> >> >> > at most one operation. So this transformation has a counter-counter-measure
>> >> >> > of checking single_use explicitly. And now we want a counter^3-measure...
>> >> >> >
>> >> >> Counter-measure is key factor to matching-cost. ":s" seems to be somewhat
>> >> >> coarse-grained. And here we do need more control over it.
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> >> With this syntax, the rule is changed to
>> >> >> >>
>> >> >> >> #Form 1: [...]
>> >> >> >
>> >> >> > That seems to match what you can do with '!' now (that's very recent).
>> >
>> > It's also what :s does but a slight bit more "local". When any operand is
>> > marked :s and it has more than a single-use we only allow simplifications
>> > that do not require insertion of extra stmts. So basically the above pattern
>> > doesn't behave any different than if you omit your :v. Only if you'd
>> > place :v on an inner expression there would be a difference. Correlating
>> > the inner expression we'd not want to insert new expressions for with
>> > a specific :s (or multiple ones) would be a more natural extension of what
>> > :s provides.
>> >
>> > Thus, for the above case (Form 1), you do not need :v at all and :s works.
>>
>> Between ":s" and ":v", there is a subtle difference. ":s" only ensures that
>> the interior transform does not insert any new stmts, but this is not true
>> for the final one.
>>
>> Code s
[RFC] A memory gathering optimization for loop
1. Background

In a loop, it is optimal if only one memory stream is activated, that is, all memory operations sequentially access one data region. But that is not always the case, such as when traversing a linked list or manipulating discrete arrays. In this scenario, the loop would contain multiple scattered memory accesses, which could trigger inefficient multi-way hardware prefetching, thus making the cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) gives a description of one related problem that we mean to target.

2. Memory gathering optimization (MGO)

For nested loops, if scattered memory accesses inside the inner loop remain unchanged in each iteration of the outer loop, we can dynamically gather the result data into a sequential memory region when the first access in the inner loop takes place. This way the hardware prefetching can be reduced, and the cpu cache hit rate can be improved. We name this optimization MGO (memory gathering optimization). An example is given below, which is a little different from pr98598, and could not be optimized by loop distribution.

  struct A { int **p1; };

  int foo (int n, struct A *array)
  {
      int sum = 0;

      for (int i = 0; i < n; i++) {
          for (int j = 0; j < i; j++) {  /* iteration count is i */
              int **p1 = array[j].p1;
              int *p2 = *p1;

              sum += *p2;
          }
      }

      return sum;
  }

Although this example seems to be pretty artificial, this kind of data reuse in nested loops is often seen in real applications, especially those written in Fortran or using the C++ STL.

To simplify the explanation of this memory gathering optimization, pseudo code for the original nested loops is given as:

  outer-loop ( ) {  /* data in inner loop are also invariant in outer loop. */
      ...
      inner-loop (iter, iter_count) {  /* inner loop to apply MGO */

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = LOAD_FN2 (v1);
          Type3 v3 = LOAD_FN3 (v2);
          ...
          iter = NEXT_FN (iter);
      }
      ...
  }

"LOAD_FNx()" is a conceptual function that translates its argument to a result, and is composed of a sequence of operations in which there is only one memory load. It is capable of representing a simple memory dereference like "iter->field", or a more complicated expression like "array[iter * 2 + cst].field".

Likewise, "NEXT_FN()" is also a conceptual function which defines how an iterator is advanced. It is able to describe a simple integer iterator, a list iterator, or even a pure-function-based iterator on advanced stuff like a hashset/tree.

Then the code will be transformed to:

  typedef struct cache_elem {
      bool init;
      Type1 c_v1;
      Type2 c_v2;
      Type3 c_v3;
  } cache_elem;

  cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

  outer-loop () {
      ...
      size_t cache_idx = 0;

      inner-loop (iter, iter_count) {

          if (!cache_arr[cache_idx]->init) {
              v1 = LOAD_FN1 (iter);
              v2 = LOAD_FN2 (v1);
              v3 = LOAD_FN3 (v2);

              cache_arr[cache_idx]->init = true;
              cache_arr[cache_idx]->c_v1 = v1;
              cache_arr[cache_idx]->c_v2 = v2;
              cache_arr[cache_idx]->c_v3 = v3;
          } else {
              v1 = cache_arr[cache_idx]->c_v1;
              v2 = cache_arr[cache_idx]->c_v2;
              v3 = cache_arr[cache_idx]->c_v3;
          }
          ...
          cache_idx++;
          iter = NEXT_FN (iter);
      }
      ...
  }

  free (cache_arr);

In the transformation, the cache space belongs to compiler-generated temporary storage, but it comes from the user heap. We find this trick is very uncommon in gcc, and we'd like to make sure whether there are any special considerations or restrictions on this. Anyway, using alloca() to get space from the stack is an alternative, but we should be cautious to avoid incurring stack overflow by non-user logic.

"iter_count" stands for an upper bound on the inner-loop iteration count, which must be an outer-loop invariant expression, in that it determines the data count in the cache space, and the space is allocated before the outer-loop.
In the example below, we can simply get such a bound for inner-loop 1 from its niter_desc, while for inner-loop 2, it is not that easy, but it is possible to infer that the loop's iteration count never exceeds "n" via comparison predicate information.

  foo (int n, int m)
  {
      for (int i = 0; i < n; i++) {      /* outer-loop 1 */
          ...
          for (int j = 0; j < m; j++) {  /* inner-loop 1 */
              ...
          }
          ...
      }

      for (int i = 0; i < n; i++) {      /* outer-loop 2 */
          ...
          for (int j = 0; j < i; j++) {  /* inner-loop 2 */
              ...
          }
          ...
      }
  }

3. Cache count threshold

We expect the cache space to work as a normal stack temporary, and not impact execution of the program so much, especially when heap
Re: [RFC] A memory gathering optimization for loop
>> -Original Message-
>> From: Feng Xue OS
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS; Hao Liu OS
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> [...]
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>     int sum = 0;
>>
>>     for (int i = 0; i < n; i++) {
>>         for (int j = 0; j < i; j++) {  /* iteration count is i */
>>             int **p1 = array[j].p1;
>>             int *p2 = *p1;
>>
>>             sum += *p2;
>>         }
>>     }
>>
>>     return sum;
>> }
>>
> The number of memory streams in this loop is limited (practically 1), how
> would the transform be affected when issuing prefetches for array at a well
> placed spot?
>
> That is, what's the hazard we avoid (besides the obvious data dependence
> shortening in case the load from array is in cache?)
>
> Richard.
> In the example, to evaluate "sum += *p2" at each iteration, three memory loads are involved, each of which forms a memory stream (so at least 3 in total, not 1), since they touch three discrete regions. Spatial locality only exists for the first load "array[j].p1"; some kind of prefetch, either via hardware or software, works well for it, while for the other two, prefetch would be of no use, and the CPU cache could not speculate where those loads will go. So the intention of MGO is to put those discontinuous and unpredictable memory accesses together to form one new sequential memory stream in favor of the CPU cache. And if we statically find a hazard due to a write dependence, MGO will not be applied. Thanks, Feng >> Although this example seems to be pretty artificial, the kind of data reuse in >> nested loops is often seen in real applications, especially those written in >> Fortran or using the C++ STL. >> >> To simplify explanation of this memory gathering optimization, pseudo code >> for the original nested loops is given as: >> >> outer-loop ( ) { /* data in inner loop are also invariant in outer loop. >> */ >> ... >> inner-loop (iter, iter_count) { /* inner loop to apply MGO */ >> >> Type1 v1 = LOAD_FN1 (iter); >> Type2 v2 = LOAD_FN2 (v1); >> Type3 v3 = LOAD_FN3 (v2); >> ... >> iter = NEXT_FN (iter); >> } >> >> ... >> } >> >> "LOAD_FNx()" is a conceptual function that translates its argument to a >> result, and is composed of a sequence of operations in which there is only one >> memory load. It is capable of representing a simple memory dereference like >> "iter->field", or a more complicated expression like "array[iter * 2 + cst].field". >> >> Likewise, "NEXT_FN()" is also a conceptual function which defines how an >> iterator is advanced. It is able to describe a simple integer iterator, a list >> iterator, or even a pure-function-based iterator over advanced structures like a >> hashset/tree.
>> >> Then the code will be transformed to: >> >> typedef struct cache_elem { >> bool init; >> Type1 c_v1; >> Type2 c_v2; >> Type3 c_v3; >> } cache_elem; >> >> cache_el
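To make Richard's prefetch question concrete, here is a hedged sketch (the function name, self-check data, and the prefetch distance of 8 are invented for illustration; GCC's __builtin_prefetch is used) of why a well-placed software prefetch only covers the first of the three streams in foo():

```cpp
#include <cstddef>

struct A { int **p1; };

// Only the first stream, array[j].p1, has a compiler-computable
// address, so only it can be software-prefetched ahead of time.  The
// targets of *p1 and *p2 are data-dependent and unknown until the
// previous load completes.  The distance 8 is an arbitrary choice.
int foo_prefetch (int n, struct A *array)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < i; j++)
      {
        if (j + 8 < i)                       // stay within the touched range
          __builtin_prefetch (&array[j + 8], 0 /* read */, 3 /* locality */);
        int **p1 = array[j].p1;              // stream 1: sequential
        int *p2 = *p1;                       // stream 2: unpredictable
        sum += *p2;                          // stream 3: unpredictable
      }
  return sum;
}

// Illustrative data for a quick self-check: vals[j] = j + 1.
int foo_prefetch_demo ()
{
  int vals[4] = {1, 2, 3, 4};
  int *ptrs[4];
  struct A arr[4];
  for (int k = 0; k < 4; k++)
    {
      ptrs[k] = &vals[k];
      arr[k].p1 = &ptrs[k];
    }
  return foo_prefetch (4, arr);             // sums 1 + (1+2) + (1+2+3)
}
```

The prefetch shortens the latency of stream 1 only; MGO, by contrast, replaces streams 2 and 3 with one sequential stream after the first outer iteration.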
Re: [RFC] A memory gathering optimization for loop
>> To simplify explanation of this memory gathering optimization, pseudo >> code for the original nested loops is given as: >> >> outer-loop ( ) { /* data in inner loop are also invariant in outer loop. >> */ >> ... >> inner-loop (iter, iter_count) { /* inner loop to apply MGO */ >> >> Type1 v1 = LOAD_FN1 (iter); >> Type2 v2 = LOAD_FN2 (v1); >> Type3 v3 = LOAD_FN3 (v2); >> ... >> iter = NEXT_FN (iter); >> } >> >> ... >> } >> >> "LOAD_FNx()" is a conceptual function that translates its argument to a >> result, and is composed of a sequence of operations in which there is >> only one memory load. It is capable of representing a simple memory >> dereference like "iter->field", or a more complicated expression like >> "array[iter * 2 + cst].field". >> >> Likewise, "NEXT_FN()" is also a conceptual function which defines how an >> iterator is advanced. It is able to describe a simple integer iterator, a >> list iterator, or even a pure-function-based iterator over advanced >> structures like a hashset/tree. >> >> Then the code will be transformed to: >> >> typedef struct cache_elem { >> bool init; >> Type1 c_v1; >> Type2 c_v2; >> Type3 c_v3; >> } cache_elem; >> >> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem)); >> >> outer-loop () { >> ... >> size_t cache_idx = 0; >> >> inner-loop (iter, iter_count) { >> >> if (!cache_arr[cache_idx].init) { >> v1 = LOAD_FN1 (iter); >> v2 = LOAD_FN2 (v1); >> v3 = LOAD_FN3 (v2); >> >> cache_arr[cache_idx].init = true; >> cache_arr[cache_idx].c_v1 = v1; >> cache_arr[cache_idx].c_v2 = v2; >> cache_arr[cache_idx].c_v3 = v3; >> } >> else { >> v1 = cache_arr[cache_idx].c_v1; >> v2 = cache_arr[cache_idx].c_v2; >> v3 = cache_arr[cache_idx].c_v3; >> } >> ... >> cache_idx++; >> iter = NEXT_FN (iter); >> } >> ... >> } >> >> free (cache_arr); > > Why do you need the init field? Isn't the array initialized > sequentially?
This is meant to support a more complicated pattern, in which the actual iteration count of inner_loop depends on outer_loop, and "iter_count" only represents an upper bound on the count. for (i = 0; i < N; i++) { for (j = 0; j < i; j++) { Type1 v1 = LOAD_FN1 (j); Type2 v2 = LOAD_FN2 (v1); Type3 v3 = LOAD_FN3 (v2); ... condition = ... } if (condition) break; } In the above case, "iter_count" is N, and is not the exact count of inner_loop. Loop execution might end early on a certain condition before going through all loads. We should not cache all of them in one step, since some loads might be invalid if the condition is not met. So we need an "init" field to track whether a caching element got properly filled, which allows a runtime switch between "do caching" and "use caching". > You need to keep the original loop and use it if calloc fails. The > transformation is not valid in general because calloc is not an > async-signal-safe function, and GCC shouldn't introduce such calls if > they are not present in the original source code. For > -fnon-call-exceptions, you need to introduce unwinding code that calls > free. This is an issue that we missed. The only way left is to use alloca(), but that may not be suitable for large data caching. > Can you change this optimization so that it can use a fixed-size buffer? > This would avoid all issues around calling calloc. If iter_count can be > very large, allocating that much extra memory might not be beneficial > anyway. And we would need additional code to synchronize access to this buffer. Thanks, Feng
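For illustration, the transformed pseudo code can be written out as a hand-coded version of the earlier foo() example (names such as foo_mgo and the demo data are mine; calloc failure handling is omitted here, although, as discussed in the thread, a real transform must keep the original loop as a fallback):

```cpp
#include <cstddef>
#include <cstdlib>

struct A { int **p1; };

// One lazily filled slot per inner-loop index, mirroring the pseudo
// code's cache_elem; init supports the variable inner trip count.
struct cache_elem { bool init; int **c_v1; int *c_v2; int c_v3; };

int foo_mgo (int n, struct A *array)
{
  int sum = 0;
  cache_elem *cache_arr =
      (cache_elem *) calloc (n > 0 ? n : 1, sizeof (cache_elem));

  for (int i = 0; i < n; i++)
    {
      size_t cache_idx = 0;
      for (int j = 0; j < i; j++)
        {
          cache_elem *ce = &cache_arr[cache_idx++];
          if (!ce->init)
            {                        // first touch: do the scattered loads
              ce->c_v1 = array[j].p1;
              ce->c_v2 = *ce->c_v1;
              ce->c_v3 = *ce->c_v2;
              ce->init = true;
            }
          sum += ce->c_v3;           // later outer iterations hit one stream
        }
    }

  free (cache_arr);
  return sum;
}

// Illustrative data for a quick self-check: vals[j] = j + 1.
int foo_mgo_demo ()
{
  int vals[4] = {1, 2, 3, 4};
  int *ptrs[4];
  struct A arr[4];
  for (int k = 0; k < 4; k++)
    {
      ptrs[k] = &vals[k];
      arr[k].p1 = &ptrs[k];
    }
  return foo_mgo (4, arr);          // sums 1 + (1+2) + (1+2+3)
}
```

Note that the cache persists across outer iterations while cache_idx resets, so a shorter inner trip count simply reuses a prefix of the already-filled slots.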
Question about non-POD class type
Sorry, sent to the wrong mailing list. From: Feng Xue OS Sent: Friday, May 14, 2021 4:30 PM To: gcc-patc...@gcc.gnu.org Subject: Question about non-POD class type For an instance of a non-POD class, can I always assume that any operation on it must be type-safe, and that any wrong or even tricky code that violates this is UB per the C++ spec? For example, here are some ways: union { Type1 *p1; Type2 *p2; }; or union { Type1 t1; Type2 t2; }; or Type1 *p1; void *p = p1; Type2 *p2 = (Type2 *) p; p2->xxx; Feng
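For reference, the union and pointer-cast tricks above do violate C++ aliasing and object-lifetime rules for unrelated class types. For trivially copyable types, the standard-sanctioned way to reinterpret an object representation is std::memcpy; a minimal sketch (the float-to-bits helper is just an illustration, not from the thread):

```cpp
#include <cstdint>
#include <cstring>

// Defined byte reinterpretation for trivially copyable types: memcpy
// copies the object representation without violating aliasing rules.
// For non-POD class types, no such reinterpretation is valid at all.
std::uint32_t float_bits (float f)
{
  static_assert (sizeof (float) == sizeof (std::uint32_t),
                 "sketch assumes a 32-bit float");
  std::uint32_t u;
  std::memcpy (&u, &f, sizeof u);
  return u;
}
```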
[RFC] Whole Program Devirtualization
1. Background C++ devirtualization is meant to determine all possible class type identities of a virtual dynamic call, and try to transform it to a direct static call, or a certain kind of statement with lower runtime cost. The current implementation in GCC adopts a somewhat conservative scheme, originating from the safe but defensive assumption that the class hierarchy of a type (except for a FINAL class or a class in an anonymous namespace) is incomplete, and might be extended by an external library or module. Therefore, most qualified virtual calls would be partially devirtualized to a speculative form of direct call as the following, guarded by a function address check and a failsafe branch. if (obj->vfn_ptr == &T::FN) obj->T::FN(); else obj->vfn_ptr(); Here, full devirtualization is required if we want to get vcalls concisely transformed with no guard test. In theory, when whole program compilation is enforced, we could enable such kind of devirtualization based on the closed-world assumption about the contained class types. But "whole program" does not ensure that the class hierarchy of a type never spans into dependent C++ libraries (one is libstdc++), which would result in incorrect devirtualization. An example is given below to demonstrate the problem. // Has been pre-compiled to a library class Base { virtual void method() = 0; }; class Derive_in_Library : public Base { virtual void method() { ... } }; Base *get_object() { return new Derive_in_Library(); } --- // User source code to compile class Base { virtual void method() = 0; }; class Derive : public Base { virtual void method() { ... } }; void foo() { Base *obj = get_object(); obj->method(); } If there is no way to inspect the entire class hierarchy comprising the relevant types in the library and the user source code, whole program analysis would improperly regard 'Derive' as the sole descendant of 'Base', and trigger devirtualization of 'obj->method' to 'Derive::method', which is definitely an unexpected transformation.
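As the RFC notes, a FINAL class is already exempt from the defensive assumption. A small illustration (the types and the return value are invented) of a call that can be fully devirtualized today because the hierarchy is provably closed at that point:

```cpp
// 'final' gives the compiler a closed-world guarantee for this one
// type: no external library can derive from Derive, so the virtual
// call through a Derive* can become a direct call with no guard test.
struct Base
{
  virtual int method () = 0;
  virtual ~Base () {}
};

struct Derive final : Base
{
  int method () override { return 42; }
};

int call_final (Derive *obj)
{
  return obj->method ();   // devirtualizable: Derive has no subclasses
}

int call_final_demo ()
{
  Derive d;
  return call_final (&d);
}
```

WPD aims to extend this kind of guarantee to non-final classes by proving, at link time, that no outside code extends the hierarchy.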
To address this complication, we should figure out those types that are referenced by regular object files and libraries, and exclude them from the candidates for full devirtualization. This RFC will discuss some implementable approaches to identify whether a type is qualified for full devirtualization under the whole program assumption (so also called whole-program devirtualization, WPD for short), and new optimizations inspired by whole program type analysis. 2. Whole-program local type In this RFC, we merely care about polymorphic classes (classes containing virtual functions), since only this kind of type is meaningful to devirtualization. A class type is identified as local in terms of whole-program scope iff there is no instantiation of the type or its derived types in regular object files and libraries to link with. A whole-program local type is safe for WPD. Here are two mechanisms for us to perform this check on a given type: 2.1 Typeinfo symbol resolution by linker plugin Essentially, devirtualization is driven by analysis of vtable lifetime, which begins once the vtable is attached to a new complete object or sub-object during instantiation of a class or its derived class. Seemingly, we could find whether a class type is referenced in an object file or library by tracking the availability of its vtable symbol. But a vtable might be purged while we are still interested in its owning class type. Referring to the library code in the above example, the vtable of 'Base' is unused since it participates in the construction of neither 'Base' nor 'Derive_in_Library', but we must still know whether 'Base' is live in the library to ensure a correct result. Moreover, once a class is virtually inherited, it will have multiple vtables (including construction vtables), but looking up a class via its vtable symbol requires a one-to-one correspondence between them, so that approach does not work. Besides the vtable symbol, class instantiation also creates references to the typeinfo symbols of the class itself and all its parent classes.
At the same time, each class has a unique typeinfo, which can act as a desirable type marker, and be used to perform type lookup in object files and libraries. That said, the approach is not 100% safe, specifically when typeinfo symbols are invisible or missing, for example when the libraries to link against were built with -fno-weak or -fno-rtti. But at least for libstdc++, these symbols should be present for the sake of allowing dynamic_cast in user source code. The LTO linker plugin works with the linker to do whole-program symbol resolution before LTO/WPA happens, and attaches binding information to symbols. Given a typeinfo symbol, if it is resolved as LDPR_PREVAILING_DEF_IRONLY (only referenced from IR code, with no references from regular objects), its corresponding class type is deemed to be whole-program local.
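A small illustration of the typeinfo anchor (the class names are invented; the _ZTI symbol naming follows the Itanium C++ ABI): instantiating a derived class references the typeinfo of the class itself and of its bases, and these are the symbols the proposed linker-plugin resolution would key on.

```cpp
#include <typeinfo>

struct Base { virtual ~Base () {} };
struct Derive : Base {};

// Instantiating Derive emits references to the typeinfo of Derive and
// of its base Base (Itanium ABI symbols such as _ZTI6Derive and
// _ZTI4Base), so a linker plugin can tell whether 'Base' is live in a
// linked object even when its own vtable has been purged.
bool is_derive (const Base &b)
{
  return typeid (b) == typeid (Derive);   // RTTI relies on those symbols
}

bool rtti_demo ()
{
  Derive d;
  Base *p = &d;
  return is_derive (*p);
}
```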
Re: GCC [RFC] Whole Program Devirtualization
We are not going to create a new devirtualization framework from scratch; we just hope it to be an enhancement of the current speculative devirtualization. The process does not need to parse native code in libraries, but only resorts to the existing lightweight symbol resolution by the LTO prelinker. And C++ virtual dispatching is expected to be translated to GIMPLE IR from C++ source; if a user attempts to hand-craft it using embedded asm, that should be considered UB with respect to the C++ ABI. The compile time of whole-program analysis is not as terrible as you think; basically, it is realistically acceptable even when the code base is very large. As far as I know, Google enables WPD when building Chrome, though that is based on LLVM. Thanks, Feng From: Basile Starynkevitch Sent: Friday, August 20, 2021 8:36 PM To: Feng Xue OS Cc: basile.starynkevi...@cea.fr; gcc@gcc.gnu.org Subject: GCC [RFC] Whole Program Devirtualization Hello Feng Xue OS Your project is interesting, but ambitious. I think the major points are: whole program analysis. Static analysis tools like https://frama-c.com/ or https://github.com/bstarynk/bismon/ could be relevant. Projects like https://www.decoder-project.eu/ could be relevant. With cross-compilation, things become harder. Abstract interpretation might be relevant (but difficult and costly to implement). See Wikipedia. size of the whole program which is analyzed. If the entire program (including system libraries like libc) has e.g. less than ten thousand routines and less than a million GIMPLE instructions in total, it makes sense. But if the entire program is as large as the Linux kernel, or the GCC compiler, or the Firefox browser (all have many millions of lines of source code), you probably won't be able to do whole program devirtualization in a few years of human work. computed gotos or labels as values (see https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html for more) make this difficult.
But they do exist, and probably could be hidden in GNU glibc or libstdc++ internal code. asm statements are difficult. They usually appear inside your libc. How would you deal with them? Can you afford a month of computer time to compile a large software with your whole program devirtualizer? In most cases, not, but Pitrat's book Artificial Beings - the conscience of a conscious machine (ISBN 9781848211018) suggests cases where it might make sense (he is explaining a "compiler-like system" which runs for a month of CPU time). My recommendation would be to code first a simple GCC plugin as a proof-of-concept thing, which rejects programs that could not realistically be devirtualized, and otherwise stores somewhere (in some database perhaps) a representation of them. I worked 3 years full time on https://github.com/bstarynk/bismon/ to achieve a similar goal (and I don't claim to have succeeded, and I don't have any more funding). My guess is that some code could be useful to you (then contact me by email both at work basile.starynkevi...@cea.fr<mailto:basile.starynkevi...@cea.fr> and at home bas...@starynkevitch.net<mailto:bas...@starynkevitch.net> ) The most important thing: limit your ambition at first. Write a document (at least an internal one) stating what you won't do. Cheers -- Basile Starynkevitch <mailto:bas...@starynkevitch.net> (only my opinions / les opinions sont miennes uniquement) 92340 Bourg-la-Reine, France web page: starynkevitch.net/Basile/
PING: [RFC] Whole Program Devirtualization
Honza, What do you think about the proposal in this RFC? Thanks a lot. Best Regards, Feng From: Martin Liška Sent: Friday, August 20, 2021 9:45 PM To: Feng Xue OS; gcc@gcc.gnu.org Cc: JiangNing OS; Jan Hubicka Subject: Re: [RFC] Whole Program Devirtualization ... adding Honza to CC who spent quite some time on the devirtualization pass. Martin
PING^2: [RFC] Whole Program Devirtualization
Ping again. Thanks, Feng From: Feng Xue OS Sent: Wednesday, September 1, 2021 9:46 AM To: Martin Liška; Jan Hubicka; gcc@gcc.gnu.org Cc: JiangNing OS Subject: PING: [RFC] Whole Program Devirtualization Honza, What do you think about the proposal in this RFC? Thanks a lot. Best Regards, Feng From: Martin Liška Sent: Friday, August 20, 2021 9:45 PM To: Feng Xue OS; gcc@gcc.gnu.org Cc: JiangNing OS; Jan Hubicka Subject: Re: [RFC] Whole Program Devirtualization ... adding Honza to CC who spent quite some time on the devirtualization pass. Martin