Re: How to extend SLP to support this case
On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>
> Hi all,
>
> I'm investigating whether GCC can vectorize the below case on ppc64le.
>
> extern void test(unsigned int t[4][4]);
>
> void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> {
>   unsigned int tmp[4][4];
>   unsigned int a0, a1, a2, a3;
>
>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>     a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>     a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>     a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>     a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>
>     int t0 = a0 + a1;
>     int t1 = a0 - a1;
>     int t2 = a2 + a3;
>     int t3 = a2 - a3;
>
>     tmp[i][0] = t0 + t2;
>     tmp[i][2] = t0 - t2;
>     tmp[i][1] = t1 + t3;
>     tmp[i][3] = t1 - t3;
>   }
>   test(tmp);
> }
>
> With unlimited costs, I saw loop aware SLP can vectorize it but with very
> inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
> tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0},
> {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
> for tmp* and t*.  It means it's unable to make the isomorphic group for
> a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
> recognize over_widening pattern and do some parallel for two a0 from two
> iterations, but it's still inefficient (high cost).
>
> In this context, it looks better to build {a0, a1, a2, a3} first by
> leveraging isomorphic computation trees constructing them, eg:
>   w1_0123 = load_word(p1)
>   V1_0123 = construct_vec(w1_0123)
>   w1_4567 = load_word(p1 + 4)
>   V1_4567 = construct_vec(w1_4567)
>   w2_0123 = load_word(p2)
>   V2_0123 = construct_vec(w2_0123)
>   w2_4567 = load_word(p2 + 4)
>   V2_4567 = construct_vec(w2_4567)
>   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
>
> But how to teach it to be aware of this?  Currently the processing starts
> from bottom to up (from stores), can we do some analysis on the SLP
> instance, detect some pattern and update the whole instance?

In theory yes (Tamar had something like that for AARCH64 complex
rotations IIRC).  And yes, the issue boils down to how we handle
SLP discovery.  I'd like to improve SLP discovery but it's on my list
only after I managed to get rid of the non-SLP code paths.  I have
played with some ideas (even produced hackish patches) to find
"seeds" to form SLP groups from using multi-level hashing of stmts.

My plan is to rewrite SLP discovery completely, starting from a
SLP graph that 1:1 reflects the SSA use-def graph (no groups
formed yet) and then form groups from seeds and insert
"connection" nodes (merging two subgraphs with less lanes
into one with more lanes, splitting lanes, permuting lanes, etc.).

Currently I'm working on doing exactly this but only for SLP loads
(because that's where it's most difficult...).

> Besides, I also tried whether the standalone SLP pass can handle it,
> it failed to.  It stops at a* due to incompatible vector types.
>
> The optimal vectorized SLP codes can be:
>   // p1 byte 0 to byte 7
>   d1_0_7 = load_dword(p1)
>   // p1+i1 b0 to b7, rename it as 8 to 15
>   d1_8_15 = load_dword(p1 + i1)
>   d1_16_23 = load_dword(p1 + 2*i1)
>   d1_24_31 = load_dword(p1 + 3*i1)
>
>   V_d1_0_15 = construct_vec(d1_0_7, d1_8_15)   // vector char
>   V_d1_16_31 = construct_vec(d1_16_23, d1_24_31)
>   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
>                        {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
>   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
>                        {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
>
>   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
>
>   // Do the subtraction together (all 4x4 bytes)
>   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
>   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
>
>   // Do some unpack and get the promoted vector int
>   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3})  // vector int {b4 b12 b20 b28}
>   V_a0_1 = V_a0_tmp << 16
>   V_a0_0 = vec_promote(V_sub1, {0 1 2 3})    // vector int {b0 b8 b16 b24}
>   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
>   V_a0 = V_a0_0 + V_a0_1
>
>   // Get the similar for V_a1, V_a2, V_a3
>
>   // Compute t0/t1/t2/t3
>   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
>   V_t0 = V_a0 + V_a1
>   V_t1 = V_a0 - V_a1
>   V_t2 = V_a2 + V_a3
>   V_t3 = V_a2 - V_a3
>
>   // Compute tmps
>   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
>   V_tmp0 = V_t0 + V_t2
>   V_tmp2 = V_t0 - V_t2
>   V_tmp1 = V_t1 + V_t3
>   V_tmp3 = V_t1 - V_t3
>
>   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
>   // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3
>
> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
>   tmp[i][0] i=/0,1,2,3/ (one group)
>   tmp[i][1] i=/0,1,2,3/ (one group)
>   tmp[i][2] i=/0,1,2,3/ (one group)
>   tmp[i][3] i=/0,1,2,3/ (one group)
>
> which tmp[i][j] group ha
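For readers who want to try this outside the vectorizer, below is a minimal
standalone C sketch of the same "A-way" layout, written with GCC's generic
vector extensions.  It is illustrative only: the function name foo_sketch is
made up, the gathers are written as scalar element inserts instead of the
construct_vec/vperm sequence in the quoted pseudocode, and arithmetic is done
in unsigned lanes (same bits modulo 2^32 as the original signed expressions).

typedef unsigned int v4si __attribute__ ((vector_size (16)));

extern void test (unsigned int t[4][4]);

void
foo_sketch (unsigned char *p1, int i1, unsigned char *p2, int i2)
{
  unsigned int tmp[4][4];
  v4si va[4];  /* va[j] holds {aj_iter0, aj_iter1, aj_iter2, aj_iter3}.  */

  for (int j = 0; j < 4; j++)
    {
      unsigned char *q1 = p1, *q2 = p2;
      for (int i = 0; i < 4; i++, q1 += i1, q2 += i2)
        va[j][i] = (unsigned int) (q1[j] - q2[j])
                   + ((unsigned int) (q1[j + 4] - q2[j + 4]) << 16);
    }

  /* Two levels of whole-vector add/sub butterflies; each lane is one
     original loop iteration, so no cross-lane permutes are needed here.  */
  v4si t0 = va[0] + va[1], t1 = va[0] - va[1];
  v4si t2 = va[2] + va[3], t3 = va[2] - va[3];
  v4si r0 = t0 + t2, r2 = t0 - t2;
  v4si r1 = t1 + t3, r3 = t1 - t3;

  /* Transpose back into tmp[i][j]; the pseudocode above does this with the
     six final permutes on V_tmp0..V_tmp3.  */
  for (int i = 0; i < 4; i++)
    {
      tmp[i][0] = r0[i];
      tmp[i][1] = r1[i];
      tmp[i][2] = r2[i];
      tmp[i][3] = r3[i];
    }
  test (tmp);
}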
Re: How to extend SLP to support this case
On Tue, Mar 10, 2020 at 12:12 PM Richard Biener wrote:
>
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> > extern void test(unsigned int t[4][4]);
> >
> > void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> > {
> >   unsigned int tmp[4][4];
> >   unsigned int a0, a1, a2, a3;
> >
> >   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >     a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >     a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >     a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >     a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >     int t0 = a0 + a1;
> >     int t1 = a0 - a1;
> >     int t2 = a2 + a3;
> >     int t3 = a2 - a3;
> >
> >     tmp[i][0] = t0 + t2;
> >     tmp[i][2] = t0 - t2;
> >     tmp[i][1] = t1 + t3;
> >     tmp[i][3] = t1 - t3;
> >   }
> >   test(tmp);
> > }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with very
> > inefficient codes.  It builds the SLP instance from store group {tmp[i][0]
> > tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0, a0},
> > {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing operands
> > for tmp* and t*.  It means it's unable to make the isomorphic group for
> > a0, a1, a2, a3, although they appears isomorphic to merge.  Even if it can
> > recognize over_widening pattern and do some parallel for two a0 from two
> > iterations, but it's still inefficient (high cost).
> >
> > In this context, it looks better to build {a0, a1, a2, a3} first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this?  Currently the processing starts
> > from bottom to up (from stores), can we do some analysis on the SLP
> > instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
>
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
>
> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).
>
> > Besides, I also tried whether the standalone SLP pass can handle it,
> > it failed to.  It stops at a* due to incompatible vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
> >   d1_16_23 = load_dword(p1 + 2*i1)
> >   d1_24_31 = load_dword(p1 + 3*i1)
> >
> >   V_d1_0_15 = construct_vec(d1_0_7, d1_8_15)   // vector char
> >   V_d1_16_31 = construct_vec(d1_16_23, d1_24_31)
> >   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15,
> >                        {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
> >   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15,
> >                        {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
> >
> >   // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all
> >
> >   // Do the subtraction together (all 4x4 bytes)
> >   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
> >   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
> >
> >   // Do some unpack and get the promoted vector int
> >   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3})  // vector int {b4 b12 b20 b28}
> >   V_a0_1 = V_a0_tmp << 16
> >   V_a0_0 = vec_promote(V_sub1, {0 1 2 3})    // vector int {b0 b8 b16 b24}
> >   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
> >   V_a0 = V_a0_0 + V_a0_1
> >
> >   // Get the similar for V_a1, V_a2, V_a3
> >
> >   // Compute t0/t1/t2/t3
> >   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
> >   V_t0 = V_a0 + V_a1
> >   V_t1 = V_a0 - V_a1
> >   V_t2 = V_a2 + V_a3
> >   V_t3 = V_a2 - V_a3
> >
> >   // Compute tmps
> >   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
> >   V_tmp0 = V_t0 + V_t2
> >   V_tmp2 = V_t0 - V_t2
> >   V_tmp1 = V_t1 + V_t3
> >   V_tmp3 = V_t1 - V_t3
> >
> >   // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
> >   // with six further permutat
RE: How to extend SLP to support this case
> -----Original Message-----
> From: Gcc  On Behalf Of Richard Biener
> Sent: Tuesday, March 10, 2020 11:12 AM
> To: Kewen.Lin
> Cc: GCC Development ; Segher Boessenkool
> Subject: Re: How to extend SLP to support this case
>
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> > extern void test(unsigned int t[4][4]);
> >
> > void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> > {
> >   unsigned int tmp[4][4];
> >   unsigned int a0, a1, a2, a3;
> >
> >   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >     a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >     a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >     a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >     a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >     int t0 = a0 + a1;
> >     int t1 = a0 - a1;
> >     int t2 = a2 + a3;
> >     int t3 = a2 - a3;
> >
> >     tmp[i][0] = t0 + t2;
> >     tmp[i][2] = t0 - t2;
> >     tmp[i][1] = t1 + t3;
> >     tmp[i][3] = t1 - t3;
> >   }
> >   test(tmp);
> > }
> >
> > With unlimited costs, I saw loop aware SLP can vectorize it but with
> > very inefficient codes.  It builds the SLP instance from store group
> > {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, builds nodes {a0, a0, a0,
> > a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after
> > parsing operands for tmp* and t*.  It means it's unable to make the
> > isomorphic group for a0, a1, a2, a3, although they appears isomorphic
> > to merge.  Even if it can recognize over_widening pattern and do some
> > parallel for two a0 from two iterations, but it's still inefficient (high
> > cost).
> >
> > In this context, it looks better to build {a0, a1, a2, a3} first by
> > leveraging isomorphic computation trees constructing them, eg:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this?  Currently the processing
> > starts from bottom to up (from stores), can we do some analysis on the
> > SLP instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex rotations
> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd
> like to improve SLP discovery but it's on my list only after I managed to
> get rid of the non-SLP code paths.  I have played with some ideas (even
> produced hackish patches) to find "seeds" to form SLP groups from using
> multi-level hashing of stmts.

I still have this but missed the stage-1 deadline after doing the rewriting
to C++ 😊

We've also been looking at this and the approach I'm investigating now is
trying to get the SLP codepath to handle this after it's been fully
unrolled.  I'm looking into whether the build-slp can be improved to work
for the group size == 16 case that it tries but fails on.

My intention is to see if doing so would make it simpler to recognize this
as just 4 linear loads and two permutes.  I think the loop aware SLP will
have a much harder time with this seeing the load permutations it thinks it
needs because of the permutes caused by the +/- pattern.

One Idea I had before was from your comment on the complex number patch,
which is to try and move up TWO_OPERATORS and undo the permute always when
doing +/-.  This would simplify the load permute handling and if a target
doesn't have an instruction to support this it would just fall back to
doing an explicit permute after the loads.  But I wasn't sure this approach
would get me the results I wanted.

In the end you don't want a loop here at all.  And in order to do the above
with TWO_OPERATORS I would have to let the SLP pattern matcher be able to
reduce the group size and increase the no# iterations during the matching,
otherwise the matching itself becomes quite difficult in certain cases..

Tamar

> My plan is to rewrite SLP discovery completely, starting from a SLP graph
> that 1:1 reflects the SSA use-def graph (no groups formed yet) and then
> form groups from seeds and insert "connection" nodes (merging two
> subgraphs with less lanes into one with more lanes, splitting lanes,
> permuting lanes, etc.).
>
> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).
>
> > Besides, I also tried whether the standalone SLP pass can handle it,
> > it failed to.  It stops at a* due to incompatible vector types.
> >
> > The optimal vectorized SLP codes can be:
> >   // p1 byte 0 to byte 7
> >   d1_0_7 = load_dword(p1)
> >   // p1+i1 b0 to b7, rename it as 8 to 15
> >   d1_8_15 = load_dword(p1 + i1)
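To make the TWO_OPERATORS point concrete: the usual lowering of an
interleaved +/- lane group is, roughly, one vector add, one vector sub, and
a blend, as in the hedged sketch below (GCC generic vector extensions,
made-up helper name).  With a = {a0, a0, a2, a2} and b = {a1, a1, a3, a3} it
produces {t0, t1, t2, t3} = {a0+a1, a0-a1, a2+a3, a2-a3}; that blend is the
permute the idea above would try to push upwards or undo.

typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* Blend of a vector add and a vector sub: even lanes take sums, odd lanes
   take differences.  Shuffle indices 4-7 select from the second operand.  */
static inline v4si
addsub_blend (v4si a, v4si b)
{
  v4si sum = a + b;
  v4si dif = a - b;
  return __builtin_shuffle (sum, dif, (v4si) { 0, 5, 2, 7 });
}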
Mailman defaults to duplicate suppression
It seems that the Mailman migration has turned on duplicate suppression: if a recipient address is mentioned in the message headers, Mailman will not distribute the message to that recipient address, under the assumption that the message will make it to the recipient directly from the original poster. This breaks threads in local mailing list archives. (It's not visible with Gmail because of its automated message disposal feature.) It's possible to turn this off in the Mailman settings, but I think the default will be applied to all new subscriptions, so this is a bit annoying because you need to remember to disable it again.
Re: How do I run SIMD Testcases on PPC64?
‐‐‐ Original Message ‐‐‐
On Thursday, March 5, 2020 6:59 PM, Segher Boessenkool wrote:

> On Thu, Mar 05, 2020 at 05:04:16PM +, GT wrote:
> > 2. Multiple other testcases in testsuite/gcc.dg/vect/ have this line at
> > the top:
> > /* { dg-additional-options "-mavx" { target avx_runtime } } */
> > An example is vect-simd-16.c
> >
> > 2.1 Should not these testcases be in directory testsuite/gcc.target/i386/ ?
> > 2.2 To run vect-simd-16.c on PPC64, is it enough to put a copy of the file
> > in testsuite/gcc.target/powerpc/ ?  (After modifying target appropriately.)
>
> It certainly would be nice if generic tests did not often have target-
> specific stuff in them.  In this case, it could be hidden in vect.exp,
> perhaps?

I got all the tests from RUNTESTFLAGS="vect.exp=*simd*" to PASS.  The ICE
errors I was getting came from the simdlen clauses of #pragma omp declare
simd directives.  The vectorization code currently assumes that simdlen(n)
is always valid and does not check whether the hardware supports vectors
with n lanes.  The tests assume a 256-bit wide vector unit; VSX is 128 bits
wide, hence the failures.

We need to selectively change the simdlen clauses depending on the target
the tests are being run on.  A typical test which needs this is
gcc/testsuite/gcc.dg/vect/vect-simd-clone-1.c.  Should I conditionally
select the #pragma omp directives using #ifdefs in the source files, or is
there a DejaGnu feature which is preferred?

Bert.
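For what it's worth, the #ifdef route could look roughly like the fragment
below.  It is only a sketch: the __AVX__ test and the 8-versus-4 lane split
are placeholder assumptions standing in for whatever target selection
(preprocessor macro or DejaGnu-provided define) the testsuite maintainers
would prefer.

/* Hypothetical test fragment: pick a simdlen matching the target's vector
   width (8 ints per 256-bit AVX vector, 4 per 128-bit VSX vector).  */
#if defined (__AVX__)
#pragma omp declare simd simdlen(8)
#else
#pragma omp declare simd simdlen(4)
#endif
int
add_one (int x)
{
  return x + 1;
}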
Re: How to extend SLP to support this case
Hi Richi,

on 2020/3/10 下午7:12, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>
>> Hi all,
>>
>> But how to teach it to be aware of this?  Currently the processing starts
>> from bottom to up (from stores), can we do some analysis on the SLP
>> instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
>

Nice!  Thanks for the information!  This improvement sounds big and
promising!  If we can discover SLP opportunities originating from loads,
this case isn't a big deal then.

> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).

This case looks like it could be an input for your work, since the
isomorphic computations are easy to detect from the loads.

>> A-way requires some additional vector permutations.  However, I thought
>> if the existing scheme can't get any SLP chances, it looks reasonable to
>> extend it to consider this A-way grouping.  Does it make sense?
>>
>> Another question is that even if we can go with A-way grouping, it can
>> only handle packing one byte (four iteration -> 4), what place is
>> reasonable to extend it to pack more?  How about scaning all leaf
>> nodes and consider/transform them together?  too hacky?
>
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big.  But surely concentrating on
> load/store groups is good.
>

Totally agreed.  This hacky idea originated from the existing code; if SLP
discovery improves, I think it's useless then.

> The SLP discovery code is already quite complicated btw., I'd
> hate to add "unstructured" hacks ontop of it right now without
> future design goals in mind.

OK.  Looking forward to its landing!

BR,
Kewen
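As a source-level picture of the A-way grouping discussed here (a
hypothetical fragment, not a proposed transformation): the groups tmp[i][j]
with i=/0,1,2,3/ and fixed j are strided in tmp itself, which is why the
store-group-driven discovery cannot form them, but the same lanes written
into a transposed temporary become one contiguous vector store per group,
i.e. the V_tmp0..V_tmp3 rows of the optimal sequence before its final
permutes.

#include <string.h>

typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* tmp_t[j][i] corresponds to tmp[i][j]; row j is the A-way group
   {tmp[0][j], tmp[1][j], tmp[2][j], tmp[3][j]} as one 16-byte store.  */
static void
store_a_way (unsigned int tmp_t[4][4], v4si r0, v4si r1, v4si r2, v4si r3)
{
  memcpy (tmp_t[0], &r0, sizeof r0);
  memcpy (tmp_t[1], &r1, sizeof r1);
  memcpy (tmp_t[2], &r2, sizeof r2);
  memcpy (tmp_t[3], &r3, sizeof r3);
}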
Re: How to extend SLP to support this case
Hi Tamar,

on 2020/3/10 下午7:31, Tamar Christina wrote:
>
>> -----Original Message-----
>> From: Gcc  On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin
>> Cc: GCC Development ; Segher Boessenkool
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this?  Currently the processing
>>> starts from bottom to up (from stores), can we do some analysis on the
>>> SLP instance, detect some pattern and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AARCH64 complex rotations
>> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd
>> like to improve SLP discovery but it's on my list only after I managed to
>> get rid of the non-SLP code paths.  I have played with some ideas (even
>> produced hackish patches) to find "seeds" to form SLP groups from using
>> multi-level hashing of stmts.
>
> I still have this but missed the stage-1 deadline after doing the rewriting
> to C++ 😊
>
> We've also been looking at this and the approach I'm investigating now is
> trying to get the SLP codepath to handle this after it's been fully
> unrolled.  I'm looking into whether the build-slp can be improved to work
> for the group size == 16 case that it tries but fails on.
>

Thanks!  Glad to know you have been working on this!  Yes, I saw the
standalone SLP pass finally split the group (16 store stmts).

> My intention is to see if doing so would make it simpler to recognize this
> as just 4 linear loads and two permutes.  I think the loop aware SLP will
> have a much harder time with this seeing the load permutations it thinks it
> needs because of the permutes caused by the +/- pattern.

I may be missing something, but just to double confirm: do you mean making
it 4 linear loads for either of p1/p2?  Since in the optimal vectorized
version, p1 and p2 each have 4 separate loads and a construction, then
further permutations.

> One Idea I had before was from your comment on the complex number patch,
> which is to try and move up TWO_OPERATORS and undo the permute always when
> doing +/-.  This would simplify the load permute handling and if a target
> doesn't have an instruction to support this it would just fall back to
> doing an explicit permute after the loads.  But I wasn't sure this approach
> would get me the results I wanted.

IIUC, we have to seek for either ... or ..., since either can leverage the
isomorphic byte loads, subtraction, shift and addition.  I was thinking that
the SLP pattern matcher could detect the pattern with two levels of
TWO_OPERATORS, one level with t/0,1,2,3/, the other with a/0,1,2,3/, as well
as the dependent isomorphic computations for a/0,1,2,3/, and transform it
into isomorphic subtraction, int promotion shift and addition.

> In the end you don't want a loop here at all.  And in order to do the above
> with TWO_OPERATORS I would have to let the SLP pattern matcher be able to
> reduce the group size and increase the no# iterations during the matching,
> otherwise the matching itself becomes quite difficult in certain cases..
>

OK, it sounds like it's unable to get the optimal one, which requires all 16
bytes (0-3 or 4-7 x 4 iterations).

BR,
Kewen
Re: How to extend SLP to support this case
Hi Richi,

on 2020/3/10 下午7:14, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 12:12 PM Richard Biener wrote:
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>>
>>> Hi all,
>>>
>>> I'm investigating whether GCC can vectorize the below case on ppc64le.
>>>
>>> extern void test(unsigned int t[4][4]);
>>>
>>> void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>>> {
>>>   unsigned int tmp[4][4];
>>>   unsigned int a0, a1, a2, a3;
>>>
>>>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>>>     a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>>>     a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>>>     a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>>>     a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>>>
>>>     int t0 = a0 + a1;
>>>     int t1 = a0 - a1;
>>>     int t2 = a2 + a3;
>>>     int t3 = a2 - a3;
>>>
>>>     tmp[i][0] = t0 + t2;
>>>     tmp[i][2] = t0 - t2;
>>>     tmp[i][1] = t1 + t3;
>>>     tmp[i][3] = t1 - t3;
>>>   }
>>>   test(tmp);
>>> }
>>>
>>> ...
>>>
>>> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, eg:
>>>   tmp[i][0] i=/0,1,2,3/ (one group)
>>>   tmp[i][1] i=/0,1,2,3/ (one group)
>>>   tmp[i][2] i=/0,1,2,3/ (one group)
>>>   tmp[i][3] i=/0,1,2,3/ (one group)
>>>
>>> which tmp[i][j] group have the same isomorphic computations.  But currently
>>> SLP is unable to divide group like this way.  (call it as A-way for now)
>>>
>>> It's understandable since it has better adjacent store groups like,
>>>   tmp[0][i] i=/0,1,2,3/ (one group)
>>>   tmp[1][i] i=/0,1,2,3/ (one group)
>>>   tmp[2][i] i=/0,1,2,3/ (one group)
>>>   tmp[3][i] i=/0,1,2,3/ (one group)
>
> Note this is how the non-SLP path will (try to) vectorize the loop.
>

Oops, sorry for the confusion caused by my poor wording.  The above was
intended to show how the current SLP groups those 16 stores tmp[i][j]
i,j=/0,1,2,3/ when the loop is completely unrolled.  I saw it finally split
the 16 stmts into 4 groups this way.

BR,
Kewen