Re: outdoor LED

2020-03-10 Thread Cathy
Hi, my friend,

Thanks for reading. Have a nice day!

This is Matt from Sontec Lighting; we have 20+ years of experience in LED.

WHAT WE DO:
OEM LED lighting, focused on LED high-bay, shoebox, wall pack, and parking
lot LED lights.

How I can help you:
1. Increase your sales turnover through reasonable prices and better quality.
2. Help you build a strong product line to pull ahead of your competition.
3. A special 5-day delivery program on customized orders.


Why I can help you:
★ 20 years working with overseas buyers in international business,
including top LED lighting brands worldwide.
★ Special customized orders with fast delivery.
★ An 86,000 sq ft production base with 100+ workers.


Please contact us now for details.

With best regards,
Cathy
China: 86-13822349468
Your best China OEM supplier
Sontec LED Lighting


Re: How to extend SLP to support this case

2020-03-10 Thread Richard Biener
On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin  wrote:
>
> Hi all,
>
> I'm investigating whether GCC can vectorize the below case on ppc64le.
>
>   extern void test(unsigned int t[4][4]);
>
>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>   {
>     unsigned int tmp[4][4];
>     unsigned int a0, a1, a2, a3;
>
>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>
>       int t0 = a0 + a1;
>       int t1 = a0 - a1;
>       int t2 = a2 + a3;
>       int t3 = a2 - a3;
>
>       tmp[i][0] = t0 + t2;
>       tmp[i][2] = t0 - t2;
>       tmp[i][1] = t1 + t3;
>       tmp[i][3] = t1 - t3;
>     }
>     test(tmp);
>   }
>
> With unlimited costs, I saw loop-aware SLP can vectorize it, but with very
> inefficient code.  It builds the SLP instance from the store group {tmp[i][0]
> tmp[i][1] tmp[i][2] tmp[i][3]}, and builds nodes {a0, a0, a0, a0},
> {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after parsing the
> operands for tmp* and t*.  That means it's unable to make an isomorphic group
> for a0, a1, a2, a3, although they appear isomorphic and mergeable.  Even if
> it can recognize the over_widening pattern and process two a0 values from two
> iterations in parallel, it's still inefficient (high cost).
>
> In this context, it looks better to build <a0, a1, a2, a3> first by
> leveraging the isomorphic computation trees that construct them, e.g.:
>   w1_0123 = load_word(p1)
>   V1_0123 = construct_vec(w1_0123)
>   w1_4567 = load_word(p1 + 4)
>   V1_4567 = construct_vec(w1_4567)
>   w2_0123 = load_word(p2)
>   V2_0123 = construct_vec(w2_0123)
>   w2_4567 = load_word(p2 + 4)
>   V2_4567 = construct_vec(w2_4567)
>   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
>
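For illustration, building <a0, a1, a2, a3> per iteration might look like the
sketch below, written with GCC's generic vector extensions (the v4si type and
the a_vec helper are invented for the example; this is not what the vectorizer
emits):

  typedef unsigned int v4si __attribute__ ((vector_size (16)));

  /* Sketch: compute {a0, a1, a2, a3} of one iteration as a single
     vector.  The byte loads are modeled as scalar accesses whose
     values are widened into 32-bit lanes.  */
  static inline v4si
  a_vec (const unsigned char *p1, const unsigned char *p2)
  {
    v4si lo1 = { p1[0], p1[1], p1[2], p1[3] };  /* w1_0123, widened */
    v4si hi1 = { p1[4], p1[5], p1[6], p1[7] };  /* w1_4567, widened */
    v4si lo2 = { p2[0], p2[1], p2[2], p2[3] };  /* w2_0123, widened */
    v4si hi2 = { p2[4], p2[5], p2[6], p2[7] };  /* w2_4567, widened */
    return (lo1 - lo2) + ((hi1 - hi2) << 16);   /* V_a0123 */
  }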
> But how to teach it to be aware of this?  Currently the processing starts
> from the bottom up (from stores); can we do some analysis on the SLP
> instance, detect some pattern, and update the whole instance?

In theory yes (Tamar had something like that for AArch64 complex
rotations IIRC).  And yes, the issue boils down to how we handle
SLP discovery.  I'd like to improve SLP discovery, but it's on my list
only after I manage to get rid of the non-SLP code paths.  I have
played with some ideas (even produced hackish patches) to find
"seeds" from which to form SLP groups, using multi-level hashing of stmts.
My plan is to rewrite SLP discovery completely, starting from an
SLP graph that 1:1 reflects the SSA use-def graph (no groups
formed yet) and then forming groups from seeds and inserting
"connection" nodes (merging two subgraphs with fewer lanes
into one with more lanes, splitting lanes, permuting lanes, etc.).

Currently I'm working on doing exactly this but only for SLP loads
(because that's where it's most difficult...).
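(As a toy model of the multi-level hashing of stmts mentioned above — the data
structures and the constant are invented for illustration and are not GCC
internals:)

  #include <stddef.h>

  /* Toy statement: an opcode plus the defining statements of its
     operands.  */
  struct stmt
  {
    int opcode;            /* e.g. a code for MINUS or LSHIFT */
    struct stmt *ops[2];   /* operand definitions, NULL for leaves */
  };

  /* Level 0 buckets statements by opcode alone; each extra level
     mixes in the operands' hashes, so only increasingly isomorphic
     computation trees keep colliding.  Statements that still collide
     at a deep level are candidate "seeds" for an SLP group.  */
  static size_t
  stmt_hash (const struct stmt *s, int level)
  {
    size_t h = (size_t) s->opcode;
    if (level > 0)
      for (int i = 0; i < 2; i++)
        if (s->ops[i])
          h = h * 131 + stmt_hash (s->ops[i], level - 1);
    return h;
  }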

> Besides, I also tried the standalone SLP pass to see whether it can handle
> it; it failed, stopping at a* due to incompatible vector types.
>
> The optimal vectorized SLP code could be:
>   // p1 byte 0 to byte 7
>   d1_0_7 = load_dword(p1)
>   // p1+i1 b0 to b7, rename it as 8 to 15
>   d1_8_15 = load_dword(p1 + i1)
>   d1_16_23 = load_dword(p1 + 2*i1)
>   d1_24_31 = load_dword(p1 + 3*i1)
>
>   V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
>   V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
>   V_d1_0_3_all = vperm(V_d1_0_15, V_d1_16_31,
>   {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
>   V_d1_4_7_all = vperm(V_d1_0_15, V_d1_16_31,
>   {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})
>
>   // Do the same for p2 with i2 to get V_d2_0_3_all, V_d2_4_7_all
>
>   // Do the subtraction together (all 4x4 bytes)
>   V_sub1 = V_d1_0_3_all - V_d2_0_3_all
>   V_sub2 = V_d1_4_7_all - V_d2_4_7_all
>
>   // Do some unpack and get the promoted vector int
>   V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
>   V_a0_1 = V_a0_tmp << 16
>   V_a0_0 = vec_promote(V_sub1, {0 1 2 3})  // vector int {b0 b8 b16 b24}
>   // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
>   V_a0 = V_a0_0 + V_a0_1
>
>   // Do the same to get V_a1, V_a2, V_a3
>
>   // Compute t0/t1/t2/t3
>   // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
>   V_t0 = V_a0 + V_a1
>   V_t1 = V_a0 - V_a1
>   V_t2 = V_a2 + V_a3
>   V_t3 = V_a2 - V_a3
>
>   // Compute tmps
>   // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
>   V_tmp0 = V_t0 + V_t2
>   V_tmp2 = V_t0 - V_t2
>   V_tmp1 = V_t1 + V_t3
>   V_tmp3 = V_t1 - V_t3
>
>   // Finally, construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
>   // with six further permutations on V_tmp0/V_tmp1/V_tmp2/V_tmp3
>
> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together, e.g.:
>   tmp[i][0] i=/0,1,2,3/ (one group)
>   tmp[i][1] i=/0,1,2,3/ (one group)
>   tmp[i][2] i=/0,1,2,3/ (one group)
>   tmp[i][3] i=/0,1,2,3/ (one group)
>
> whose tmp[i][j] groups have the same isomorphic computations.  But currently
> SLP is unable to divide groups this way (call it A-way for now).
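(For concreteness, the byte-transpose vperm step above could be written with
GCC's __builtin_shuffle standing in for the target permute — a hedged sketch,
with the function name invented:)

  typedef unsigned char v16qi __attribute__ ((vector_size (16)));

  /* rows01 holds rows 0-1 (bytes 0-15 in the numbering above),
     rows23 holds rows 2-3 (bytes 16-31).  The mask gathers byte j of
     every row for j = 0..3; the bytes 4..7 variant uses the mask
     {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31}.  */
  static inline v16qi
  bytes_0_3_of_each_row (v16qi rows01, v16qi rows23)
  {
    return __builtin_shuffle (rows01, rows23,
      (v16qi) { 0, 8, 16, 24, 1, 9, 17, 25, 2, 10, 18, 26, 3, 11, 19, 27 });
  }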


RE: How to extend SLP to support this case

2020-03-10 Thread Tamar Christina

> -----Original Message-----
> From: Gcc  On Behalf Of Richard Biener
> Sent: Tuesday, March 10, 2020 11:12 AM
> To: Kewen.Lin 
> Cc: GCC Development ; Segher Boessenkool
> 
> Subject: Re: How to extend SLP to support this case
> 
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin  wrote:
> >
> > Hi all,
> >
> > I'm investigating whether GCC can vectorize the below case on ppc64le.
> >
> >   extern void test(unsigned int t[4][4]);
> >
> >   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
> >   {
> >     unsigned int tmp[4][4];
> >     unsigned int a0, a1, a2, a3;
> >
> >     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
> >       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
> >       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
> >       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
> >       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
> >
> >       int t0 = a0 + a1;
> >       int t1 = a0 - a1;
> >       int t2 = a2 + a3;
> >       int t3 = a2 - a3;
> >
> >       tmp[i][0] = t0 + t2;
> >       tmp[i][2] = t0 - t2;
> >       tmp[i][1] = t1 + t3;
> >       tmp[i][3] = t1 - t3;
> >     }
> >     test(tmp);
> >   }
> >
> > With unlimited costs, I saw loop-aware SLP can vectorize it, but with
> > very inefficient code.  It builds the SLP instance from the store group
> > {tmp[i][0] tmp[i][1] tmp[i][2] tmp[i][3]}, and builds nodes {a0, a0, a0,
> > a0}, {a1, a1, a1, a1}, {a2, a2, a2, a2}, {a3, a3, a3, a3} after
> > parsing the operands for tmp* and t*.  That means it's unable to make an
> > isomorphic group for a0, a1, a2, a3, although they appear isomorphic
> > and mergeable.  Even if it can recognize the over_widening pattern and
> > process two a0 values from two iterations in parallel, it's still
> > inefficient (high cost).
> >
> > In this context, it looks better to build <a0, a1, a2, a3> first by
> > leveraging the isomorphic computation trees that construct them, e.g.:
> >   w1_0123 = load_word(p1)
> >   V1_0123 = construct_vec(w1_0123)
> >   w1_4567 = load_word(p1 + 4)
> >   V1_4567 = construct_vec(w1_4567)
> >   w2_0123 = load_word(p2)
> >   V2_0123 = construct_vec(w2_0123)
> >   w2_4567 = load_word(p2 + 4)
> >   V2_4567 = construct_vec(w2_4567)
> >   V_a0123 = (V1_0123 - V2_0123) + (V1_4567 - V2_4567)<<16
> >
> > But how to teach it to be aware of this?  Currently the processing
> > starts from the bottom up (from stores); can we do some analysis on the
> > SLP instance, detect some pattern, and update the whole instance?
> 
> In theory yes (Tamar had something like that for AArch64 complex rotations
> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd
> like to improve SLP discovery, but it's on my list only after I manage to
> get rid of the non-SLP code paths.  I have played with some ideas (even
> produced hackish patches) to find "seeds" from which to form SLP groups,
> using multi-level hashing of stmts.

I still have this but missed the stage-1 deadline after doing the rewriting to
C++ 😊

We've also been looking at this, and the approach I'm investigating now is
trying to get the SLP codepath to handle this after it's been fully unrolled.
I'm looking into whether build-slp can be improved to work for the
group size == 16 case that it tries but fails on.

My intention is to see if doing so would make it simpler to recognize this as
just 4 linear loads and two permutes.  I think the loop-aware SLP will have a
much harder time with this, given the load permutations it thinks it needs
because of the permutes caused by the +/- pattern.

One idea I had before, from your comment on the complex number patch, is to
try to move up TWO_OPERATORS and always undo the permute when doing +/-.
This would simplify the load permute handling, and if a target doesn't have an
instruction to support this, it would just fall back to doing an explicit
permute after the loads.  But I wasn't sure this approach would get me the
results I wanted.
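To make the permutes concrete, here is a hedged sketch (GCC generic vector
extensions; the helper name and lane layout are invented, and this is not the
vectorizer's actual codegen) of the +/- (TWO_OPERATORS) pattern as one vector
add, one vector subtract, and the shuffles it induces:

  typedef int v4si __attribute__ ((vector_size (16)));

  /* {t0, t1, t2, t3} = {a0+a1, a0-a1, a2+a3, a2-a3}: a single add
     and sub on permuted operands, then one blend to interleave the
     sum and difference lanes.  */
  static inline v4si
  addsub_pairs (v4si a)   /* a = {a0, a1, a2, a3} */
  {
    v4si even = __builtin_shuffle (a, (v4si) { 0, 0, 2, 2 });
    v4si odd  = __builtin_shuffle (a, (v4si) { 1, 1, 3, 3 });
    v4si sum = even + odd;    /* {a0+a1, _, a2+a3, _} */
    v4si dif = even - odd;    /* {_, a0-a1, _, a2-a3} */
    return __builtin_shuffle (sum, dif, (v4si) { 0, 5, 2, 7 });
  }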

In the end you don't want a loop here at all.  And in order to do the above
with TWO_OPERATORS I would have to let the SLP pattern matcher reduce the
group size and increase the number of iterations during the matching;
otherwise the matching itself becomes quite difficult in certain cases.

Tamar


Mailman defaults to duplicate suppression

2020-03-10 Thread Florian Weimer
It seems that the Mailman migration has turned on duplicate
suppression: if a recipient address is mentioned in the message
headers, Mailman will not distribute the message to that recipient
address, under the assumption that the message will make it to the
recipient directly from the original poster.

This breaks threads in local mailing list archives.  (It's not
visible with Gmail because of its automated message disposal
feature.)

It's possible to turn this off in the Mailman settings, but I think
the default will be applied to all new subscriptions, so this is a bit
annoying because you need to remember to disable it again.


Your new office

2020-03-10 Thread FS Business Centers
All services included


** If you can't see this email



** click here (https://coworking.fsbusinesscenters.com/)

https://coworking.fsbusinesscenters.com/
Tel. (81) 8000-1400
WhatsApp. 812-724-2310 
(https://api.whatsapp.com/send?phone=528127242310&text=Hola,%20solicito%20información%20de%20sus%20oficinas.)
ofici...@fsbusinesscenters.com (mailto:infoofici...@fsbc.mx)
www.fsbusinesscenters.com (https://coworking.fsbusinesscenters.com/)

For more information, click here (https://coworking.fsbusinesscenters.com/)


Re: How do I run SIMD Testcases on PPC64?

2020-03-10 Thread GT via Gcc
‐‐‐ Original Message ‐‐‐
On Thursday, March 5, 2020 6:59 PM, Segher Boessenkool 
 wrote:

> On Thu, Mar 05, 2020 at 05:04:16PM +, GT wrote:
>
> > 2.  Multiple other testcases in testsuite/gcc.dg/vect/ have this line at 
> > the top:
> > /* { dg-additional-options "-mavx" { target avx_runtime } } */
> > An example is vect-simd-16.c
> >
> >
> > 2.1 Should not these testcases be in directory testsuite/gcc.target/i386/ ?
> > 2.2 To run vect-simd-16.c on PPC64, is it enough to put a copy of the file
> > in testsuite/gcc.target/powerpc/ ? (After modifying target appropriately.)
>

>
> It certainly would be nice if generic tests did not often have target-
> specific stuff in them. In this case, it could be hidden in vect.exp,
> perhaps?

I got all the tests from RUNTESTFLAGS="vect.exp=*simd*" to PASS.  The ICE
errors I was getting came from simdlen clauses of #pragma omp declare simd
directives.  The vectorization code currently assumes that simdlen(n) is
always valid and does not check whether the hardware supports vectors with n
elements.  The tests assume a 256-bit wide vector unit (e.g. simdlen(8) with
32-bit elements implies a 256-bit vector); VSX width is 128-bit, hence the
failures.

We need to selectively change the simdlen clauses depending on the target on
which the tests are being run.  The typical test which needs this feature is
gcc/testsuite/gcc.dg/vect/vect-simd-clone-1.c.  Should I conditionally select
#pragma omp directives using #ifdefs in the source files, or is there a
preferred DejaGnu feature?
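For what it's worth, a minimal sketch of the #ifdef option (__AVX__, __VSX__
and __ALTIVEC__ are real GCC target macros, but the chosen simdlen values are
only illustrative; OpenMP specifies that tokens following #pragma omp are
subject to macro replacement, and the pragma takes effect with -fopenmp or
-fopenmp-simd):

  /* Pick a simdlen the target's vector width can honor.  */
  #if defined (__AVX__)
  #define SIMDLEN 8   /* 256-bit unit: 8 x 32-bit lanes */
  #elif defined (__VSX__) || defined (__ALTIVEC__)
  #define SIMDLEN 4   /* 128-bit unit: 4 x 32-bit lanes */
  #else
  #define SIMDLEN 2
  #endif

  #pragma omp declare simd simdlen(SIMDLEN) notinbranch
  int f (int x) { return x + 1; }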

Bert.


Re: How to extend SLP to support this case

2020-03-10 Thread Kewen.Lin via Gcc
Hi Richi,

on 2020/3/10 7:12 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin  wrote:
>>
>> Hi all,
>>
>> But how to teach it to be aware of this?  Currently the processing starts
>> from the bottom up (from stores); can we do some analysis on the SLP
>> instance, detect some pattern, and update the whole instance?
> 
> In theory yes (Tamar had something like that for AArch64 complex
> rotations IIRC).  And yes, the issue boils down to how we handle
> SLP discovery.  I'd like to improve SLP discovery, but it's on my list
> only after I manage to get rid of the non-SLP code paths.  I have
> played with some ideas (even produced hackish patches) to find
> "seeds" from which to form SLP groups, using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from an
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then forming groups from seeds and inserting
> "connection" nodes (merging two subgraphs with fewer lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
> 

Nice!  Thanks for the information!  This improvement sounds big and
promising!  If we can discover SLP opportunities originating from loads,
this case isn't a big deal then.

> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).

It looks like this case could be an input for your work, since the isomorphic
computations are easy to detect from the loads.

>> A-way requires some additional vector permutations.  However, I thought
>> that if the existing scheme can't get any SLP chances, it looks reasonable
>> to extend it to consider this A-way grouping.  Does that make sense?
>>
>> Another question: even if we can go with A-way grouping, it can
>> only handle packing one byte (four iterations -> 4); where would it be
>> reasonable to extend it to pack more?  How about scanning all leaf
>> nodes and considering/transforming them together?  Too hacky?
> 
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big.  But surely concentrating on
> load/store groups is good.
> 

Totally agreed.  This hacky idea originated from the existing code; if SLP
discovery improves, I think it becomes unnecessary.

> The SLP discovery code is already quite complicated, btw.; I'd
> hate to add "unstructured" hacks on top of it right now without
> future design goals in mind.

OK.  Looking forward to its landing!

BR,
Kewen



Re: How to extend SLP to support this case

2020-03-10 Thread Kewen.Lin via Gcc
Hi Tamar,

on 2020/3/10 7:31 PM, Tamar Christina wrote:
> 
>> -Original Message-
>> From: Gcc  On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin 
>> Cc: GCC Development ; Segher Boessenkool
>> 
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin  wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this?  Currently the processing
>>> starts from the bottom up (from stores); can we do some analysis on the
>>> SLP instance, detect some pattern, and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AArch64 complex rotations
>> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd
>> like to improve SLP discovery, but it's on my list only after I manage to
>> get rid of the non-SLP code paths.  I have played with some ideas (even
>> produced hackish patches) to find "seeds" from which to form SLP groups,
>> using multi-level hashing of stmts.
> 
> I still have this but missed the stage-1 deadline after doing the rewriting
> to C++ 😊
> 
> We've also been looking at this, and the approach I'm investigating now is
> trying to get the SLP codepath to handle this after it's been fully
> unrolled.  I'm looking into whether build-slp can be improved to work for
> the group size == 16 case that it tries but fails on.
> 

Thanks!  Glad to know you have been working on this!

Yes, I saw that the standalone SLP pass finally splits the group (16 store stmts).

> My intention is to see if doing so would make it simpler to recognize this
> as just 4 linear loads and two permutes.  I think the loop-aware SLP will
> have a much harder time with this, given the load permutations it thinks it
> needs because of the permutes caused by the +/- pattern.

I may be missing something; just to double-check, do you mean making it 4
linear loads for each of p1/p2?  In the optimal vectorized version, p1 and p2
each have 4 separate loads and construction, then further permutations.

> 
> One idea I had before, from your comment on the complex number patch, is to
> try to move up TWO_OPERATORS and always undo the permute when doing +/-.
> This would simplify the load permute handling, and if a target doesn't have
> an instruction to support this, it would just fall back to doing an explicit
> permute after the loads.  But I wasn't sure this approach would get me the
> results I wanted.

IIUC, we have to seek either <a0, a1, a2, a3> (grouping within one iteration)
or <a0_iter0, a0_iter1, a0_iter2, a0_iter3> (grouping across iterations),
since either can leverage the isomorphic byte loads, subtraction, shift and
addition.

I was thinking that the SLP pattern matcher could detect the pattern with two
levels of TWO_OPERATORS, one level with t/0,1,2,3/, the other with a/0,1,2,3/,
as well as the dependent isomorphic computations for a/0,1,2,3/, and transform
it into isomorphic subtraction, int promotion, shift and addition.

> In the end you don't want a loop here at all.  And in order to do the above
> with TWO_OPERATORS I would have to let the SLP pattern matcher reduce the
> group size and increase the number of iterations during the matching;
> otherwise the matching itself becomes quite difficult in certain cases.
> 

OK, it sounds like that cannot reach the optimal version, which requires all
16 bytes (0-3 or 4-7 x 4 iterations).

BR,
Kewen



[no subject]

2020-03-10 Thread busterduke73--- via Gcc



Sent from my iPhone


Re: How to extend SLP to support this case

2020-03-10 Thread Kewen.Lin via Gcc
Hi Richi,

on 2020/3/10 7:14 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 12:12 PM Richard Biener
>  wrote:
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin  wrote:
>>>
>>> Hi all,
>>>
>>> I'm investigating whether GCC can vectorize the below case on ppc64le.
>>>
>>>   extern void test(unsigned int t[4][4]);
>>>
>>>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>>>   {
>>>     unsigned int tmp[4][4];
>>>     unsigned int a0, a1, a2, a3;
>>>
>>>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>>>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>>>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>>>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>>>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>>>
>>>       int t0 = a0 + a1;
>>>       int t1 = a0 - a1;
>>>       int t2 = a2 + a3;
>>>       int t3 = a2 - a3;
>>>
>>>       tmp[i][0] = t0 + t2;
>>>       tmp[i][2] = t0 - t2;
>>>       tmp[i][1] = t1 + t3;
>>>       tmp[i][3] = t1 - t3;
>>>     }
>>>     test(tmp);
>>>   }
>>>
...
>>> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/ together,
>>> e.g.:
>>>   tmp[i][0] i=/0,1,2,3/ (one group)
>>>   tmp[i][1] i=/0,1,2,3/ (one group)
>>>   tmp[i][2] i=/0,1,2,3/ (one group)
>>>   tmp[i][3] i=/0,1,2,3/ (one group)
>>>
>>> whose tmp[i][j] groups have the same isomorphic computations.  But
>>> currently SLP is unable to divide groups this way (call it A-way for now).
>>>
>>> It's understandable since it has better adjacent store groups like:
>>>   tmp[0][i] i=/0,1,2,3/ (one group)
>>>   tmp[1][i] i=/0,1,2,3/ (one group)
>>>   tmp[2][i] i=/0,1,2,3/ (one group)
>>>   tmp[3][i] i=/0,1,2,3/ (one group)
> 
> Note this is how the non-SLP path will (try to) vectorize the loop.
> 

Oops, sorry for the confusion from my poor writing; it's intended to show
how the current SLP groups those 16 stores tmp[i][j] i,j=/0,1,2,3/ when
completely unrolled.  I saw it finally split the 16 stmts into 4 groups
this way.

BR,
Kewen