LGTM, thanks.
> -----Original Message----- > From: Beignet [mailto:[email protected]] On Behalf Of > Zhigang Gong > Sent: Monday, December 15, 2014 09:18 > To: Gong, Zhigang > Cc: [email protected] > Subject: Re: [Beignet] [PATCH] Update optimization tips. > > Ping for review. > > On Fri, Dec 05, 2014 at 04:15:03PM +0800, Zhigang Gong wrote: > > Signed-off-by: Zhigang Gong <[email protected]> > > --- > > docs/optimization-guide.mdwn | 106 > > +++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 92 insertions(+), 14 deletions(-) > > > > diff --git a/docs/optimization-guide.mdwn > > b/docs/optimization-guide.mdwn index 8fb29a6..5f648fb 100644 > > --- a/docs/optimization-guide.mdwn > > +++ b/docs/optimization-guide.mdwn > > @@ -1,28 +1,106 @@ > > Optimization Guide > > ==================== > > > > -All the SIMD optimization principle also apply to Beignet optimization. > > -Furthermore, there are some special tips for Beignet optimization. > > +All the SIMD optimization principles such as avoid branching and > > +don't waste SIMD lanes are also applied to Beignet optimization on > > +Gen platform. Furthermore, there are some special tips for Beignet > optimization. > > > > 1. It is recommended to choose multiple of 16 work group size. Too much > SLM usage may reduce parallelism at group level. > > If kernel uses large amount SLM, it's better to choose large work group > size. Please refer the following table for recommendations > > - with some SLM usage. > > + with some SLM usage. > > + > > | Amount of SLM | 0 | 4K | 8K | 16K | 32K | > > -| WorkGroup size| 16 | 64 | 128 | 256 | 512 | > > +| WorkGroup size| 16 | 64 | 128 | 256 | 512 | > > + > > + Actually, a good method is to pass in a NULL local work size parameter > > to > let the driver to determine the best work group size for you. > > + > > +1. Use shorter data type could get better performance. There are also > some detail tips as below. > > + 1. Use uchar16/ushort8/uint4 as much as possible. > > + 1. If the data has to be DWORD(4 bytes) unaligned, it's better to use > vload16(for char), vload8(for short) to load the data. > > + 1. Read/write scalar char/short will be very slow and may lead to be even > worse performance than use DW type. > > + > > +1. Avoid too strided global/constant memory access. > > + > > + Some examples are as below (assume the data is a cache line aligned > global/constant uint buffer, and the work group size is 16 with SIMD16 mode): > > + `uint x = data[get_global_id(0)]; //best, only read one cache line, > > no > bandwidth waste` > > + `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines, > > + not good, waste half of the bandwidth` `uint x = > > + data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste > > + 15/16 bandwidth.` > > + > > +1. Avoid dynamic indexed private buffer if possible. > > + Currently, private buffer access in beignet backend is very slow. Many > small private buffer could be optimized by the compiler. > > + But the following type of dynamic indexed private buffer could not be > optimized: > > + > > +` > > + uint private_buffer[32]; > > + for (i = 0; i < xid; i++) { > > + int dynamic_idx = src[xid]; > > + private_buffer[dynamic_idx % 10] = src1[xid]; > > + ... > > + } > > +` > > + > > + The following case is OK. > > + > > +` > > + ... > > + uint private_buffer[32]; > > + for (i = 0; i < xid; i++) { > > + private_buffer[xid % 32] = src1[xid]; > > + ... > > + } > > +` > > + > > + > > +1. Use SLM to reduce the memory bandwidth requirement if possible. > > + > > + On Gen platform, SLM is in GPU's L3 cache, if it could be used to > > + share data between work items, it could reduce the memory bandwidth > > + on the system memory bus. This will be a big win for many I/O > > + intensity kernels. > > + > > +1. Avoid SLM bank conflicts. > > + > > + SLM is banked at a DWORD granularity, totally 16 banks. Access on > > + the same bank with different addresses will lead to a conflicts. It > > should > be avoided. > > + The worst case is: > > + > > + Some examples are as below (assume the data is a cache line aligned > global/constant uint buffer, and the work group size is 16 with SIMD16 mode): > > + `uint x = data[get_global_id(0)]; //best, no bank conflicts, no > bandwidth waste` > > + `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no > > + bandwidth waste` `uint x = data[get_global_id(0) * 2]; //bad, work > > + item (id) and (id + 8) conflict to each other, waste half of the > > + bandwidth` `uint x = data[get_global_id(0) * 16]; //worst, all work > > + items conflicts on the zero bank, waste 15/16 bandwidth.` > > + > > +1. Zero copy on buffer creation. (Only avaliable in git master branch and > Release\_v1.0 branch). > > + > > + Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page > > + aligned host pointer which also has multiple page size. Beignet > > + will leverage userptr to create a buffer object by using that host > > + buffer directly. If possible, you can also use > > + CL\_MEM\_ALLOC\_HOST\_PTR flag to let the driver to allocate a > > + userptr qualified buffer which could guarantee zero copy on the buffer. > > + > > + Please be noted, this feature requires the kernel is newer than 3.16 and > the libdrm version is newer than 2.4.57. > > + > > +1. Use float data type as much as possible. > > + > > + The two ALUs of one EU could both handle float data,but only one of > them could handle non-float type data. > > + > > +1. Avoid using long. > > + > > + GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long > support is still under development. > > + > > +1. Declare small constant buffer with content in the kernel if possible. > > > > -2. GEN7's read/write on global memory with DWORD and DWORD4 are > significantly faster than read/write on BYTE/WORD. > > - Use DWORD or DWORD4 to access data in global memory if possible. If > you cannot avoid the byte/word access, try to do it on SLM. > > + For a small constant buffer, it's better to declare it in the kernel > > directly > with "const \_\_constant". The compiler may optimize it if the buffer is > defined inside kernel. > > > > -3. Use float data type as much as possible. > > +1. Avoid unnecessary synchronizations. > > > > -4. Avoid using long. GEN7's performance for long integer is poor. > > + Both in the runtime and in the kernel. For examples, clFinish and > clWaitForEvents in runtime and barrier() in the kernel. > > > > -5. If there is a small constant buffer, define it in the kernel instead of > > using > the constant buffer argument if possible. > > - The compiler may optimize it if the buffer is defined inside kernel. > > +1. Consider native version of math built-ins, such as native\_sin, > native\_cos, if your kernel is not precision sensitive. > > > > -6. Avoid unnecessary synchronizations, both in the runtime and in the > kernel. For examples, clFinish and clWaitForEvents in runtime > > - and barrier() in the kernel. > > +1. Use fma()/mad() as much as possible. > > > > -7. Consider native version of math built-ins, such as native\_sin, > native\_cos, if your kernel is not precision sensitive. > > +1. Try to eliminate branching as much as possible. > > > > -8. Try to eliminate branching as much as possible. For example using min, > max, clamp or select built-ins instead of if/else if possible. > > + For example using min, max, clamp or select built-ins instead of if/else > > if > possible. > > -- > > 1.8.3.2 > > > > _______________________________________________ > > Beignet mailing list > > [email protected] > > http://lists.freedesktop.org/mailman/listinfo/beignet > _______________________________________________ > Beignet mailing list > [email protected] > http://lists.freedesktop.org/mailman/listinfo/beignet _______________________________________________ Beignet mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/beignet
