Ping for review.
On Fri, Dec 05, 2014 at 04:15:03PM +0800, Zhigang Gong wrote:
> Signed-off-by: Zhigang Gong <[email protected]>
> ---
>  docs/optimization-guide.mdwn | 106 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 92 insertions(+), 14 deletions(-)
>
> diff --git a/docs/optimization-guide.mdwn b/docs/optimization-guide.mdwn
> index 8fb29a6..5f648fb 100644
> --- a/docs/optimization-guide.mdwn
> +++ b/docs/optimization-guide.mdwn
> @@ -1,28 +1,106 @@
>  Optimization Guide
>  ====================
>
> -All the SIMD optimization principle also apply to Beignet optimization.
> -Furthermore, there are some special tips for Beignet optimization.
> +All the general SIMD optimization principles, such as avoiding branches and not
> +wasting SIMD lanes, also apply to Beignet optimization on the Gen platform. Furthermore,
> +there are some special tips for Beignet optimization.
>
>  1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level.
>     If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
> -   with some SLM usage.
> +   with some SLM usage.
> +
>  | Amount of SLM | 0 | 4K | 8K | 16K | 32K |
> -| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
> +
> +   Actually, a good approach is to pass a NULL local work size parameter and let the driver determine the best work group size for you.
> +
> +1. Using shorter data types can give better performance. Some more detailed tips are below.
> +   1. Use uchar16/ushort8/uint4 as much as possible.
> +   1. If the data has to be DWORD (4 byte) unaligned, it's better to use vload16 (for char) or vload8 (for short) to load the data.
> +   1. Reading/writing scalar char/short is very slow and may perform even worse than using a DWORD type.
> +
> +1. Avoid strided global/constant memory access.
> +
> +   Some examples are below (assume data is a cache-line-aligned global/constant uint buffer, and the work group size is 16 in SIMD16 mode):
>
> +   `uint x = data[get_global_id(0)];      //best, reads only one cache line, no bandwidth waste`
> +   `uint x = data[get_global_id(0) + 1];  //bad, crosses 2 cache lines, wastes half of the bandwidth`
> +   `uint x = data[get_global_id(0) * 16]; //worst, crosses 16 cache lines, wastes 15/16 of the bandwidth`
>
> +1. Avoid dynamically indexed private buffers if possible.
> +
> +   Currently, private buffer access in the Beignet backend is very slow. Many small private buffers can be optimized away by the compiler,
> +   but the following kind of dynamically indexed private buffer cannot be optimized:
> +
> +       uint private_buffer[32];
> +       for (i = 0; i < xid; i++) {
> +         int dynamic_idx = src[xid];
> +         private_buffer[dynamic_idx % 10] = src1[xid];
> +         ...
> +       }
> +
> +   The following case is fine:
> +
> +       ...
> +       uint private_buffer[32];
> +       for (i = 0; i < xid; i++) {
> +         private_buffer[xid % 32] = src1[xid];
> +         ...
> +       }
> +
> +1. Use SLM to reduce the memory bandwidth requirement if possible.
> +
> +   On the Gen platform, SLM lives in the GPU's L3 cache. If it can be used to
> +   share data between work items, it reduces the bandwidth needed on the
> +   system memory bus. This is a big win for many I/O-intensive kernels.
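> +
> +   As an illustration only (the kernel and buffer names below are made up, and a
> +   work group size of 16 is assumed), each work item loads one element into SLM once,
> +   and the neighbouring reads are then served from SLM instead of extra global reads:
> +
> +       __kernel void average3(__global const float *src, __global float *dst)
> +       {
> +           __local float tile[16];            /* one slot per work item in the group */
> +           int gid = get_global_id(0);
> +           int lid = get_local_id(0);
> +
> +           tile[lid] = src[gid];              /* each work item reads global memory once */
> +           barrier(CLK_LOCAL_MEM_FENCE);      /* make the shared copies visible to the group */
> +
> +           /* neighbouring values come from SLM instead of extra global reads;
> +              wrap-around inside the group, cross-group boundaries are ignored
> +              to keep the sketch short */
> +           float left  = tile[(lid + 15) % 16];
> +           float right = tile[(lid + 1) % 16];
> +           dst[gid] = (left + tile[lid] + right) / 3.0f;
> +       }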
> +
> +1. Avoid SLM bank conflicts.
> +
> +   SLM is banked at DWORD granularity into 16 banks. Accesses to the same
> +   bank at different addresses conflict with each other and should be avoided.
> +   Some examples are below (assume data is a DWORD-aligned SLM uint array, and the work group size is 16 in SIMD16 mode):
>
> +   `uint x = data[get_global_id(0)];      //best, no bank conflicts, no bandwidth waste`
> +   `uint x = data[get_global_id(0) + 1];  //best, no bank conflicts, no bandwidth waste`
> +   `uint x = data[get_global_id(0) * 2];  //bad, work items (id) and (id + 8) conflict with each other, wasting half of the bandwidth`
> +   `uint x = data[get_global_id(0) * 16]; //worst, all work items conflict on bank zero, wasting 15/16 of the bandwidth`
> +
> +1. Zero copy on buffer creation. (Only available in the git master branch and the Release\_v1.0 branch.)
> +
> +   Use CL\_MEM\_USE\_HOST\_PTR to create the buffer, and pass in a page-aligned
> +   host pointer whose size is a multiple of the page size. Beignet
> +   will leverage userptr to create a buffer object backed by that
> +   host buffer directly. Alternatively, you can use the CL\_MEM\_ALLOC\_HOST\_PTR
> +   flag to let the driver allocate a userptr-qualified buffer, which
> +   guarantees zero copy on the buffer.
> +
> +   Please note that this feature requires a kernel newer than 3.16 and a libdrm version newer than 2.4.57.
> +
> +1. Use the float data type as much as possible.
> +
> +   Both ALUs of an EU can handle float data, but only one of them can handle non-float data.
> +
> +1. Avoid using long.
> +
> +   Gen7 and Gen7.5 don't support long natively, and Gen8's native long support is still under development.
> +
> +1. Declare small constant buffers with their content in the kernel if possible.
>
> -2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.
> -   Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
> +   For a small constant buffer, it's better to declare it directly in the kernel with "const \_\_constant". The compiler may optimize it if the buffer is defined inside the kernel.
>
> -3. Use float data type as much as possible.
> +1. Avoid unnecessary synchronizations.
>
> -4. Avoid using long. GEN7's performance for long integer is poor.
> +   This applies both in the runtime and in the kernel, for example clFinish and clWaitForEvents in the runtime and barrier() in the kernel.
>
> -5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.
> -   The compiler may optimize it if the buffer is defined inside kernel.
> +1. Consider the native versions of math built-ins, such as native\_sin and native\_cos, if your kernel is not precision sensitive.
>
> -6. Avoid unnecessary synchronizations, both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime
> -   and barrier() in the kernel.
> +1. Use fma()/mad() as much as possible.
>
> -7. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
> +1. Try to eliminate branching as much as possible.
>
> -8. Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.
> +   For example, use the min, max, clamp or select built-ins instead of if/else where possible, as in the sketch below.
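> +
> +   As a small made-up illustration of the last point (the kernel name and the
> +   limit parameter are placeholders), the branchy 0..limit saturation below is
> +   replaced by a single clamp() call:
> +
> +       __kernel void saturate(__global const float *in, __global float *out, float limit)
> +       {
> +           int gid = get_global_id(0);
> +           float x = in[gid];
> +
> +           /* branchy form:
> +            *   if (x < 0.0f)        out[gid] = 0.0f;
> +            *   else if (x > limit)  out[gid] = limit;
> +            *   else                 out[gid] = x;
> +            * branch-free form with a built-in:
> +            */
> +           out[gid] = clamp(x, 0.0f, limit);
> +       }
> +
> +   select(b, a, cond) covers the general (cond ? a : b) choice in the same branch-free style.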
> -- 
> 1.8.3.2
