On Fri, Aug 23, 2013 at 02:16:35PM +0200, Richard Biener wrote:
> Xinliang David Li <davi...@google.com> wrote:
> >Interesting idea!
>
> In the past we have already arranged for re-use of the epilogue loop
> and the scalar loop, so the situation was even worse.
>
> Note that re-use prevents complete peeling of the epilogue, which is
> often profitable. Combining the prologue will introduce a mispredicted
> branch, which can be harmful.
>
> So, certainly interesting, but not always easily profitable.

While we are on the subject of mispredicted branches, consider testing
the following:
#include <xmmintrin.h>

#ifndef TYPE
#define TYPE float   /* the intrinsics below assume TYPE is float */
#endif

void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
{
  int i;
  __m128 veca, vecb, vecc;

  /* Vector path: enough elements, and neither b nor c partially
     overlaps the vectors stored through a (a crude overlap check).  */
  if (n >= 4 && (b >= a + 4 || b + 4 <= a) && (c >= a + 4 || c + 4 <= a))
    {
      /* First vector, unaligned.  */
      vecb = _mm_loadu_ps (b);
      vecc = _mm_loadu_ps (c);
      veca = _mm_mul_ps (vecb, vecc);
      _mm_storeu_ps (a, veca);

      /* Main loop: stop before reading past the end of the arrays.  */
      for (i = 4; i + 4 <= n; i += 4)
        {
          vecb = _mm_loadu_ps (b + i);
          vecc = _mm_loadu_ps (c + i);
          veca = _mm_mul_ps (vecb, vecc);
          _mm_storeu_ps (a + i, veca);
        }

      /* Final vector, overlapping the main loop when n is not a
         multiple of 4 -- no scalar remainder loop, no extra branch.  */
      vecb = _mm_loadu_ps (b + n - 4);
      vecc = _mm_loadu_ps (c + n - 4);
      veca = _mm_mul_ps (vecb, vecc);
      _mm_storeu_ps (a + n - 4, veca);
      return;
    }

  /* Scalar fallback.  */
  for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}

> >>
> >> thanks,
> >>
> >> Cong
> >>
> >> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li
> >> <davi...@google.com> wrote:
> >>>
> >>> > The effect on runtime is not correlated to either (which means
> >>> > the vectorizer cost model is rather bad), but integer code
> >>> > usually does not benefit at all.
> >>>
> >>> The cost model does need some tuning. For instance, the GCC
> >>> vectorizer peels aggressively, but in many cases peeling can be
> >>> avoided while still getting good performance -- even when the
> >>> target has no efficient unaligned load/store to implement
> >>> unaligned accesses. GCC reports too high a cost for unaligned
> >>> accesses and too low a cost for the peeling overhead.
> >>>
> >>> Example:
> >>>
> >>> #ifndef TYPE
> >>> #define TYPE float
> >>> #endif
> >>> #include <stdlib.h>
> >>>
> >>> __attribute__((noinline)) void
> >>> foo (TYPE *a, TYPE *b, TYPE *c, int n)
> >>> {
> >>>   int i;
> >>>   for (i = 0; i < n; i++)
> >>>     a[i] = b[i] * c[i];
> >>> }
> >>>
> >>> int g;
> >>>
> >>> int
> >>> main ()
> >>> {
> >>>   int i;
> >>>   float *a = (float *) malloc (100000 * 4);
> >>>   float *b = (float *) malloc (100000 * 4);
> >>>   float *c = (float *) malloc (100000 * 4);
> >>>
> >>>   for (i = 0; i < 100000; i++)
> >>>     foo (a, b, c, 100000);
> >>>
> >>>   g = a[10];
> >>> }
> >>>
> >>> 1) By default, GCC's vectorizer peels the loop in foo so that the
> >>> access to 'a' is aligned and uses movaps; the other accesses use
> >>> movups when -march=corei7 is given.
> >>> 2) Same as above, but with -march=x86-64: the access to 'b' is
> >>> split into movlps and movhps, and likewise for 'c'.
> >>> 3) Peeling disabled (via a hack) with -march=corei7: all three
> >>> accesses use movups.
> >>> 4) Peeling disabled with -march=x86-64: all three accesses use
> >>> movlps/movhps.
> >>>
> >>> Performance:
> >>>
> >>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text
> >>> is 1462 bytes versus 1622 bytes for 1).
> >>> 2) and 4) and no vectorization -- all very slow: 4.8s.
> >>>
> >>> Observations:
> >>> a) If properly tuned for corei7, GCC should pick 3) instead of
> >>> 1) -- this is not possible today.
> >>> b) With -march=x86-64, GCC should figure out that the benefit of
> >>> vectorizing this loop is small and bail out.
> >>>
> >>> >> On the other hand, a 10% compile-time increase due to one pass
> >>> >> sounds excessive -- there might be some low-hanging fruit to
> >>> >> reduce it.
> >>> >
> >>> > I have already spent two man-months speeding up the vectorizer
> >>> > itself; I don't think there is any low-hanging fruit left there.
> >>> > But see above -- most of the compile time is due to the cost of
> >>> > processing the extra loop copies.
> >>>
> >>> Ok.
> >>>
> >>> I did not notice your patch (from May this year) until recently.
> >>> Do you plan to check it in (other than the part that turns it on
> >>> at O2)? The cost model part of the changes is largely independent.
> >>> If it goes in, it will serve as a good basis for further tuning.
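To make the peeling trade-off in variant 1) above concrete, here is
roughly what peeling so that the store to 'a' is aligned looks like in
source form. This is a hand-written sketch of the shape of the
transformation, not GCC's actual output, and foo_peeled is a made-up
name:

#include <stdint.h>
#include <xmmintrin.h>

void foo_peeled (float *a, float *b, float *c, int n)
{
  int i = 0;

  /* Scalar prologue: peel iterations until 'a' reaches a 16-byte
     boundary (assumes 'a' is at least 4-byte aligned, as malloc
     guarantees).  */
  while (i < n && ((uintptr_t) (a + i) & 15) != 0)
    {
      a[i] = b[i] * c[i];
      i++;
    }

  /* Vector body: 'a' is now aligned (movaps); 'b' and 'c' may still
     be misaligned (movups on corei7, movlps/movhps on plain x86-64).  */
  for (; i + 4 <= n; i += 4)
    _mm_store_ps (a + i, _mm_mul_ps (_mm_loadu_ps (b + i),
                                     _mm_loadu_ps (c + i)));

  /* Scalar epilogue for the remainder iterations.  */
  for (; i < n; i++)
    a[i] = b[i] * c[i];
}

The scalar prologue and epilogue are where the extra 160 bytes of text
in 1) come from; variant 3) drops both and simply uses movups for all
three accesses.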
> >>>
> >>> >>> at full feature set, vectorization regresses the runtime of
> >>> >>> quite a number of benchmarks significantly. At a reduced
> >>> >>> feature set -- basically trying to vectorize only obviously
> >>> >>> profitable cases -- these regressions can be avoided, but
> >>> >>> progressions remain on only two SPEC fp cases. As most user
> >>> >>> applications fall into the SPEC int category, a 10%
> >>> >>> compile-time and 15% code-size regression for no gain is no
> >>> >>> good.
> >>> >>
> >>> >> Cong's data (especially corei7 and corei7-avx) shows more
> >>> >> significant performance improvement. If the 10% compile-time
> >>> >> increase is across the board and happens on benchmarks with no
> >>> >> performance improvement, it is certainly bad -- but I am not
> >>> >> sure that is the case.
> >>> >
> >>> > Note that we are talking about -O2 -- people that enable
> >>> > -march=corei7 usually know to use -O3 or FDO anyway.
> >>>
> >>> Many people use FDO, but not all -- there are still some barriers
> >>> to adoption. And there are reasons people may not want to use O3:
> >>> 1) People feel most comfortable with O2 because it is considered
> >>> the most thoroughly tested optimization level; going with the
> >>> default is the natural choice. FDO is a different beast, as its
> >>> performance benefit can be too high to resist.
> >>> 2) In a distributed build environment with object-file
> >>> caching/sharing, building with O3 (different from the default)
> >>> leads to longer build times.
> >>> 3) The size/compile-time cost of O3 can be too high. On the other
> >>> hand, the benefit of the vectorizer can be very high for many
> >>> kinds of applications -- image processing, stitching, image
> >>> detection, DSP, encoders/decoders -- not just numerical Fortran
> >>> programs.
> >>>
> >>> > That said, I expect 99% of used software (probably rather
> >>> > 99.99999%) is not compiled on the system it runs on, but
> >>> > compiled to run on generic hardware, and thus restricts itself
> >>> > to bare x86_64 SSE2 features. So what matters for enabling the
> >>> > vectorizer at -O2 is the default architecture features of the
> >>> > given architecture(!) -- and remember to not only consider x86
> >>> > here!
> >>> >
> >>> >> A couple of points I'd like to make:
> >>> >>
> >>> >> 1) The loop vectorizer passes the quality threshold to be
> >>> >> turned on by default at O2 in 4.9; it is already turned on for
> >>> >> FDO at O2.
> >>> >
> >>> > With FDO we have a _much_ better way of reasoning about which
> >>> > loops we spend the compile time and code size on! Exactly the
> >>> > problem that exists without FDO at -O2 (and also at -O3, but
> >>> > -O3 is not said to be well-balanced with regard to compile time
> >>> > and code size).
> >>> >
> >>> >> 2) There is still lots of room for improvement in the loop
> >>> >> vectorizer -- there is no doubt about it, and we will need to
> >>> >> continue improving it.
> >>> >
> >>> > I believe we have to do that first. See the patches regarding
> >>> > the cost model reorganization that I posted with the proposal
> >>> > for enabling vectorization at -O2. One large source of
> >>> > collateral damage from vectorization is if-conversion, which
> >>> > aggressively if-converts loops regardless of whether we later
> >>> > vectorize the result. The if-conversion pass needs to be
> >>> > integrated with vectorization.
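To illustrate the if-conversion point with a concrete (made-up)
example: tree if-conversion flattens a conditional store into
straight-line code that the vectorizer can handle, but if the loop is
not vectorized afterwards, the transformation can easily be a loss:

/* Original loop: the store happens only when the condition holds.  */
void cond_store (float *a, float *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    if (b[i] > 0.0f)
      a[i] = b[i];
}

/* After if-conversion (conceptually): an unconditional load and store
   of a[i] on every iteration.  The vectorizer can turn this into
   compares and blends, but as a scalar loop it may well be slower
   than the branchy original, especially when the branch predicts
   well.  */
void cond_store_ifcvt (float *a, float *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] > 0.0f ? b[i] : a[i];
}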
> >>>
> >>> We have noticed some small performance problems with tree
> >>> if-conversion when it is turned on with FDO -- because that pass
> >>> has no cost model (such as looking at branch probabilities, as
> >>> the RTL-level if-cvt does). What other problems do you see? Is it
> >>> just a compile-time concern?
> >>>
> >>> >> 3) The only fast way to improve a feature is to get it used
> >>> >> widely, so that people can file bugs and report problems -- it
> >>> >> is hard for developers to find and collect all the cases where
> >>> >> GCC is weak without the GCC community's help. There might be a
> >>> >> temporary regression for some users, but it is worth the pain.
> >>> >
> >>> > Well, introducing known regressions at -O2 is not how this
> >>> > works. Vectorization is already widely tested, and you can look
> >>> > at a plethora of bug reports about missed features and
> >>> > vectorizer wrong-doings to improve it.
> >>> >
> >>> >> 4) Not the most important one, but a practical concern:
> >>> >> without turning it on, GCC will be at a great disadvantage
> >>> >> when people start benchmarking the latest GCC against other
> >>> >> compilers.
> >>> >
> >>> > The same argument was made about the fact that GCC does not
> >>> > optimize by default but uses -O0. It's a straw-man argument.
> >>> > All "benchmarking" I see uses -O3 or -Ofast already.
> >>>
> >>> People can just do -O2 performance comparisons.
> >>>
> >>> thanks,
> >>>
> >>> David
> >>>
> >>> > To make vectorization have a bigger impact on day-to-day
> >>> > software, GCC would need to start versioning for the target
> >>> > sub-architecture -- which of course increases the issue with
> >>> > code size and compile time.
> >>> >
> >>> > Richard.
> >>> >
> >>> >> thanks,
> >>> >>
> >>> >> David
> >>> >>
> >>> >>> Richard.
> >>> >>>
> >>> >>>> thanks,
> >>> >>>>
> >>> >>>> David
> >>> >>>>
> >>> >>>>> Richard.
> >>> >>>>>
> >>> >>>>>>> Vectorization has great performance potential -- the more
> >>> >>>>>>> people use it, the more likely it is to be further
> >>> >>>>>>> improved -- turning it on at O2 is the way to go ...
> >>> >>>>>>>
> >>> >>>>>>> Thank you!
> >>> >>>>>>>
> >>> >>>>>>> Cong Hou
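On the sub-architecture versioning point: GCC 4.8 already provides
__builtin_cpu_init / __builtin_cpu_supports, so manual runtime
dispatch is possible today. A minimal sketch of the idea (foo_avx,
foo_sse2 and foo_versioned are made-up names; in practice the
variants would more likely live in separately compiled files):

/* AVX variant: the compiler may auto-vectorize this loop using
   VEX-encoded instructions.  */
__attribute__((target("avx"))) static void
foo_avx (float *a, float *b, float *c, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}

/* Baseline variant: plain x86_64 SSE2 code generation.  */
static void
foo_sse2 (float *a, float *b, float *c, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}

static void (*foo_ptr) (float *, float *, float *, int);

/* Choose the variant once, at program startup.  */
__attribute__((constructor)) static void
foo_init (void)
{
  __builtin_cpu_init ();
  foo_ptr = __builtin_cpu_supports ("avx") ? foo_avx : foo_sse2;
}

void foo_versioned (float *a, float *b, float *c, int n)
{
  foo_ptr (a, b, c, n);
}

Each variant duplicates the loop body, which is exactly the code-size
cost mentioned above.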