On Tue, Aug 20, 2013 at 6:38 PM, Xinliang David Li <davi...@google.com> wrote:
> On Tue, Aug 20, 2013 at 3:59 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> Xinliang David Li <davi...@google.com> wrote:
>>> On Mon, Aug 19, 2013 at 11:53 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> Xinliang David Li <davi...@google.com> wrote:
>>>>> +cc auto-vectorizer maintainers.
>>>>>
>>>>> David
>>>>>
>>>>> On Mon, Aug 19, 2013 at 10:37 AM, Cong Hou <co...@google.com> wrote:
>>>>>> Nowadays, SIMD instructions play an increasingly important role in
>>>>>> our daily computations. AVX and AVX2 have extended the 128-bit
>>>>>> registers to 256-bit ones, and the newly announced AVX-512 further
>>>>>> doubles the size. The benefit we can get from vectorization will
>>>>>> only grow. Enabling it earlier is also common practice in other
>>>>>> compilers:
>>>>>>
>>>>>> 1) Intel's ICC has turned on its vectorizer at O2 by default for
>>>>>> many years;
>>>>>>
>>>>>> 2) Most recently, LLVM turned it on for both O2 and Os.
>>>>>>
>>>>>> Here we propose moving vectorization from -O3 to -O2 in GCC. The
>>>>>> three main concerns about this change are: 1. Does vectorization
>>>>>> greatly increase the generated code size? 2. How much performance
>>>>>> can be improved? 3. Does vectorization increase compile time
>>>>>> significantly?
>>>>>>
>>>>>> I have fixed the GCC bootstrap failure with the vectorizer turned
>>>>>> on (http://gcc.gnu.org/ml/gcc-patches/2013-07/msg00497.html). To
>>>>>> evaluate the size and performance impact, experiments on SPEC06 and
>>>>>> internal benchmarks were done. Based on the data, I have tuned the
>>>>>> vectorizer parameters, which reduces the code bloat without
>>>>>> sacrificing the performance gain. There are some performance
>>>>>> regressions in SPEC06; the root causes have been analyzed and
>>>>>> understood, and I will file bugs tracking them independently.
>>>>>> The experiments failed on three benchmarks (please refer to
>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56993). The experiment
>>>>>> results are attached here as two PDF files. Below is our summary of
>>>>>> the results:
>>>>>>
>>>>>> 1) We noticed that vectorization could increase the generated code
>>>>>> size, so we tried to suppress this with some tunings, which include
>>>>>> raising the minimum loop bound so that loops with small iteration
>>>>>> counts won't be vectorized, and disabling loop versioning. The
>>>>>> average size increase dropped from 9.84% to 7.08% after our tunings
>>>>>> (13.93% to 10.75% for Fortran benchmarks, and 3.55% to 1.44% for
>>>>>> C/C++ benchmarks). The code size increase for individual Fortran
>>>>>> benchmarks can be significant (from 18.72% to 34.15%), but the
>>>>>> performance gain is also huge. Hence we think this size increase is
>>>>>> reasonable. For C/C++ benchmarks, the size increase is very small
>>>>>> (below 3% except for 447.dealII).
>>>>>>
>>>>>> 2) Vectorization improves the performance of most benchmarks by
>>>>>> around 2.5%-3% on average, and by much more for Fortran benchmarks.
>>>>>> On Sandy Bridge machines, the improvement is larger when using
>>>>>> -march=corei7 (3.27% on average) or -march=corei7-avx (4.81% on
>>>>>> average) (please see the attachment for details). We also noticed
>>>>>> some performance degradations, and after investigation we found
>>>>>> that some are caused by defects in GCC's vectorization (e.g. GCC's
>>>>>> SLP cannot vectorize a group of accesses if the group size is not
>>>>>> divisible by the VF,
>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955, and any data
>>>>>> dependence between statements can prevent vectorization), which can
>>>>>> be resolved in the future.
>>>>>>
>>>>>> 3) Lastly, we found that introducing vectorization hardly affects
>>>>>> build time.
>>>>>> The GCC bootstrap time increase is negligible.
>>>>>>
>>>>>> As a reference, Richard Biener is also proposing to move
>>>>>> vectorization to O2 by improving the cost model
>>>>>> (http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00904.html).
>>>>
>>>> And my conclusion is that we are not ready for this. The benefit
>>>> does not outweigh the compile-time cost.
>>>
>>> Can you elaborate on your reasoning?
>>
>> I have done measurements with SPEC 2006, selectively turning on parts
>> of the vectorizer at O2. Vectorizing has both a compile-time (around
>> 10%) and a code-size (up to 15%) impact.
>
> Cong only did some compile-time measurement with GCC bootstrap -- and
> the impact is very small. He can confirm the compile-time impact with
> his tuning.
>
> From Cong's data, benchmarks with a large size increase also come with
> a huge performance improvement:
>
> o cactusADM -- size increases 18.7%, performance improves 37.5%
> o leslie3d -- size increases 34.15%, performance improves 29.4%
> ... etc.
>
> The CPU savings here are much larger than the cost of the small RAM
> increase from the larger text. For applications that really care about
> size, Os should be used anyway.
>
> For the compile-time increase, do you see a similar pattern -- i.e.,
> large compile-time increase --> large performance improvement?
The correlation is rather large code-size increase -> large compile-time
increase. Most of the time is not spent in the vectorizer itself but in
subsequent loop optimization passes. The effect on runtime is not
correlated to either (which means the vectorizer cost model is rather
bad), and integer code usually does not benefit at all.

> On the other hand, a 10% compile-time increase due to one pass sounds
> excessive -- there might be some low-hanging fruit to reduce the
> compile-time increase.

I have already spent two man-months speeding up the vectorizer itself;
I don't think there is any low-hanging fruit left there. But see above --
most of the compile time is due to the cost of processing the extra
loop copies.

>> At the full feature set, vectorization regresses the runtime of quite
>> a number of benchmarks significantly. At a reduced feature set --
>> basically trying to vectorize only obviously profitable cases -- these
>> regressions can be avoided, but progressions remain on only two SPEC
>> fp cases. As most user applications fall into the SPEC int category, a
>> 10% compile-time and 15% code-size regression for no gain is no good.
>
> Cong's data (especially corei7 and corei7-avx) show a more significant
> performance improvement. If the 10% compile-time increase is across the
> board and happens on benchmarks with no performance improvement, it is
> certainly bad -- but I am not sure that is the case.

Note that we are talking about -O2 -- people who enable -march=corei7
usually know to use -O3 or FDO anyway.

That said, I expect 99% of used software (probably rather 99.99999%) is
not compiled on the system it runs on, but compiled to run on generic
hardware, and thus restricts itself to bare x86_64 SSE2 features. So
what matters for enabling the vectorizer at -O2 are the default
architecture features of the given architecture(!) -- and remember to
consider not only x86 here!
> A couple of points I'd like to make:
>
> 1) The loop vectorizer passes the quality threshold to be turned on by
> default at O2 in 4.9;

It is already turned on for FDO at O2. With FDO we have a _much_ better
way of reasoning about which loops to spend the compile time and code
size on! That is exactly the problem that exists without FDO at -O2
(and also at -O3, but -O3 is not claimed to be well balanced with
regard to compile time and code size).

> 2) There is still lots of room for improvement in the loop vectorizer
> -- there is no doubt about it, and we will need to continue improving
> it;

I believe we have to do that first. See the patches regarding the cost
model reorganization that I posted with the proposal for enabling
vectorization at -O2. One large source of collateral damage from
vectorization is if-conversion, which aggressively if-converts loops
regardless of whether we later vectorize the result. The if-conversion
pass needs to be integrated with vectorization.

> 3) The only fast way to improve a feature is to get it used widely, so
> that people can file bugs and report problems -- it is hard for
> developers to find and collect all the cases where GCC is weak without
> the GCC community's help. There might be a temporary regression for
> some users, but it is worth the pain.

Well, introducing known regressions at -O2 is not how this works.
Vectorization is already widely tested, and you can look at a plethora
of bug reports about missed features and vectorizer wrongdoings to
improve it.

> 4) Not the most important one, but a practical concern: without
> turning it on, GCC will be at a great disadvantage when people start
> benchmarking the latest GCC against other compilers.

The same argument was made about the fact that GCC does not optimize by
default but uses -O0. It's a straw-man argument. All "benchmarking" I
see uses -O3 or -Ofast already.
To make vectorization have a bigger impact on day-to-day software, GCC
would need to start versioning for the target sub-architecture -- which
of course increases the code-size and compile-time issues.

Richard.

> thanks,
>
> David
>
>> Richard.
>>
>>> thanks,
>>>
>>> David
>>>
>>>> Richard.
>>>>
>>>>>> Vectorization has great performance potential -- the more people
>>>>>> use it, the more likely it will be further improved -- turning it
>>>>>> on at O2 is the way to go ...
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Cong Hou