On Fri, Aug 23, 2013 at 02:16:35PM +0200, Richard Biener wrote:
> Xinliang David Li <davi...@google.com> wrote:
> >Interesting idea!
> 
> In the past we already arranged for re-use of the epilogue loop and the 
> scalar loop, so the situation was even worse.
> 
> Note that re-use prevents complete peeling of the epilogue which is often 
> profitable.  Combining the prologue will introduce a mispredicted branch 
> which can be harmful.
> 
> So, certainly interesting but not easily always profitable.
> 
As for mispredicted branches, try testing the following:

 #include <xmmintrin.h>

 #ifndef TYPE
 #define TYPE float
 #endif

 void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
 {
   int i;
   __m128 veca, vecb, vecc;

   /* Runtime check: enough iterations, and neither b nor c overlaps
      a within one vector's width.  */
   if (n >= 4 && (b >= a + 4 || b + 4 <= a)
       && (c >= a + 4 || c + 4 <= a))
   {
     /* First vector covers elements 0..3.  */
     vecb = _mm_loadu_ps (b);
     vecc = _mm_loadu_ps (c);
     veca = _mm_mul_ps (vecb, vecc);
     _mm_storeu_ps (a, veca);

     /* Unaligned stores throughout, since a's alignment is unknown.  */
     for (i = 4; i + 4 <= n; i += 4)
     {
       vecb = _mm_loadu_ps (b + i);
       vecc = _mm_loadu_ps (c + i);
       veca = _mm_mul_ps (vecb, vecc);
       _mm_storeu_ps (a + i, veca);
     }
     /* Final vector covers the tail, possibly overlapping the previous
        store; the overlapped elements are simply rewritten.  */
     vecb = _mm_loadu_ps (b + n - 4);
     vecc = _mm_loadu_ps (c + n - 4);
     veca = _mm_mul_ps (vecb, vecc);
     _mm_storeu_ps (a + n - 4, veca);

     return;
   }
   for (i = 0; i < n; i++)
     a[i] = b[i] * c[i];
 }


> >>
> >>
> >> thanks,
> >>
> >> Cong
> >>
> >>
> >> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li
> ><davi...@google.com>
> >> wrote:
> >>>
> >>> > The effect on runtime is not correlated to
> >>> > either (which means the vectorizer cost model is rather bad), but
> >>> > integer
> >>> > code usually does not benefit at all.
> >>>
> >>> The cost model does need some tuning. For instance, the GCC vectorizer
> >>> does peeling aggressively, but peeling can in many cases be avoided
> >>> while still gaining good performance -- even when the target does not
> >>> have efficient unaligned loads/stores to implement unaligned accesses.
> >>> GCC reports too high a cost for unaligned accesses and too low a cost
> >>> for the peeling overhead.
> >>>
> >>> Example:
> >>>
> >>> #ifndef TYPE
> >>> #define TYPE float
> >>> #endif
> >>> #include <stdlib.h>
> >>>
> >>> __attribute__((noinline)) void
> >>> foo (TYPE *a, TYPE* b, TYPE *c, int n)
> >>> {
> >>>    int i;
> >>>    for ( i = 0; i < n; i++)
> >>>      a[i] = b[i] * c[i];
> >>> }
> >>>
> >>> int g;
> >>> int
> >>> main()
> >>> {
> >>>    int i;
> >>>    float *a = (float*) malloc (100000*4);
> >>>    float *b = (float*) malloc (100000*4);
> >>>    float *c = (float*) malloc (100000*4);
> >>>
> >>>    for (i = 0; i < 100000; i++)
> >>>       foo(a, b, c, 100000);
> >>>
> >>>
> >>>    g = a[10];
> >>>
> >>> }
> >>>
> >>>
> >>> 1) By default, GCC's vectorizer will peel the loop in foo so that the
> >>> access to 'a' is aligned and uses the movaps instruction. The other
> >>> accesses use movups when -march=corei7 is given.
> >>> 2) Same as above, but with -march=x86-64: the access to 'b' is split
> >>> into 'movlps and movhps', and likewise for 'c'.
> >>>
> >>> 3) Disabling peeling (via a hack) with -march=corei7 -- all three
> >>> accesses use movups.
> >>> 4) Disabling peeling with -march=x86-64 -- all three accesses use
> >>> movlps/movhps.
> >>>
> >>> Performance:
> >>>
> >>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
> >>> 1462 bytes, while 1)'s is 1622 bytes.
> >>> 2) and 4) and no-vectorization -- all very slow -- 4.8s.
> >>>
> >>> Observations:
> >>> a) If properly tuned for corei7, GCC should pick 3) instead of 1) --
> >>> this is not possible today.
> >>> b) With -march=x86-64, GCC should figure out that the benefit of
> >>> vectorizing the loop is small and bail out.
> >>>
> >>> >> On the other hand, a 10% compile-time increase due to one pass
> >>> >> sounds excessive -- there might be some low-hanging fruit to reduce
> >>> >> it.
> >>> >
> >>> > I have already spent two man-months speeding up the vectorizer
> >>> > itself; I don't think there is any low-hanging fruit left there.
> >>> > But see above - most of the compile time is due to the cost of
> >>> > processing the extra loop copies.
> >>> >
> >>>
> >>> Ok.
> >>>
> >>> I did not notice your patch (from May this year) until recently. Do
> >>> you plan to check it in (other than the part that turns it on at O2)?
> >>> The cost-model part of the changes is largely independent. If it is
> >>> in, it will serve as a good basis for further tuning.
> >>>
> >>>
> >>> >> At the full feature set, vectorization significantly regresses the
> >>> >> runtime of quite a number of benchmarks. At a reduced feature set -
> >>> >> basically trying to vectorize only obviously profitable cases -
> >>> >> these regressions can be avoided, but progressions only remain on
> >>> >> two SPEC fp cases. As most user applications fall into the SPEC int
> >>> >> category, a 10% compile-time and 15% code-size regression for no
> >>> >> gain is no good.
> >>> >>>
> >>> >>
> >>> >> Cong's data (especially for corei7 and corei7-avx) show a more
> >>> >> significant performance improvement. If the 10% compile-time
> >>> >> increase is across the board and happens on benchmarks with no
> >>> >> performance improvement, it is certainly bad - but I am not sure
> >>> >> that is the case.
> >>> >
> >>> > Note that we are talking about -O2 - people who enable
> >>> > -march=corei7 usually know to use -O3 or FDO anyway.
> >>>
> >>> Many people use FDO, but not all -- there are still some barriers to
> >>> adoption. There are reasons people may not want to use O3:
> >>> 1) People feel most comfortable using O2 because it is considered the
> >>> most thoroughly tested compiler optimization level; going with the
> >>> default is the natural choice. FDO is a different beast, as its
> >>> performance benefit can be too great to resist.
> >>> 2) In a distributed build environment with object-file
> >>> caching/sharing, building with O3 (different from the default) leads
> >>> to longer build times.
> >>> 3) The size/compile-time cost of O3 can be too high. On the other
> >>> hand, the benefit of the vectorizer can be very high for many types of
> >>> applications beyond numerical Fortran programs, such as image
> >>> processing, stitching, image detection, DSP, and encoders/decoders.
> >>>
> >>>
> >>> > That said, I expect 99% of software in use (probably rather
> >>> > 99.99999%) is not compiled on the system it runs on but compiled to
> >>> > run on generic hardware, and thus restricts itself to bare x86_64
> >>> > SSE2 features.  So what matters for enabling the vectorizer at -O2
> >>> > is the default feature set of the given architecture(!) - and
> >>> > remember to consider more than just x86 here!
> >>> >
> >>> >> A couple of points I'd like to make:
> >>> >>
> >>> >> 1) The loop vectorizer passes the quality threshold to be turned
> >>> >> on by default at O2 in 4.9; it is already turned on for FDO at O2.
> >>> >
> >>> > With FDO we have a _much_ better way of reasoning about which loops
> >>> > to spend the compile time and code size on!  That is exactly the
> >>> > problem without FDO at -O2 (and also at -O3, but -O3 is not claimed
> >>> > to be well-balanced with regard to compile time and code size).
> >>> >
> >>> >> 2) There is still lots of room for improvement in the loop
> >>> >> vectorizer -- there is no doubt about that, and we will need to
> >>> >> continue improving it.
> >>> >
> >>> > I believe we have to do that first.  See the patches regarding the
> >>> > cost-model reorganization I posted with the proposal for enabling
> >>> > vectorization at -O2.
> >>> > One large source of collateral damage from vectorization is
> >>> > if-conversion, which aggressively if-converts loops regardless of
> >>> > whether we later vectorize the result.
> >>> > The if-conversion pass needs to be integrated with vectorization.
> >>>
> >>> We have noticed some small performance problems with tree
> >>> if-conversion, which is turned on with FDO -- because that pass has no
> >>> cost model (e.g. looking at branch probabilities, as the RTL-level
> >>> if-conversion does). What other problems do you see? Is it just a
> >>> compile-time concern?
> >>>
> >>> >
> >>> >> 3) The only fast way to improve a feature is to get it used widely
> >>> >> so that people can file bugs and report problems -- it is hard for
> >>> >> developers to find and collect all the cases where GCC is weak
> >>> >> without the GCC community's help. There might be temporary
> >>> >> regressions for some users, but it is worth the pain.
> >>> >
> >>> > Well, introducing known regressions at -O2 is not how this works.
> >>> > Vectorization is already widely tested, and you can look at a
> >>> > plethora of bug reports about missed features and vectorizer
> >>> > wrongdoings to improve it.
> >>> >
> >>> >> 4) Not the most important point, but a practical concern: without
> >>> >> turning it on, GCC will be greatly disadvantaged when people start
> >>> >> benchmarking the latest GCC against other compilers.
> >>> >
> >>> > The same argument was made about the fact that GCC does not
> >>> > optimize by default but uses -O0.  It's a straw-man argument.  All
> >>> > "benchmarking" I see uses -O3 or -Ofast already.
> >>>
> >>> People can still just do -O2 performance comparisons.
> >>>
> >>> thanks,
> >>>
> >>> David
> >>>
> >>> > To make vectorization have a bigger impact on day-to-day software,
> >>> > GCC would need to start versioning for the target sub-architecture
> >>> > - which of course compounds the code-size and compile-time issues.
> >>> >
> >>> > Richard.
> >>> >
> >>> >> thanks,
> >>> >>
> >>> >> David
> >>> >>
> >>> >>
> >>> >>
> >>> >>> Richard.
> >>> >>>
> >>> >>>>thanks,
> >>> >>>>
> >>> >>>>David
> >>> >>>>
> >>> >>>>
> >>> >>>>>
> >>> >>>>> Richard.
> >>> >>>>>
> >>> >>>>>>>
> >>> >>>>>>> Vectorization has great performance potential -- the more
> >>> >>>>>>> people use it, the more likely it is to be further improved --
> >>> >>>>>>> turning it on at O2 is the way to go ...
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> Thank you!
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> Cong Hou
> >>> >>>>>
> >>> >>>>>
> >>> >>>
> >>> >>>
> >>
> >>
> 
