Re: How to specify multiple OSDIRNAME suffixes for multilib (Multilib usage with MPX)?

2013-08-22 Thread Ilya Enkovich
2013/8/21 Joseph S. Myers :
> On Mon, 12 Aug 2013, Terry Guo wrote:
>
>> them to the linker. When there is only one compatible library, the linker
>> can find it by searching all paths, and the whole thing can work. But when
>> there is more than one compatible library spread across different paths, I
>> am not sure it works. You can try it out.
>
> The linker definitely does not support specifying multiple sysroots, to be
> searched one after the other.
>
> Logically, what you want is for all paths for one compatible multilib -
> several different paths, both inside and outside any sysroot - to be
> searched before any paths from the next multilib are searched.  Not (paths
> of one type for all compatible multilibs, then paths of the next type for
> all compatible multilibs).

That is exactly what I've tried to implement: use one suffix for all
paths, then another suffix, etc., with the default paths at the end.  It
seems to work OK, but more testing is required.  And I suppose it will
also require additional support in the dynamic linker.

Ilya

>
> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: Propose moving vectorization from -O3 to -O2.

2013-08-22 Thread Ondřej Bílka
On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
> > The effect on runtime is not correlated to
> > either (which means the vectorizer cost model is rather bad), but integer
> > code usually does not benefit at all.
> 
> The cost model does need some tuning. For instance, the GCC vectorizer
> does peeling aggressively, but peeling can in many cases be avoided
> while still getting good performance -- even when the target does not
> have efficient unaligned load/store to implement unaligned accesses. GCC
> reports too high a cost for unaligned accesses and too low a cost for
> peeling overhead.
>
Another issue is that gcc generates very inefficient loop headers. If I
change the example to call foo with the following line

foo(a+rand()%1, b+rand()%1, c+rand()%1, rand()%64);

then I get a vectorizer regression of
gcc-4.7 -O3 x.c -o xa
versus
gcc-4.7 -O2 -funroll-all-loops x.c -o xb

> Example:
> 
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
> 
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE* b, TYPE *c, int n)
> {
>int i;
>for ( i = 0; i < n; i++)
>  a[i] = b[i] * c[i];
> }
> 
> int g;
> int
> main()
> {
>int i;
>float *a = (float*) malloc (10*4);
>float *b = (float*) malloc (10*4);
>float *c = (float*) malloc (10*4);
> 
>for (i = 0; i < 10; i++)
>   foo(a, b, c, 10);
> 
> 
>g = a[10];
> 
> }
> 
> 
> 1) by default, GCC's vectorizer will peel the loop in foo, so that the
> access to 'a' is aligned and uses the movaps instruction. The other
> accesses use movups when -march=corei7 is used
> 2) Same as above, but with -march=x86_64. The access to b is split into
> 'movlps and movhps', same for 'c'
> 
> 3) Disabling peeling (via a hack) with -march=corei7 --- all three
> accesses are using movups
> 4) Disabling peeling, with -march=x86-64 -- all three accesses are
> using movlps/movhps
> 
> Performance:
> 
> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
> 1462 bytes, and 1)'s is 1622 bytes
> 2) and 4) and no vectorize -- all very slow -- 4.8s
> 
This could be explained by the lack of unrolling. When unrolling is
enabled, the slowdown is only 20% over the SSE variant.
 
> > That said, I expect 99% of used software
> > (probably rather 99,9%) is not compiled on the system it runs on but
> > compiled to run on generic hardware and thus restricts itself to bare x86_64
> > SSE2 features.  So what matters for enabling the vectorizer at -O2 is the
> > default architecture features of the given architecture(!) - remember
> > to not only
> > consider x86 here!
> >
This is a non-issue, as SSE2 already contains most of the operations
needed; the performance improvement from the later SSE extensions is
minimal.

A performance improvement over SSE2 could come from AVX/AVX2, but the
vectorizer's AVX support is still severely lacking.

> > The same argument was made about the fact that GCC does not optimize by
> > default but uses -O0.  It's a straw-man argument.  All "benchmarking" I
> > see uses -O3 or -Ofast already.
> 
> People can just do -O2 performance comparison.
>
When machines spend 95% of their time in code compiled by gcc -O2, then
benchmarking should be done at -O2.
With any other flags you will just get a bunch of numbers that have
little relation to real performance.




Re: PoC: Function Pointer Protection in C Programs

2013-08-22 Thread Ondřej Bílka
On Wed, Aug 21, 2013 at 07:04:58PM +0200, Stephen Röttger wrote:
> 
> > What is the performance impact for a program that just qsorts a big
> > array? It looks like a worst-case scenario to me.
> 
> I just put together a quick test program that sorts an array of 10^6
> integers and measured the execution time using "time". The results are as
> follows (+- 0,01s):
> 
> protection disabled, -O0:
> ./sort_nofpp_0  0,19s user 0,02s system 98% cpu 0,215 total
> 
> protection enabled, -O0
> ./sort_fpp_0  0,54s user 0,01s system 99% cpu 0,549 total
> 
> protection disabled, -O3
> ./sort_nofpp_3  0,15s user 0,01s system 98% cpu 0,157 total
> 
> protection enabled, -O3
> ./sort_fpp_3  0,51s user 0,00s system 99% cpu 0,511 total
> 
> So this makes quite a difference:
> 0,19s -> 0,54s
> 0,15s -> 0,51s
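
For reference, a minimal benchmark of that shape (a sketch assuming a
plain qsort of random ints with a trivial comparator -- not necessarily
the exact program used) is:

#include <stdio.h>
#include <stdlib.h>

/* One indirect call per comparison, so any per-call pointer check is
   paid O(n log n) times -- which is what makes qsort a worst case.  */
static int
cmp_int (const void *a, const void *b)
{
  int x = *(const int *) a, y = *(const int *) b;
  return (x > y) - (x < y);
}

int
main (void)
{
  enum { N = 1000000 };
  int *v = (int *) malloc (N * sizeof *v);
  for (int i = 0; i < N; i++)
    v[i] = rand ();
  qsort (v, N, sizeof *v, cmp_int);
  printf ("%d\n", v[0]);   /* keep the result observable */
  free (v);
  return 0;
}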

After a bit of thought, loops with callbacks could be optimized by gcc.

It could be possible to teach CSE to rewrite

while(foo){
 check(p);
 (*p)(x,y,z);
}

into 

check(p);
while(foo){
 (*p)(x,y,z);
}



Re: [oss-security] PoC: Function Pointer Protection in C Programs

2013-08-22 Thread Stephen Röttger
> Your approach seems to have some slight similarities with -fvtable-verify:
> 
>
> Maybe some code sharing could be achieved?

Thanks for the hint; this project was actually a big inspiration for my
thesis and is part of my related work, although I made some mistakes in
its description.


Re: PoC: Function Pointer Protection in C Programs

2013-08-22 Thread Stephen Röttger
> After a bit of thought, loops with callbacks could be optimized by gcc.
> 
> It could be possible to teach CSE to rewrite
> 
> while(foo){
>  check(p);
>  (*p)(x,y,z);
> }
> 
> into 
> 
> check(p);
> while(foo){
>  (*p)(x,y,z);
> }
> 

This might introduce security issues if an attacker is able to
overwrite p during the execution of the loop.
For example, if p is part of a dynamically allocated struct that has
already been freed, and an attacker can reallocate the memory after the
first execution of the loop body, he would be able to bypass the check.
On the other hand, if p is stored on the stack, vulnerabilities allowing
it to be overwritten would likely also allow overwriting saved return
addresses.
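
A compilable sketch of the heap case (all names invented; the direct
pointer swap below stands in for the free-plus-reallocate step an
attacker would perform):

#include <stdio.h>
#include <stdlib.h>

static void check (void (*fn) (int)) { (void) fn; /* stand-in check */ }
static void good (int i) { printf ("good %d\n", i); }
static void evil (int i) { printf ("attacker-controlled %d\n", i); }

struct handler { void (*fn) (int); };

int
main (void)
{
  struct handler *h = (struct handler *) malloc (sizeof *h);
  h->fn = good;

  check (h->fn);            /* hoisted out of the loop: sees only 'good' */
  for (int i = 0; i < 3; i++)
    {
      h->fn (i);            /* from the second iteration on, this jumps to
                               'evil' although it was never checked */
      if (i == 0)
        h->fn = evil;       /* models free + attacker reallocation */
    }
  free (h);
  return 0;
}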


Re: XNEW and consorts

2013-08-22 Thread Richard Biener
Gabriel Dos Reis  wrote:
>Hi,
>
>Now that we have transitioned to C++, do we still need to use
>placebo like XNEW and XNEWVEC in GCC source code proper?
>(I am not talking about uses in libiberty.)
>
>Note that XNEW in particular does not work for types with
>non-default constructors.
>
>We introduced these macros so that they take care of casts
>that were required for going from void* to T*.  A new-expression
>automatically gives a typed pointer.

I believe we also use them to dispatch to xmalloc for hosts that cannot use
malloc.  Another issue that is gone for good with C++ - at least if you use
'new'.  Existing uses are already there, so just go ahead.
Support for constructing and destructing GC objects will be another story of 
course.

Richard.

>-- Gaby




Re: XNEW and consorts

2013-08-22 Thread Jakub Jelinek
On Thu, Aug 22, 2013 at 02:19:39PM +0200, Richard Biener wrote:
> Gabriel Dos Reis  wrote:
> >Now that we have transitioned to C++, do we still need to use
> >placebo like XNEW and XNEWVEC in GCC source code proper?
> >(I am not talking about uses in libiberty.)
> >
> >Note that XNEW in particular does not work for types with
> >non-default constructors.
> >
> >We introduced these macros so that they take care of casts
> >that were required for going from void* to T*.  A new-expression
> >automatically gives a typed pointer.
> 
> I believe we also use them to dispatch to xmalloc for hosts that cannot
> use malloc.  Another issue that is gone for good with C++ - at least if
> you use 'new'.  Existing uses already are there, so just go ahead. 
> Support for constructing and destructing GC objects will be another story
> of course.

Do we install a new_handler, though, so that if memory allocation fails it
shows the desired message and exits with the right code, rather than
just aborting?
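
Something like the following sketch, I mean (the diagnostic text and
exit code are placeholders here, not our actual conventions):

#include <new>
#include <cstdio>
#include <cstdlib>

/* Placeholder handler: the real one would go through the normal
   diagnostics machinery and fatal exit code.  */
static void
oom_handler ()
{
  std::fprintf (stderr, "cc1: out of memory\n");
  std::exit (1);
}

int
main ()
{
  /* Install once at startup; operator new then invokes oom_handler on
     allocation failure instead of throwing std::bad_alloc.  */
  std::set_new_handler (oom_handler);
}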

Jakub


Re: XNEW and consorts

2013-08-22 Thread Dodji Seketeli
Richard Biener  wrote:

> Support for constructing and destructing GC objects will be another
> story of course.

Just curious.  Does supporting this take more than just defining new and
delete operators that call ggc_alloc_*/ggc_free?

(OK, that and defining the object walking routines that the GC needs)

Cheers.

-- 
Dodji


[RFC] Offloading Support in libgomp

2013-08-22 Thread Michael V. Zolotukhin
Hi,
We're working on the design for offloading support in GCC (part of OpenMP 4),
and I have a question regarding the libgomp part.

Suppose we expand '#pragma omp target' like we expand '#pragma omp parallel',
i.e. the compiler expands the following code:
  #pragma omp target
  {
body;
  }
to this:
  void subfunction (void *data)
  {
use data;
body;
  }

  setup data;
  function_name = "subfunction";
  GOMP_offload (subfunction, &data, function_name);

GOMP_offload is a call into libgomp, which would be implemented something like this:
  void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
  {
if (gomp_offload_available ())
  {
handler = gomp_upload_data (data);
gomp_offload_call (fname, handler);
gomp_download_data (&data, handler);
  }
else
  {
fn (data);
  }
  }

Routines gomp_upload_data, gomp_offload_call and similar could, for example,
use COI (see
http://download-software.intel.com/sites/default/files/article/334766/intel-xeon-phi-systemssoftwaredevelopersguide_0.pdf)
functions to perform the actual data marshalling and to call routines on the
target side.

Does this generic scheme sound OK to you?

We'd probably want to be able to use the same compiler for different
offload targets, so it's important to decide how we would invoke different
implementations of these routines from the same compiler.  One way to do it is
to use dlopen - i.e. we try to load, say, "libtargetiface.so", and if that
fails, we use some default (dummy) implementations - otherwise we use the
versions from the library.  In this approach, along with libgomp.so we'll need
to have a libtargetiface.so for each target we want to offload to.  Is this way
viable, or should it be done in some other way?
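
A minimal sketch of that lookup ("libtargetiface.so" and the symbol name
are placeholders from this proposal, not an existing interface):

#include <dlfcn.h>

typedef void (*offload_call_fn) (const char *fname, void *data_handle);

/* Dummy used when no target library is present, so offloading silently
   degrades to the host fallback.  */
static void
dummy_offload_call (const char *fname, void *data_handle)
{
  (void) fname;
  (void) data_handle;
}

static offload_call_fn
gomp_resolve_offload_call (void)
{
  void *h = dlopen ("libtargetiface.so", RTLD_LAZY | RTLD_LOCAL);
  if (h)
    {
      void *sym = dlsym (h, "gomp_offload_call");
      if (sym)
        return (offload_call_fn) sym;
    }
  return dummy_offload_call;
}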

--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


Re: XNEW and consorts

2013-08-22 Thread Gabriel Dos Reis
On Thu, Aug 22, 2013 at 7:23 AM, Jakub Jelinek  wrote:
> On Thu, Aug 22, 2013 at 02:19:39PM +0200, Richard Biener wrote:
>> Gabriel Dos Reis  wrote:
>> >Now that we have transitioned to C++, do we still need to use
>> >placebo like XNEW and XNEWVEC in GCC source code proper?
>> >(I am not talking about uses in libiberty.)
>> >
>> >Note that XNEW in particular does not work for types with
>> >non-default constructors.
>> >
>> >We introduced these macros so that they take care of casts
>> >that were required for going from void* to T*.  A new-expression
>> >automatically gives a typed pointer.
>>
>> I believe we also use them to dispatch to xmalloc for hosts that cannot
>> use malloc.  Another issue that is gone for good with C++ - at least if
>> you use 'new'.  Existing uses are already there, so just go ahead.
>> Support for constructing and destructing GC objects will be another story
>> of course.
>
> Do we install a new_handler, though, so that if memory allocation fails it
> shows the desired message and exits with the right code, rather than
> just aborting?

Excellent suggestion.  I don't promise to implement it right
away though (I have my hands full with something else).

-- Gaby


Re: XNEW and consorts

2013-08-22 Thread Gabriel Dos Reis
On Thu, Aug 22, 2013 at 8:51 AM, Dodji Seketeli  wrote:
> Richard Biener  wrote:
>
>> Support for constructing and destructing GC objects will be another
>> story of course.
>
> Just curious.  Does supporting this take more than just defining new and
> delete operators that call ggc_alloc_*/ggc_free?
>
> (OK, that and defining the object walking routines that the GC needs)

A little bit more.  Ideally, we would want placement-new forms
for these, e.g.

new (ggc) T(args);

and then audit all the places we use ggc_alloc, etc.
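
A minimal sketch of the plumbing (the tag type is an assumption, and the
allocator below is a malloc stand-in for the real ggc_alloc_* entry
point):

#include <cstddef>
#include <cstdlib>

// Tag so that 'new (ggc) T (args)' selects the GC placement form.
struct ggc_tag {};
const ggc_tag ggc = {};

// Stand-in for the real GC allocator.
static void *
ggc_raw_alloc (std::size_t size)
{
  return std::malloc (size);
}

void *
operator new (std::size_t size, ggc_tag)
{
  return ggc_raw_alloc (size);
}

// Matching placement delete, called if the constructor throws.
void
operator delete (void *p, ggc_tag)
{
  std::free (p);
}

// usage:  T *p = new (ggc) T (args);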

-- Gaby


Re: [RFC] Offloading Support in libgomp

2013-08-22 Thread Jakub Jelinek
On Thu, Aug 22, 2013 at 06:08:10PM +0400, Michael V. Zolotukhin wrote:
> We're working on the design for offloading support in GCC (part of OpenMP 4),
> and I have a question regarding the libgomp part.
> 
> Suppose we expand '#pragma omp target' like we expand '#pragma omp parallel',
> i.e. the compiler expands the following code:
>   #pragma omp target
>   {
> body;
>   }
> to this:
>   void subfunction (void *data)
>   {
> use data;
> body;
>   }
> 
>   setup data;
>   function_name = "subfunction";
>   GOMP_offload (subfunction, &data, function_name);

Roughly.  We have 3 directives here,
#pragma omp target
#pragma omp target data
#pragma omp target update
and all of them have various clauses, some that are allowed at most once
(e.g. the device clause, if clause) and others that can be used many times
(the data movement clauses).
I'd prefer GOMP_target instead of GOMP_offload for the function name, to
make it clearly related to the corresponding directive.
The question is whether we want to emit multiple calls for a single directive,
say one for each data movement clause (where for each one we need the address,
length, direction and some way to propagate the transformed address
to the accelerator code), or whether we build an array of the data movement
structures and just pass that down to a single routine.  Because of the
device clause, which should probably be passed just as an integer with -1
meaning the default, a single routine might be better.
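
As a sketch, such a data movement descriptor would not need to carry
much (names and layout invented here):

#include <stddef.h>

/* One descriptor per map/to/from clause; the whole array plus the
   device number then goes to a single GOMP_target call.  */
struct gomp_map_item
{
  void *host_addr;      /* host address of the mapped object */
  size_t length;        /* size in bytes */
  unsigned char kind;   /* alloc / to / from / tofrom */
};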

> GOMP_offload is a call into libgomp, which would be implemented something
> like this:
>   void GOMP_offload (void (*fn)(void*), void *data, const char *fname)
>   {
> if (gomp_offload_available ())

This really isn't just a check of whether an accelerator is available; we need
to query all accelerators in the system (and cache that somehow in the
library), assign device numbers to individual devices (say, you could have
two Intel MIC cards, one AMD HSAIL-capable GPGPU and 4 Nvidia PTX-capable
GPGPUs, or similar), ensure that already assigned device numbers aren't
reused when discovering new ones, and then just check which device the user
requested (if not available, fall back to the host), then check whether we
have corresponding code for that accelerator (again, fall back to the host
otherwise), optionally compile the code if not compiled yet (HSAIL/PTX code
only), and finally do the name lookup and spawn it.
Stuff specific to the HW should be in libgomp plugins IMHO, so we have one
dlopenable module for each of the 3 variants, where one fn in the plugin
would check what HW is available, another would try to run the code, etc.
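
Each plugin could then boil down to a small vtable along these lines
(names invented, just to illustrate the shape):

#include <stddef.h>

/* One such descriptor exported by each dlopenable module
   (MIC, HSAIL, PTX).  */
struct gomp_device_plugin
{
  const char *name;                  /* "intelmic", "hsail", "ptx", ... */
  int (*get_num_devices) (void);     /* probe the HW; may return 0 */
  void *(*upload) (int device, const void *host, size_t len);
  void (*download) (int device, void *host, const void *tgt, size_t len);
  int (*run) (int device, const char *fname, void *tgt_data);
};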

Jakub


Re: Propose moving vectorization from -O3 to -O2.

2013-08-22 Thread Xinliang David Li
On Thu, Aug 22, 2013 at 1:24 AM, Ondřej Bílka  wrote:
> On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
>> > The effect on runtime is not correlated to
>> > either (which means the vectorizer cost model is rather bad), but integer
>> > code usually does not benefit at all.
>>
>> The cost model does need some tuning. For instance, the GCC vectorizer
>> does peeling aggressively, but peeling can in many cases be avoided
>> while still getting good performance -- even when the target does not
>> have efficient unaligned load/store to implement unaligned accesses. GCC
>> reports too high a cost for unaligned accesses and too low a cost for
>> peeling overhead.
>>
> Another issue is that gcc generates very inefficient loop headers. If I
> change the example to call foo with the following line
>
> foo(a+rand()%1, b+rand()%1, c+rand()%1, rand()%64);
>
> then I get a vectorizer regression of
> gcc-4.7 -O3 x.c -o xa
> versus
> gcc-4.7 -O2 -funroll-all-loops x.c -o xb
>
>> Example:
>>
>> #ifndef TYPE
>> #define TYPE float
>> #endif
>> #include <stdlib.h>
>>
>> __attribute__((noinline)) void
>> foo (TYPE *a, TYPE* b, TYPE *c, int n)
>> {
>>int i;
>>for ( i = 0; i < n; i++)
>>  a[i] = b[i] * c[i];
>> }
>>
>> int g;
>> int
>> main()
>> {
>>int i;
>>float *a = (float*) malloc (10*4);
>>float *b = (float*) malloc (10*4);
>>float *c = (float*) malloc (10*4);
>>
>>for (i = 0; i < 10; i++)
>>   foo(a, b, c, 10);
>>
>>
>>g = a[10];
>>
>> }
>>


Good test.   I also changed the test case to force the start address to
be misaligned by calling foo with foo(a+1, b+1, c+1, 1); 3)'s
performance drops from 1.5s to 2.5s, but is still much better than 2) and
4).  One correction -- plain O2 is the slowest -- the runtime is about
5.4s.  (All tests use the trunk compiler with -O2 -ftree-vectorize.)

I tried your case with the trunk compiler (-O2 -ftree-vectorize); the
runtimes on a Westmere machine:

1) -march=corei7 : 2.1s
2) -march=x86-64: 4.8s
3) NOPEEL + -march=corei7 : 2.2s
4) NOPEEL + -march=x86-64: 4.8s
5) -O2  : 5.5s
6) -O3 -funroll-all-loops -march=corei7 : 2.2s
7) -O3 -funroll-all-loops -march=x86-64: 4.3s
8) -O2 -funroll-all-loops : 4.6s


With random start-address alignment, 3) is very close to 1) in reality,
so it is the best choice.


>>
>> 1) by default, GCC's vectorizer will peel the loop in foo, so that the
>> access to 'a' is aligned and uses the movaps instruction. The other
>> accesses use movups when -march=corei7 is used
>> 2) Same as above, but with -march=x86_64. The access to b is split into
>> 'movlps and movhps', same for 'c'
>>
>> 3) Disabling peeling (via a hack) with -march=corei7 --- all three
>> accesses are using movups
>> 4) Disabling peeling, with -march=x86-64 -- all three accesses are
>> using movlps/movhps
>>
>> Performance:
>>
>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
>> 1462 bytes, and 1)'s is 1622 bytes
>> 2) and 4) and no vectorize -- all very slow -- 4.8s
>>
> This could be explained by the lack of unrolling. When unrolling is
> enabled, the slowdown is only 20% over the SSE variant.

Standalone unroller tuning is an orthogonal issue here.  Note that we
are shooting for the best possible vectorizer performance (to be
turned on at O2) under strict size/compile-time increase constraints.

>
>> > That said, I expect 99% of used software
>> > (probably rather 99,9%) is not compiled on the system it runs on but
>> > compiled to run on generic hardware and thus restricts itself to bare 
>> > x86_64
>> > SSE2 features.  So what matters for enabling the vectorizer at -O2 is the
>> > default architecture features of the given architecture(!) - remember
>> > to not only
>> > consider x86 here!
>> >
> This is a non-issue, as SSE2 already contains most of the operations
> needed; the performance improvement from the later SSE extensions is
> minimal.
>
> A performance improvement over SSE2 could come from AVX/AVX2, but the
> vectorizer's AVX support is still severely lacking.
>
>> > The same argument was made about the fact that GCC does not optimize by
>> > default but uses -O0.  It's a straw-man argument.  All "benchmarking" I
>> > see uses -O3 or -Ofast already.
>>
>> People can just do -O2 performance comparison.
>>
> When machines spend 95% of their time in code compiled by gcc -O2, then
> benchmarking should be done at -O2.
> With any other flags you will just get a bunch of numbers that have
> little relation to real performance.

yes.


thanks,

David
>
>


Vandalised wiki page

2013-08-22 Thread Alec Teal

http://gcc.gnu.org/wiki/FunctionMultiVersioning

Reported by "kobrien" on the Freenode IRC network, channel #gcc, just
now; I'm just passing the message along.


Alec



gcc-4.8-20130822 is now available

2013-08-22 Thread gccadmin
Snapshot gcc-4.8-20130822 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20130822/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch 
revision 201929

You'll find:

 gcc-4.8-20130822.tar.bz2 Complete GCC

  MD5=be880b91947e9750fdbc434e77ed0289
  SHA1=dfd2d0c9a872b846377628b5ff80aa0b2265c7a9

Diffs from 4.8-20130815 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Why out-of-ssa does var coalescing based on name?

2013-08-22 Thread Wei Mi
(Sorry if you received this mail twice. The first one was rejected
because it was not in plain-text mode.)

For the following case:

float total = 0.2;

int main() {
 int i;

 for (i = 0; i < 10; i++) {
   total += i;
 }

 return total == 0.3;
}

The gcc assembly of its kernel loop is:

.L3:
    movaps    %xmm0, %xmm1
.L2:
    cvtsi2ss  %eax, %xmm0
    addl      $1, %eax
    cmpl      $10, %eax
    addss     %xmm1, %xmm0
    jne       .L3

The movaps is redundant; the loop could be changed to:

.L3:
    cvtsi2ss  %eax, %xmm1
    addl      $1, %eax
    cmpl      $10, %eax
    addss     %xmm1, %xmm0
    jne       .L3

Manually removing the extra movaps improves performance from 1.26s to
0.95s on sandybridge using trunk (r201859).

Load PRE tries to promote the MEM op on total out of the loop; it
generates a new PHI at the start of the loop body:

  <bb 2>:
  pretmp_22 = total;
  goto <bb 4>;

  <bb 3>:

  <bb 4>:
  # i_15 = PHI <0(2), i_8(3)>
  # prephitmp_23 = PHI <pretmp_22(2), total.1_6(3)>   ==> PHI generated.
  _4 = (float) i_15;
  total.0_5 = prephitmp_23;
  total.1_6 = _4 + total.0_5;
  total = total.1_6;
  i_8 = i_15 + 1;
  if (i_8 != 10)
    goto <bb 3>;
  else
    goto <bb 5>;

The out-of-ssa phase should have coalesced prephitmp_23 and total.1_6(3)
into the same temp var, but the existing out-of-ssa has a limitation: it
will not coalesce ssa variables with different base var names, even if
they are in the same PHI and their live ranges don't conflict. So
out-of-ssa inserts the redundant mov pretmp = total.1_6 in bb3.

  <bb 2>:
  pretmp = total;
  goto <bb 4>;

  <bb 3>:
  pretmp = total.1_6;   ==> inserted by out-of-ssa.

  <bb 4>:
  _4 = (float) i_15;
  total.1_6 = _4 + pretmp;
  i_8 = i_15 + 1;
  if (i_8 != 10)
    goto <bb 3>;
  else
    goto <bb 5>;

The IRA phase has the potential to allocate pretmp and total.1_6 to the
same hardreg and remove the extra mov, but for the above case the regmove
phase happens to block IRA from doing the cleanup. regmove guesses the
register constraints of an insn and tries to change the insn to satisfy
the constraints before the IRA phase. Usually this helps IRA make a
better decision, but here regmove decides to merge _4 and total.1_6
into total.1_6 in order to satisfy the constraint of the two-operand
add on x86 (addss xmm1, xmm2). After _4 and total.1_6 are merged, the
live range of total.1_6 conflicts with that of pretmp in IRA, so they
cannot be allocated to the same hardreg, and the redundant mov (pretmp
= total.1_6) couldn't be deleted. However, it is not trivial to make
regmove choose to merge total.1_6 and pretmp, because that requires
regmove to have global live range analysis (existing regmove has only a
simple correctness check limited to a single bb).

If we use -mtune=corei7-avx, then the redundant mov disappears. That is
because with AVX support, regmove knows AVX provides a three-operand
add: vaddss xmm1, xmm2, xmm3/m32, so it will not merge
total.1_6 and _4, and then IRA can allocate total.1_6 and pretmp to the
same hardreg.

If we change the type of total from float to int, then the redundant
mov also disappears, for a similar reason: x86 provides the LEA insn,
which can be used as an add and can take three operands, so regmove
chooses not to merge total.1_6 and _4.

My question is: why can't out-of-ssa do the cleanup by coalescing all
the non-conflicting vars in the same PHI stmt, instead of only
coalescing the vars with the same base name?

Thanks,
Wei Mi.


Re: Propose moving vectorization from -O3 to -O2.

2013-08-22 Thread Xinliang David Li
Interesting idea!

David

On Thu, Aug 22, 2013 at 4:46 PM, Cong Hou  wrote:
> Another opportunity to reduce the code size is combining the scalar version
> from loop versioning with the prolog and the epilog of loop peeling. I
> manually wrote the following function for foo(). The running time does not
> change (for corei7, since I use _mm_loadu_ps()) but the text size (for the
> function only) drops from 342 to 240 bytes (41 for the non-vectorized
> version). We can get more benefit if the loop body is larger.
>
>
> #include <xmmintrin.h>
>
> void foo2 (TYPE *a, TYPE *b, TYPE *c, int n)
> {
>   int i, m;
>   __m128 veca, vecb, vecc;
>
>   i = 0;
>
>   /* Versioning check: vectorize only if a cannot overlap b or c.  */
>   if ((b >= a+4 | b+4 <= a) &
>       (c >= a+4 | c+4 <= a))
>   {
>     /* Prolog: scalar iterations needed to 16-byte-align a
>        (assuming a is at least 4-byte aligned).  */
>     m = (4 - (((unsigned long)a >> 2) & 3)) & 3;
>     if (m > n)
>       m = n;
>     goto L2;
>
> L1:
>     for (; i + 4 <= n; i += 4)
>     {
>       vecb = _mm_loadu_ps(b+i);
>       vecc = _mm_loadu_ps(c+i);
>       veca = _mm_mul_ps(vecb, vecc);
>       _mm_store_ps(a+i, veca);
>     }
>     m = n;        /* Epilog: remainder goes to the scalar loop.  */
>   }
>   else
>     m = n;        /* Possible aliasing: run everything scalar.  */
>
> L2:
>   /* One scalar loop serves as versioned copy, prolog and epilog.  */
>   for (; i < m; i++)
>     a[i] = b[i] * c[i];
>   if (i < n)
>     goto L1;
> }
>
>
>
> thanks,
>
> Cong
>
>
> On Wed, Aug 21, 2013 at 11:50 PM, Xinliang David Li 
> wrote:
>>
>> > The effect on runtime is not correlated to
>> > either (which means the vectorizer cost model is rather bad), but
>> > integer
>> > code usually does not benefit at all.
>>
>> The cost model does need some tuning. For instance, the GCC vectorizer
>> does peeling aggressively, but peeling can in many cases be avoided
>> while still getting good performance -- even when the target does not
>> have efficient unaligned load/store to implement unaligned accesses. GCC
>> reports too high a cost for unaligned accesses and too low a cost for
>> peeling overhead.
>>
>> Example:
>>
>> #ifndef TYPE
>> #define TYPE float
>> #endif
>> #include <stdlib.h>
>>
>> __attribute__((noinline)) void
>> foo (TYPE *a, TYPE* b, TYPE *c, int n)
>> {
>>int i;
>>for ( i = 0; i < n; i++)
>>  a[i] = b[i] * c[i];
>> }
>>
>> int g;
>> int
>> main()
>> {
>>int i;
>>float *a = (float*) malloc (10*4);
>>float *b = (float*) malloc (10*4);
>>float *c = (float*) malloc (10*4);
>>
>>for (i = 0; i < 10; i++)
>>   foo(a, b, c, 10);
>>
>>
>>g = a[10];
>>
>> }
>>
>>
>> 1) by default, GCC's vectorizer will peel the loop in foo, so that the
>> access to 'a' is aligned and uses the movaps instruction. The other
>> accesses use movups when -march=corei7 is used
>> 2) Same as above, but with -march=x86_64. The access to b is split into
>> 'movlps and movhps', same for 'c'
>>
>> 3) Disabling peeling (via a hack) with -march=corei7 --- all three
>> accesses are using movups
>> 4) Disabling peeling, with -march=x86-64 -- all three accesses are
>> using movlps/movhps
>>
>> Performance:
>>
>> 1) and 3) -- both 1.58s, but 3) is much smaller than 1): 3)'s text is
>> 1462 bytes, and 1)'s is 1622 bytes
>> 2) and 4) and no vectorize -- all very slow -- 4.8s
>>
>> Observations:
>> a) if properly tuned for corei7, 3) should be picked by GCC instead
>> of 1) -- this is not possible today
>> b) with -march=x86_64, GCC should figure out that the benefit of
>> vectorizing the loop is small and bail out
>>
>> >> On the other hand, 10% compile time increase due to one pass sounds
>> >> excessive -- there might be some low hanging fruit to reduce the
>> >> compile time increase.
>> >
>> > I have already spent two man-months speeding up the vectorizer itself;
>> > I don't think there is any low-hanging fruit left there.  But see above -
>> > most of the compile-time is due to the cost of processing the extra loop
>> > copies.
>> >
>>
>> Ok.
>>
>> I did not notice your patch (from May this year) until recently. Do you
>> plan to check it in (other than the part that turns it on at O2)?  The
>> cost model part of the changes is largely independent. If it is in, it
>> will serve as a good basis for further tuning.
>>
>>
>> >> At the full feature set, vectorization regresses the runtime of quite a
>> >> number of benchmarks significantly. At a reduced feature set - basically
>> >> trying to vectorize only obviously profitable cases - these regressions
>> >> can be avoided, but progressions only remain on two SPEC fp cases. As
>> >> most user applications fall into the SPEC int category, a 10%
>> >> compile-time and 15% code-size regression for no gain is no good.
>> >>>
>> >>
>> Cong's data (especially corei7 and corei7avx) shows more significant
>> performance improvement.  If the 10% compile-time increase is across the
>> board and happens on benchmarks with no performance improvement, it is
>> certainly bad - but I am not sure that is the case.
>> >
>> > Note that we are talking about -O2 - people who enable -march=corei7
>> > usually know to use -O3 or FDO anyway.
>>
>> Many people use FDO, but not all -- there are still some barriers to
>> adoption. There are reasons people may not want to use O3:
>> 1) people feel most comfortable using O2 because it is considered the
>> most thoroughly tested compiler optimization level