Re: suspect code in fold-const.c

2013-11-15 Thread Eric Botcazou
> this code from fold-const.c starts on line 13811.
> 
>  else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
>   && TREE_INT_CST_LOW (arg1) == signed_max_lo
>   && TYPE_UNSIGNED (arg1_type)
>   /* We will flip the signedness of the comparison operator
>  associated with the mode of arg1, so the sign bit is
>  specified by this mode.  Check that arg1 is the signed
>  max associated with this sign bit.  */
>   && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
>   /* signed_type does not work on pointer types.  */
>   && INTEGRAL_TYPE_P (arg1_type))

with width defined as:

unsigned int width = TYPE_PRECISION (arg1_type);

> it seems that the check on bitsize should really be a check on the
> precision of the variable.   If this seems right, i will correct this on
> the trunk and make the appropriate changes to the wide-int branch.

Do you mean

  && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))

instead?  If so, that would probably make sense, but there are a few other 
places with the same TYPE_PRECISION/GET_MODE_BITSIZE check, in particular the 
very similar transformation done in fold_single_bit_test_into_sign_test.

-- 
Eric Botcazou


Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Bingfeng Mei
Hi,
In loop vectorization, I found that the vectorizer insists on loop peeling even 
though our target supports misaligned memory access. This results in much bigger 
code size for a very simple loop. I defined TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT 
and also TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned 
accesses almost as cheap as aligned ones. But the vectorizer still does 
peeling anyway.

In the vect_enhance_data_refs_alignment function, it seems that the result of 
vect_supportable_dr_alignment is not used in the decision of whether to do peeling. 

  supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
  do_peeling = vector_alignment_reachable_p (dr);

Later on, there is code to compare load/store costs. But it only decides 
whether to peel for loads or for stores, not whether to peel at all.

Currently I have a workaround. For the following simple loop, the size is 
80 bytes vs. 352 bytes without the patch (-O2 -ftree-vectorize, gcc 4.8.3 20131114):
int A[100];
int B[100];
void foo2() {
  int i;
  for (i = 0; i < 100; ++i)
    A[i] = B[i] + 100;
}

What is the best way to tell the vectorizer not to peel in such a situation? 


Thanks,
Bingfeng Mei
Broadcom UK



Re: suspect code in fold-const.c

2013-11-15 Thread Kenneth Zadeck

On 11/15/2013 04:07 AM, Eric Botcazou wrote:

this code from fold-const.c starts on line 13811.

  else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
   && TREE_INT_CST_LOW (arg1) == signed_max_lo
   && TYPE_UNSIGNED (arg1_type)
   /* We will flip the signedness of the comparison operator
  associated with the mode of arg1, so the sign bit is
  specified by this mode.  Check that arg1 is the signed
  max associated with this sign bit.  */
   && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
   /* signed_type does not work on pointer types.  */
   && INTEGRAL_TYPE_P (arg1_type))

with width defined as:

unsigned int width = TYPE_PRECISION (arg1_type);


it seems that the check on bitsize should really be a check on the
precision of the variable.   If this seems right, i will correct this on
the trunk and make the appropriate changes to the wide-int branch.

Do you mean

   && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))

instead?  If so, that would probably make sense, but there are a few other
places with the same TYPE_PRECISION/GET_MODE_BITSIZE check, in particular the
very similar transformation done in fold_single_bit_test_into_sign_test.

Yes.  I understand the need to do this check on the mode rather than on the 
precision of the type itself.
The point is that if the mode under this type happens to be a partial 
int mode, then the sign bit may not even be where the bitsize points.
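As an illustration of the partial-int-mode point, here is a small standalone C sketch (not from the original mail) using a hypothetical 24-bit mode stored in 32 bits; the signed maximum implied by the precision differs from the one the bitsize would suggest, which is why the check has to use the precision:

#include <stdio.h>

int
main (void)
{
  /* Hypothetical partial-int mode: 24 significant bits held in 32 bits of
     storage, i.e. GET_MODE_PRECISION would be 24 and GET_MODE_BITSIZE 32.  */
  unsigned int precision = 24;
  unsigned int bitsize = 32;

  unsigned long max_by_precision = (1UL << (precision - 1)) - 1; /* 0x7fffff */
  unsigned long max_by_bitsize = (1UL << (bitsize - 1)) - 1;     /* 0x7fffffff */

  printf ("signed max by precision: %#lx\n", max_by_precision);
  printf ("signed max by bitsize:   %#lx\n", max_by_bitsize);
  return 0;
}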


However, having just done a few greps, it looks like this case was just 
the one that I found while doing the wide-int work; there may be several 
more of these cases.   Just in fold-const.c, there are a couple in 
fold_binary_loc.   The one in tree.c:int_fits_type_p looks particularly 
wrong.


I think that there are also several in tree-vect-patterns.c.

Kenny


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Richard Biener
On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
> Hi,
> In loop vectorization, I found that vectorizer insists on loop peeling even 
> our target supports misaligned memory access. This results in much bigger 
> code size for a very simple loop. I defined 
> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
> almost as cheap as an aligned one. But the vectorizer still does peeling 
> anyway.
>
> In vect_enhance_data_refs_alignment function, it seems that result of 
> vect_supportable_dr_alignment is not used in decision of whether to do 
> peeling.
>
>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>   do_peeling = vector_alignment_reachable_p (dr);
>
> Later on, there is code to compare load/store costs. But it only decides 
> whether to do peeling for load or store, not whether to do peeling.
>
> Currently I have a workaround. For the following simple loop, the size is 
> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 20131114)

What's the speed difference?

> int A[100];
> int B[100];
> void foo2() {
>   int i;
>   for (i = 0; i < 100; ++i)
> A[i] = B[i] + 100;
> }
>
> What is the best way to tell vectorizer not to do peeling in such situation?

Well, the vectorizer should compute the cost without peeling and then,
when the cost with peeling is not better, not peel.  That's
very easy to check with the vectorization_cost hook by comparing
vector_load / unaligned_load and vector_store / unaligned_store costs.
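As an illustration of that comparison, here is a minimal sketch (not from the original mail) of a TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook that reports unaligned accesses as being as cheap as aligned ones; the function name and the cost numbers are made up for the example, and the code belongs in a target backend source file:

/* Sketch of a target cost hook under which peeling for alignment buys
   nothing: unaligned vector accesses cost the same as aligned ones.  */
static int
example_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                    tree vectype ATTRIBUTE_UNUSED,
                                    int misalign ATTRIBUTE_UNUSED)
{
  switch (type_of_cost)
    {
    case vector_load:
    case vector_store:
    case unaligned_load:   /* same as the aligned variants */
    case unaligned_store:
      return 1;
    case cond_branch_taken:
      return 3;            /* peeling adds branches; they are not free */
    default:
      return 1;
    }
}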

Richard.

>
> Thanks,
> Bingfeng Mei
> Broadcom UK
>


RE: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Bingfeng Mei
Hi, Richard,
The speed difference is 154 cycles (with the workaround) vs. 198 cycles, so loop 
peeling is also slower for our processors.

By vectorization_cost, do you mean the TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST 
hook? 

In our case, it is easy to make the decision. But in general, if the peeled loop is 
faster but bigger, what should the right balance be? What should be done with cases 
that are a bit faster and a lot bigger?

Thanks,
Bingfeng
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: 15 November 2013 14:02
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
> Hi,
> In loop vectorization, I found that vectorizer insists on loop peeling even 
> our target supports misaligned memory access. This results in much bigger 
> code size for a very simple loop. I defined 
> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
> almost as cheap as an aligned one. But the vectorizer still does peeling 
> anyway.
>
> In vect_enhance_data_refs_alignment function, it seems that result of 
> vect_supportable_dr_alignment is not used in decision of whether to do 
> peeling.
>
>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>   do_peeling = vector_alignment_reachable_p (dr);
>
> Later on, there is code to compare load/store costs. But it only decides 
> whether to do peeling for load or store, not whether to do peeling.
>
> Currently I have a workaround. For the following simple loop, the size is 
> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 20131114)

What's the speed difference?

> int A[100];
> int B[100];
> void foo2() {
>   int i;
>   for (i = 0; i < 100; ++i)
> A[i] = B[i] + 100;
> }
>
> What is the best way to tell vectorizer not to do peeling in such situation?

Well, the vectorizer should compute the cost without peeling and then,
when the cost with peeling is not better then do not peel.  That's
very easy to check with the vectorization_cost hook by comparing
vector_load / unaligned_load and vector_store / unaligned_store cost.

Richard.

>
> Thanks,
> Bingfeng Mei
> Broadcom UK
>




Re: [RFC] Target compilation for offloading

2013-11-15 Thread Andrey Turetskiy
Let's suppose we are going to run the target gcc driver from lto-wrapper.
How could a list of offload targets be passed there from the option
parser?
In my opinion, the simplest way to do it is to use an environment
variable. Would you agree with such an approach?
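A minimal standalone sketch of that environment-variable idea (not from the original mail); the variable name OFFLOAD_TARGET_NAMES and the colon-separated list syntax are assumptions made only for the illustration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pretend to be the piece of lto-wrapper that decides which offload
   target compilers to invoke, based on an environment variable set by
   the option parser.  */
int
main (void)
{
  const char *names = getenv ("OFFLOAD_TARGET_NAMES");  /* hypothetical name */
  if (names == NULL)
    return 0;                       /* no offload targets requested */

  char *copy = strdup (names);
  for (char *p = strtok (copy, ":"); p != NULL; p = strtok (NULL, ":"))
    printf ("would invoke the %s offload compiler here\n", p);

  free (copy);
  return 0;
}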

On Fri, Nov 8, 2013 at 6:34 PM, Jakub Jelinek  wrote:
> On Fri, Nov 08, 2013 at 06:26:53PM +0400, Andrey Turetskiy wrote:
>> Thanks.
>> And a few questions about compiler options:
>> 1) You've mentioned two options for offloading:
>> -foffload-target=<target list> - to specify targets for offloading
>> -foffload-target-<target name>=<options> - to specify
>> compiler options for different targets
>> Do we really need two options to set up offloading?
>> What do you think about this, in my opinion, more compact way:
>> -foffload-<target name> - if I want to offload for 'target name',
>> but I don't want to specify any options
>> -foffload-<target name>=<options> - enable offloading for
>> 'target name' and set options
>> And compilation for several targets would look like:
>> gcc -fopenmp -foffload-mic="-O3 -msse -m64" -foffload-ptx
>> -foffload-hsail="-O2 -m32" file.c
>
> I don't think it is a good idea to include the target name before =
> in the name of the option, but perhaps you can use two =s:
> -foffload-target=x86_64-k1om-linux="-O2 -mtune=foobar" 
> -foffload-target=ptx-none
>
>> 2) If the user doesn't specify target options directly, is target
>> compilation done without any options, or does the compiler use those host
>> options which are suitable for the target?
>
> I think I've said that earlier: non-target-specific options from the original
> compilation should be copied over, target-specific options discarded,
> and the command-line-supplied overrides appended to that.
>
>> 3) Do I understand right that options for different targets should be
>> stored in different sections of the fat object file, and then the lto frontend
>> should read these options and run the target compilation with them?
>
> No, I'd store in the LTO target IL only the original host compilation
> options that weren't target specific (opt* has some flags saying what is target
> specific and what is not), so say -O2 -ftree-vrp would go there,
> but say -march=corei7-avx would not.  And the -foffload-target= options
> would only matter during linking.
>
> Jakub



-- 
Best regards,
Andrey Turetskiy


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Hendrik Greving
Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). Bottom line is that it's much better to
peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
line boundaries otherwise. The solution is to either actually always
peel for alignment, or insert an additional check for cache line
boundaries (for high trip count loops).
- Hendrik

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei  wrote:
> Hi, Richard,
> Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
> peeling is also slower for our processors.
>
> By vectorization_cost, do you mean 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>
> In our case, it is easy to make decision. But generally, if peeling loop is 
> faster but bigger, what should be right balance? How to do with cases that 
> are a bit faster and a lot bigger?
>
> Thanks,
> Bingfeng
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 15 November 2013 14:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
>> Hi,
>> In loop vectorization, I found that vectorizer insists on loop peeling even 
>> our target supports misaligned memory access. This results in much bigger 
>> code size for a very simple loop. I defined 
>> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
>> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
>> almost as cheap as an aligned one. But the vectorizer still does peeling 
>> anyway.
>>
>> In vect_enhance_data_refs_alignment function, it seems that result of 
>> vect_supportable_dr_alignment is not used in decision of whether to do 
>> peeling.
>>
>>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>   do_peeling = vector_alignment_reachable_p (dr);
>>
>> Later on, there is code to compare load/store costs. But it only decides 
>> whether to do peeling for load or store, not whether to do peeling.
>>
>> Currently I have a workaround. For the following simple loop, the size is 
>> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 20131114)
>
> What's the speed difference?
>
>> int A[100];
>> int B[100];
>> void foo2() {
>>   int i;
>>   for (i = 0; i < 100; ++i)
>> A[i] = B[i] + 100;
>> }
>>
>> What is the best way to tell vectorizer not to do peeling in such situation?
>
> Well, the vectorizer should compute the cost without peeling and then,
> when the cost with peeling is not better then do not peel.  That's
> very easy to check with the vectorization_cost hook by comparing
> vector_load / unaligned_load and vector_store / unaligned_store cost.
>
> Richard.
>
>>
>> Thanks,
>> Bingfeng Mei
>> Broadcom UK
>>
>
>


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Xinliang David Li
The right longer-term fix is the one suggested by Richard. For now you can
probably override the peeling parameter for your target (in the target's
option_override function).

 maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
                        0, opts->x_param_values, opts_set->x_param_values);
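For context, a sketch (not from the original mail) of where such a call would typically live, i.e. in a target's option-override hook; the function name is made up, and global_options/global_options_set stand in for the opts/opts_set parameters above:

/* Sketch of a target option-override hook that turns off peeling for
   alignment, assuming unaligned vector accesses are cheap on the target.  */
static void
example_option_override (void)
{
  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT, 0,
                         global_options.x_param_values,
                         global_options_set.x_param_values);
}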

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei  wrote:
> Hi, Richard,
> Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
> peeling is also slower for our processors.
>
> By vectorization_cost, do you mean 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>
> In our case, it is easy to make decision. But generally, if peeling loop is 
> faster but bigger, what should be right balance? How to do with cases that 
> are a bit faster and a lot bigger?
>
> Thanks,
> Bingfeng
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 15 November 2013 14:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
>> Hi,
>> In loop vectorization, I found that vectorizer insists on loop peeling even 
>> our target supports misaligned memory access. This results in much bigger 
>> code size for a very simple loop. I defined 
>> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
>> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
>> almost as cheap as an aligned one. But the vectorizer still does peeling 
>> anyway.
>>
>> In vect_enhance_data_refs_alignment function, it seems that result of 
>> vect_supportable_dr_alignment is not used in decision of whether to do 
>> peeling.
>>
>>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>   do_peeling = vector_alignment_reachable_p (dr);
>>
>> Later on, there is code to compare load/store costs. But it only decides 
>> whether to do peeling for load or store, not whether to do peeling.
>>
>> Currently I have a workaround. For the following simple loop, the size is 
>> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 20131114)
>
> What's the speed difference?
>
>> int A[100];
>> int B[100];
>> void foo2() {
>>   int i;
>>   for (i = 0; i < 100; ++i)
>> A[i] = B[i] + 100;
>> }
>>
>> What is the best way to tell vectorizer not to do peeling in such situation?
>
> Well, the vectorizer should compute the cost without peeling and then,
> when the cost with peeling is not better then do not peel.  That's
> very easy to check with the vectorization_cost hook by comparing
> vector_load / unaligned_load and vector_store / unaligned_store cost.
>
> Richard.
>
>>
>> Thanks,
>> Bingfeng Mei
>> Broadcom UK
>>
>
>


Frame pointer, bug or feature? (x86)

2013-11-15 Thread Hendrik Greving
In the test case below, "CASE_A" actually uses a frame pointer, while
!CASE_A doesn't. I can't imagine this is a feature; this is a bug,
isn't it? Is there any reason the compiler couldn't know that
loop_blocks never needs a dynamic stack size?

#include <stdio.h>
#include <stdlib.h>

#define MY_DEFINE 100
#define CASE_A 1

extern init(int (*a)[]);

int
foo()
{
#if CASE_A
  const int max = MY_DEFINE * 2;
  int loop_blocks[max];
#else
  int loop_blocks[MY_DEFINE * 2];
#endif
  init(&loop_blocks);
  return loop_blocks[5];
}

int
main()
{
  int i = foo();
  printf("is is %d\n", i);
}

Thanks,
Hendrik Greving


RE: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Bingfeng Mei
Thanks for the suggestion. It seems that the parameter is only available in HEAD, 
not in 4.8. I will backport it to 4.8.

However, implementing a good cost model seems quite tricky to me. There are 
conflicting requirements for different processors. For us and many embedded 
processors, a 4x size increase is unacceptable. But for many desktop 
processors/applications, I guess it is worth trading significant size for some 
performance improvement. I am not sure the existing 
TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to the task. Maybe an extra 
target hook or parameter should be provided to make such a tradeoff.

Additionally, it seems hard to accurately estimate the costs. As Hendrik 
pointed out, misaligned access will affect cache performance on some 
processors. But for our processor, it is OK. Maybe just passing a high cost for 
misaligned accesses on such a processor is sufficient to guarantee that loop 
peeling is generated. 

Bingfeng


-Original Message-
From: Xinliang David Li [mailto:davi...@google.com] 
Sent: 15 November 2013 17:30
To: Bingfeng Mei
Cc: Richard Biener; gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

The right longer term fix is suggested by Richard. For now you can
probably override the peel parameter for your target (in the target
option_override function).

 maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
0, opts->x_param_values, opts_set->x_param_values);

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei  wrote:
> Hi, Richard,
> Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
> peeling is also slower for our processors.
>
> By vectorization_cost, do you mean 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>
> In our case, it is easy to make decision. But generally, if peeling loop is 
> faster but bigger, what should be right balance? How to do with cases that 
> are a bit faster and a lot bigger?
>
> Thanks,
> Bingfeng
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 15 November 2013 14:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
>> Hi,
>> In loop vectorization, I found that vectorizer insists on loop peeling even 
>> our target supports misaligned memory access. This results in much bigger 
>> code size for a very simple loop. I defined 
>> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
>> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
>> almost as cheap as an aligned one. But the vectorizer still does peeling 
>> anyway.
>>
>> In vect_enhance_data_refs_alignment function, it seems that result of 
>> vect_supportable_dr_alignment is not used in decision of whether to do 
>> peeling.
>>
>>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>   do_peeling = vector_alignment_reachable_p (dr);
>>
>> Later on, there is code to compare load/store costs. But it only decides 
>> whether to do peeling for load or store, not whether to do peeling.
>>
>> Currently I have a workaround. For the following simple loop, the size is 
>> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 20131114)
>
> What's the speed difference?
>
>> int A[100];
>> int B[100];
>> void foo2() {
>>   int i;
>>   for (i = 0; i < 100; ++i)
>> A[i] = B[i] + 100;
>> }
>>
>> What is the best way to tell vectorizer not to do peeling in such situation?
>
> Well, the vectorizer should compute the cost without peeling and then,
> when the cost with peeling is not better then do not peel.  That's
> very easy to check with the vectorization_cost hook by comparing
> vector_load / unaligned_load and vector_store / unaligned_store cost.
>
> Richard.
>
>>
>> Thanks,
>> Bingfeng Mei
>> Broadcom UK
>>
>
>




Re: Frame pointer, bug or feature? (x86)

2013-11-15 Thread Andrew Pinski
On Fri, Nov 15, 2013 at 9:31 AM, Hendrik Greving
 wrote:
> In the below test case, "CASE_A" actually uses a frame pointer, while
> !CASE_A doesn't. I can't imagine this is a feature, this is a bug,
> isn't it? Is there any reason the compiler couldn't know that
> loop_blocks never needs a dynamic stack size?


Both a feature and a bug.  In the CASE_A case (with GNU C) it is a VLA,
while in the !CASE_A case (or in either case with C++) it is a normal
array definition.  The compiler could have converted the VLA to a
normal array but does not, depending on the size of the array.
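A small illustration of the distinction (not from Andrew's mail): a const-qualified variable is not an integer constant expression in C, so the CASE_A array is a VLA, while an enumeration constant keeps the array a normal fixed-size one; the function name below is made up and init is declared with an explicit return type for clarity:

#define MY_DEFINE 100

extern void init (int (*a)[]);

int
foo_fixed (void)
{
  enum { max = MY_DEFINE * 2 };   /* constant expression, unlike 'const int max' */
  int loop_blocks[max];           /* ordinary array, not a VLA */
  init (&loop_blocks);
  return loop_blocks[5];
}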

Thanks,
Andrew Pinski

>
> #include 
> #include 
>
> #define MY_DEFINE 100
> #define CASE_A 1
>
> extern init(int (*a)[]);
>
> int
> foo()
> {
> #if CASE_A
> const int max = MY_DEFINE * 2;
> int loop_blocks[max];
> #else
> int loop_blocks[MY_DEFINE * 2];
> #endif
> init(&loop_blocks);
> return loop_blocks[5];
> }
>
> int
> main()
> {
> int i = foo();
> printf("is is %d\n", i);
> }
>
> Thanks,
> Hendrik Greving


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Xinliang David Li
I agree it is hard to tune the cost model to make it precise.

The trunk compiler now supports better command-line control for cost model
selection. It seems to me that you can backport that change (as well
as the changes to control the loop and slp vectorizers with different options)
to your branch. With those, you can do the following:
1) turn on vectorization with -O2: -O2 -ftree-loop-vectorize -- it
will use the 'cheap' model, which disables peeling;
or
2) -O3 -fvect-cost-model=cheap  --> it will also disable peeling;
3) play with different parameters to control peeling, alias-check
versioning, etc.

Better yet -- improve the vectorizer to reduce the cost in general
(e.g., better alias analysis, better alignment propagation, more
efficient runtime alias checks, etc.).

thanks,

David

On Fri, Nov 15, 2013 at 10:01 AM, Bingfeng Mei  wrote:
> Thanks for the suggestion. It seems that parameter is only available in HEAD, 
> not in 4.8. I will backport to 4.8.
>
> However, implementing a good cost model seems quite tricky to me. There are 
> conflicting requirements for different processors. For us or many embedded 
> processors, 4-time size increase is unacceptable. But for many desktop 
> processor/applications, I guess it is worth to trade significant size with 
> some performance improvement. Not sure if existing 
> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to task. Maybe an extra 
> target hook or parameter should be provided to make such tradeoff.
>
> Additionally, it seems hard to accurately estimate the costs. As Hendrik 
> pointed out, misaligned access will affect cache performance for some 
> processors. But for our processor, it is OK. Maybe just to pass a high cost 
> for misaligned access for such processor is sufficient to guarantee to 
> generate loop peeling.
>
> Bingfeng
>
>
> -Original Message-
> From: Xinliang David Li [mailto:davi...@google.com]
> Sent: 15 November 2013 17:30
> To: Bingfeng Mei
> Cc: Richard Biener; gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> The right longer term fix is suggested by Richard. For now you can
> probably override the peel parameter for your target (in the target
> option_override function).
>
>  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
> 0, opts->x_param_values, opts_set->x_param_values);
>
> David
>
> On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei  wrote:
>> Hi, Richard,
>> Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
>> peeling is also slower for our processors.
>>
>> By vectorization_cost, do you mean 
>> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>>
>> In our case, it is easy to make decision. But generally, if peeling loop is 
>> faster but bigger, what should be right balance? How to do with cases that 
>> are a bit faster and a lot bigger?
>>
>> Thanks,
>> Bingfeng
>> -Original Message-
>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>> Sent: 15 November 2013 14:02
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: Vectorization: Loop peeling with misaligned support.
>>
>> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei  wrote:
>>> Hi,
>>> In loop vectorization, I found that vectorizer insists on loop peeling even 
>>> our target supports misaligned memory access. This results in much bigger 
>>> code size for a very simple loop. I defined 
>>> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
>>> TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
>>> almost as cheap as an aligned one. But the vectorizer still does peeling 
>>> anyway.
>>>
>>> In vect_enhance_data_refs_alignment function, it seems that result of 
>>> vect_supportable_dr_alignment is not used in decision of whether to do 
>>> peeling.
>>>
>>>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>>   do_peeling = vector_alignment_reachable_p (dr);
>>>
>>> Later on, there is code to compare load/store costs. But it only decides 
>>> whether to do peeling for load or store, not whether to do peeling.
>>>
>>> Currently I have a workaround. For the following simple loop, the size is 
>>> 80bytes vs. 352 bytes without patch (-O2 -ftree-vectorize gcc 4.8.3 
>>> 20131114)
>>
>> What's the speed difference?
>>
>>> int A[100];
>>> int B[100];
>>> void foo2() {
>>>   int i;
>>>   for (i = 0; i < 100; ++i)
>>> A[i] = B[i] + 100;
>>> }
>>>
>>> What is the best way to tell vectorizer not to do peeling in such situation?
>>
>> Well, the vectorizer should compute the cost without peeling and then,
>> when the cost with peeling is not better then do not peel.  That's
>> very easy to check with the vectorization_cost hook by comparing
>> vector_load / unaligned_load and vector_store / unaligned_store cost.
>>
>> Richard.
>>
>>>
>>> Thanks,
>>> Bingfeng Mei
>>> Broadcom UK
>>>
>>
>>
>
>


RFC: FLT_ROUNDS and fesetround

2013-11-15 Thread H.J. Lu
Hi,

float.h has

/* Addition rounds to 0: zero, 1: nearest, 2: +inf, 3: -inf, -1: unknown.  */
/* ??? This is supposed to change with calls to fesetround in <fenv.h>.  */
#undef FLT_ROUNDS
#define FLT_ROUNDS 1

Clang introduces __builtin_flt_rounds and

#define FLT_ROUNDS (__builtin_flt_rounds())

I am not sure if it is the correct approach.  Is there any plan to
address this in GCC and glibc?

Thanks.

-- 
H.J.


Re: RFC: FLT_ROUNDS and fesetround

2013-11-15 Thread Joseph S. Myers
On Fri, 15 Nov 2013, H.J. Lu wrote:

> Hi,
> 
> float.h has
> 
> /* Addition rounds to 0: zero, 1: nearest, 2: +inf, 3: -inf, -1: unknown.  */
> /* ??? This is supposed to change with calls to fesetround in <fenv.h>.  */
> #undef FLT_ROUNDS
> #define FLT_ROUNDS 1
> 
> Clang introduces __builtin_flt_rounds and
> 
> #define FLT_ROUNDS (__builtin_flt_rounds())
> 
> I am not sure if it is the correct approach.  Is there any plan to
> address this in GCC and glibc?

This is GCC bug 30569.  It's one of the more straightforward of the 
various floating-point conformance issues, fixable with more or less 
self-contained local changes whereas the general issues with exceptions 
and rounding modes support (I don't know if you are interested in those) 
would involve much more complicated and wide-ranging changes to fix, and 
much more initial design work.

FLT_ROUNDS can't involve a call to a libm function - or to any function 
outside the reserved and C90 namespaces (it can't call fegetround, even if 
a particular system has that in libc, because FLT_ROUNDS is in C90 and 
fegetround is in the user's namespace for C90).  So GCC needs to expand it 
inline (the expansion might involve a call to a reserved-namespace library 
function).

Given that it expands it inline, a function call __builtin_flt_rounds, 
that returns an int with the correct value, is the natural interface.  The 
following are my thoughts about how this might be implemented.

First, there seems no point in "optimizing" it to 1 in the 
-fno-rounding-math (default) case; if people use FLT_ROUNDS they'll expect 
accurate information even if not using -frounding-math.

Second, the default for targets not providing the relevant facilities will 
of course be to return 1.

Third, one might imagine expanding this either through a flt_rounds 
insn pattern, or through a target hook to expand earlier to trees or 
GIMPLE.  My inclination is the latter.  My reasoning is: a typical 
__builtin_flt_rounds implementation would probably use an appropriate 
instruction to access a floating-point control register, mask out two bits 
from that register, and then have a switch statement to map the two bits 
to the values specified for FLT_ROUNDS.  (You can easily enough do 
arbitrary permutations of 0-3 without a switch, but a switch is what it is 
in logical terms.)  A typical user would probably be doing "if (FLT_ROUNDS 
== N)" or "switch (FLT_ROUNDS)".  If the mapping from hardware bits to 
FLT_ROUNDS values is represented as a switch at GIMPLE level, the GIMPLE 
optimizers should be able to eliminate the conversion to FLT_ROUNDS 
convention and turn things into a simple switch on the masked register 
value.  (NB I haven't tested that - but if they don't, it's a clear missed 
optimization of more general use.)  Since the GIMPLE level is where such 
optimizations generally take place in GCC, it's best to represent the 
conversion to FLT_ROUNDS convention at the GIMPLE level.
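As a user-level illustration of that mask-and-switch mapping (not from the original mail), here is a sketch using the x86 SSE control register; it assumes the MXCSR rounding-control field sits in bits 13-14, and a real __builtin_flt_rounds expansion would of course emit the equivalent GIMPLE rather than call a function:

#include <stdio.h>
#include <xmmintrin.h>   /* _mm_getcsr; assumes x86 with SSE */

/* Map the MXCSR rounding-control bits to the FLT_ROUNDS convention:
   0 toward zero, 1 to nearest, 2 toward +inf, 3 toward -inf.  */
static int
flt_rounds_sketch (void)
{
  unsigned int rc = (_mm_getcsr () >> 13) & 3;
  switch (rc)
    {
    case 0: return 1;   /* round to nearest */
    case 1: return 3;   /* round toward -infinity */
    case 2: return 2;   /* round toward +infinity */
    default: return 0;  /* round toward zero */
    }
}

int
main (void)
{
  printf ("FLT_ROUNDS-style value: %d\n", flt_rounds_sketch ());
  return 0;
}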

Thus, I'd imagine a hook or hooks would specify (a) an 
architecture-specific built-in function to get the floating-point control 
register value (or appropriate code other than a built-in function call), 
(b) a mask to apply to that value, (c) the resulting values in the 
register for each rounding mode.  And generic code would take care of 
generating the switch to convert from machine values to FLT_ROUNDS values.  
(And if the mapping happens to be the identity map, it could avoid 
generating the switch.  That accommodates architectures that for any 
reason need to do their own expansion not based on extracting two bits and 
using a switch.)

Cases needing something special include powerpc-linux-gnu soft-float where 
the rounding mode is in libc rather than hardware.  Once 
 and 
 are reviewed 
I intend to follow up to those by adding __flt_rounds, 
__atomic_feholdexcept, __atomic_feclearexcept and __atomic_feupdateenv 
functions to powerpc-linux-gnu soft-float glibc for GCC to use in 
expansion of FLT_ROUNDS and atomic compound assignment.  Any hooks should 
accommodate targets wishing in some cases just to generate their own libc 
call like this.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Ondřej Bílka
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> Also keep in mind that usually costs go up significantly if
> misalignment causes cache line splits (processor will fetch 2 lines).
> There are non-linear costs of filling up the store queue in modern
> out-of-order processors (x86). Bottom line is that it's much better to
> peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
> line boundaries otherwise. The solution is to either actually always
> peel for alignment, or insert an additional check for cache line
> boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there has been no overhead for unaligned SSE loads except for fetching
the cache lines. On Haswell, AVX2 loads behave in a similar way.

You are forgetting that the loop needs both cache lines when it issues an
unaligned load. This will generally take the maximum of the times needed to
access these lines. With peeling you access the first cache line, and
after that the loop accesses the second, effectively doubling the running time
when both lines were in main memory.

You also need to account for all factors, not just show that one factor is
expensive. There are several factors in play; the cost of branch
misprediction is the main argument against doing peeling, so you need to
show that the cost of unaligned loads is bigger than the cost of branch
misprediction of a peeled implementation.

As a quick example of why peeling is generally a bad idea I did a simple
benchmark. Could somebody with Haswell also test the attached code generated
by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeatedly call a function set with a pointer randomly
picked from 262144 bytes to stress the L2 cache; the relevant tester 
is the following (file test.c):

for (i=0;i<1;i++){
 set (ptr + 64 * (p % (SIZE /64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));

First we vectorize the following function. The vectorizer here does
peeling (the assembly is a bit long, see file set1.s):

void set(int *p, int *q){
  int i;
  for (i=0; i<128; i++)
 p[i] = 42 * p[i];
}

When I ran it I got

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real0m3.724s
user0m3.724s
sys 0m0.000s

Now what happens if we use separate input and output arrays? The gcc
vectorizer fortunately does not peel in this case (file set2.s), which
gives better performance.

void set(int *p, int *q){
  int i;
  for (i=0; i<128; i++)
 p[i] = 42 * q[i];
}

$ gcc test.o set2.s
$ time ./a.out

real0m3.169s
user0m3.170s
sys 0m0.000s


The speedup here can be partially explained by the fact that in-place
modifications run slower. To eliminate this possibility we change the
assembly to make the input the same as the output (file set3.s):

jb  .L15
 .L7:
xorl%eax, %eax
+   movq%rdi, %rsi
.p2align 4,,10
.p2align 3
 .L5:

$ gcc test.o set3.s
$ time ./a.out

real0m3.169s
user0m3.170s
sys 0m0.000s

Which is still faster than what the peeling vectorizer generated.

And in this test the alignment is constant, so branch misprediction
is not even an issue.
#define _GNU_SOURCE
#include 
int main(){
   char *ptr = pvalloc(2 * SIZE + 128);
   char *ptr2 = pvalloc(2 * SIZE + 128);

   unsigned long p = 31;
   unsigned long q = 17;

   int i;
   for (i=0; i < 1; i++) {
 set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));
 p = 11 * p + 3;
 q = 13 * p + 5;
   }
}
.file   "set1.c"
.text
.p2align 4,,15
.globl  set
.type   set, @function
set:
.LFB0:
.cfi_startproc
leaq32(%rdi), %rax
cmpq%rax, %rsi
jb  .L12
movq  %rdi, %rsi
.L6:
vmovdqu (%rsi), %ymm1
vmovdqa .LC0(%rip), %ymm0
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, (%rdi)
vmovdqu 32(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 32(%rdi)
vmovdqu 64(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 64(%rdi)
vmovdqu 96(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 96(%rdi)
vmovdqu 128(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 128(%rdi)
vmovdqu 160(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 160(%rdi)
vmovdqu 192(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 192(%rdi)
vmovdqu 224(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 224(%rdi)
vmovdqu 256(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 256(%rdi)
vmovdqu 288(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 288(%rdi)
vmovdqu 320(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 320(%rdi)
vmovdqu 352(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 352(%rdi)
vmovdqu 384(%rsi), %ymm1
vpmulld

Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Ondřej Bílka
On Fri, Nov 15, 2013 at 11:26:06PM +0100, Ondřej Bílka wrote:
Minor correction: a mutt mix-up replaced the set1.s file with one that I later
used for the avx2 variant. The correct file is the following:
.file   "set1.c"
.text
.p2align 4,,15
.globl  set
.type   set, @function
set:
.LFB0:
.cfi_startproc
movq%rdi, %rax
andl$15, %eax
shrq$2, %rax
negq%rax
andl$3, %eax
je  .L9
movl(%rdi), %edx
movl$42, %esi
imull   %esi, %edx
cmpl$1, %eax
movl%edx, (%rdi)
jbe .L10
movl4(%rdi), %edx
movl$42, %ecx
imull   %ecx, %edx
cmpl$2, %eax
movl%edx, 4(%rdi)
jbe .L11
movl8(%rdi), %edx
movl$42, %r11d
movl$125, %r10d
imull   %r11d, %edx
movl$3, %r11d
movl%edx, 8(%rdi)
.L2:
movl$128, %r8d
xorl%edx, %edx
subl%eax, %r8d
movl%eax, %eax
movl%r8d, %esi
leaq(%rdi,%rax,4), %rcx
xorl%eax, %eax
shrl$2, %esi
leal0(,%rsi,4), %r9d
.p2align 4,,10
.p2align 3
.L8:
movdqa  (%rcx,%rax), %xmm1
addl$1, %edx
pslld   $1, %xmm1
movdqa  %xmm1, %xmm0
pslld   $2, %xmm0
psubd   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
pslld   $3, %xmm1
psubd   %xmm0, %xmm1
movdqa  %xmm1, (%rcx,%rax)
addq$16, %rax
cmpl%edx, %esi
ja  .L8
movl%r10d, %ecx
leal(%r11,%r9), %eax
subl%r9d, %ecx
cmpl%r9d, %r8d
je  .L1
movslq  %eax, %rdx
movl$42, %r9d
leaq(%rdi,%rdx,4), %rdx
movl(%rdx), %esi
imull   %r9d, %esi
cmpl$1, %ecx
movl%esi, (%rdx)
leal1(%rax), %edx
je  .L1
movslq  %edx, %rdx
movl$42, %r8d
addl$2, %eax
leaq(%rdi,%rdx,4), %rdx
movl(%rdx), %esi
imull   %r8d, %esi
cmpl$2, %ecx
movl%esi, (%rdx)
je  .L1
cltq
movl$42, %r10d
leaq(%rdi,%rax,4), %rax
movl(%rax), %edx
imull   %r10d, %edx
movl%edx, (%rax)
ret
.p2align 4,,10
.p2align 3
.L1:
rep ret
.p2align 4,,10
.p2align 3
.L9:
movl$128, %r10d
xorl%r11d, %r11d
jmp .L2
.p2align 4,,10
.p2align 3
.L11:
movl$126, %r10d
movl$2, %r11d
jmp .L2
.p2align 4,,10
.p2align 3
.L10:
movl$127, %r10d
movl$1, %r11d
jmp .L2
.cfi_endproc
.LFE0:
.size   set, .-set
.ident  "GCC: (Debian 4.8.1-10) 4.8.1"
.section.note.GNU-stack,"",@progbits


Re: proposal to make SIZE_TYPE more flexible

2013-11-15 Thread DJ Delorie

> Everything handling __int128 would be updated to work with a 
> target-determined set of types instead.
> 
> Preferably, the number of such keywords would be arbitrary (so I suppose 
> there would be a single RID_INTN for them) - that seems cleaner than the 
> system for address space keywords with a fixed block from RID_ADDR_SPACE_0 
> to RID_ADDR_SPACE_15.

I did a scan through the gcc source tree trying to track down all the
implications of this, and there were a lot of them, and not just the
RID_* stuff.  There's also the integer_types[] array (indexed by
itk_*, which is its own mess) and c_common_reswords[] array, for
example.

I think it might not be possible to have one RID_* map to multiple
actual keywords, as there are a few cases that need to know *which* intN
is used *and* have access to the original string of the token, and
many cases where code assumes a 1:1 relation between a RID_*, a type,
and a keyword string.

IMHO the key design choices come down to:

* Do we change a few global const arrays to be dynamic arrays?

* We need to consider that "position in array" is no longer a suitable
  sort key for these arrays.  itk_* comes to mind here, but RID_* are
  abused sometimes too.  (note: I've seen this before, where PSImode
  isn't included in "find smallest mode" logic, for example, because
  it's not in the array in the same place as SImode)

* Need to dynamically map keywords/bitsizes/tokens to types in all the
  cases where we explicitly check for int128.  Some of these places
  have explicit "check types in the right order" logic hard-coded that
  may need to be changed to a data-search logic.

* The C++ mangler needs to know what to do with these new types.

I'll attach my notes from the scan for reference...


Search for int128 ...
Search for c_common_reswords ...
Search for itk_ ...

--- . ---

tree-core.h

enum integer_type_kind is used to map all integer types "in
order" so we need an alternate way to map them.  Currently hard-codes
the itk_int128 types.

tree.h

defines int128_unsigned_type_node and int128_integer_type_node

uses itk_int128 and itk_unsigned_int128 - int128_*_type_node
is an [itk_*] array reference.

builtin-types.def

defines BT_INT128 but nothing uses it yet.

gimple.c

gimple_signed_or_unsigned_type maps types to their signed or
unsigned variant.  Two cases: one checks for int128
explicitly, the other checks for compatibility with int128.

tree.c

make_or_reuse_type maps size/signed to a
int128_integer_type_node etc.

build_common_tree_nodes makes int128_*_type_node if the target
supports TImode.

tree-streamer.c

preload_common_nodes() records one node per itk_*

--- LTO ---

lto.c

read_cgraph_and_symbols() reads one node per integer_types[itk_*]

--- C-FAMILY ---

c-lex.c

interpret_integer scans itk_* to find the best (smallest) type
for integers.

narrowest_unsigned_type assumes integer_types[itk_*] in
bit-size order, and assumes [N*2] is signed/unsigned pairs.

narrowest_signed_type: same.

c-cppbuiltin.c

__SIZEOF_INTn__ for each intN

c-pretty-print.c

prints I128 suffix for int128-sized integer literals.

c-common.c

int128_* has an entry in c_global_trees[]

c_common_reswords[] has an entry for __int128 -> RID_INT128

c_common_type_for_size maps int:128 to  int128_*_type_node

c_common_type_for_mode: same.

c_common_signed_or_unsigned_type - checks for int128 types.
same as gimple_signed_or_unsigned_type()?

c_build_bitfield_integer_type assigns int128_*_type_node for
:128 fields.

c_common_nodes_and_builtins maps int128_*_type_node to
RID_INT128 and "__int128".  Also maps to decl __int128_t

keyword_begins_type_specifier() checks for RID_INT128

--- C ---

c-tree.h

adds cts_int128 to c_typespec_keyword[]

c-parser.c

c_parse_init() reads c_common_reswords[] which has __int128,
maps one id to each RID_* code.

c_token_starts_typename() checks for RID_INT128

c_token_starts_declspecs() checks for RID_INT128

c_parser_declspecs() checks for RID_INT128

c_parser_attribute_any_word() checks for RID_INT128

c_parser_objc_selector() checks for RID_INT128

c-decl.c

error for "long __int128" etc throughout

declspecs_add_type() checks for RID_INT128

finish_declspecs() checks for cts_int128

--- FORTRAN ---

iso-c-binding.def

maps int128_t to c_int128_t via get_int_kind_from_width()

--- C++ ---

class.c

layout_class_types uses itk_* to find the best (smallest)
integer type for overlarge bitfields.

lex.c

init_reswords() reads c_common_reswords[], which includes __int128

rtti.c

emit_support_tinfos has a dummy list of fundamental types

Re: suspect code in fold-const.c

2013-11-15 Thread Kenneth Zadeck


This patch fixes a number of places where the mode bitsize had been used 
but the mode precision should have been used.  The tree level is 
somewhat sloppy about this - some places use the mode precision and some 
use the mode bitsize.   It seems that the mode precision is the proper 
choice since it does the correct thing if the underlying mode is a 
partial int mode.


This code has been tested on x86-64 with no regressions.   Ok to commit?



2013-11-15 Kenneth Zadeck 
* tree.c (int_fits_type_p): Change GET_MODE_BITSIZE to
GET_MODE_PRECISION.
* fold-const.c (fold_single_bit_test_into_sign_test,
fold_binary_loc):  Change GET_MODE_BITSIZE to
GET_MODE_PRECISION.

Kenny


On 11/15/2013 08:32 AM, Kenneth Zadeck wrote:

On 11/15/2013 04:07 AM, Eric Botcazou wrote:

this code from fold-const.c starts on line 13811.

  else if (TREE_INT_CST_HIGH (arg1) == signed_max_hi
   && TREE_INT_CST_LOW (arg1) == signed_max_lo
   && TYPE_UNSIGNED (arg1_type)
   /* We will flip the signedness of the comparison 
operator

  associated with the mode of arg1, so the sign bit is
  specified by this mode.  Check that arg1 is the signed
  max associated with this sign bit.  */
   && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
   /* signed_type does not work on pointer types. */
   && INTEGRAL_TYPE_P (arg1_type))

with width defined as:

unsigned int width = TYPE_PRECISION (arg1_type);


it seems that the check on bitsize should really be a check on the
precision of the variable.   If this seems right, i will correct 
this on

the trunk and make the appropriate changes to the wide-int branch.

Do you mean

   && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))

instead?  If so, that would probably make sense, but there are a few 
other
places with the same TYPE_PRECISION/GET_MODE_BITSIZE check, in 
particular the

very similar transformation done in fold_single_bit_test_into_sign_test.

yes.  I understand the need to do this check on the mode rather than 
the precision of the type itself.
The point is that if the mode under this type happens to be a partial 
int mode, then that sign bit may not even be where the bitsize points to.


However, having just done a few greps, it looks like this case was 
just the one that i found while doing the wide-int work, there may be 
several more of these cases.   Just in fold-const, there are a couple 
in fold_binary_loc.   The one in tree.c:int_fits_type_p looks 
particularly wrong.


I think that there are also several in tree-vect-patterns.c.

Kenny


Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 204842)
+++ gcc/fold-const.c	(working copy)
@@ -6593,7 +6593,7 @@ fold_single_bit_test_into_sign_test (loc
 	  /* This is only a win if casting to a signed type is cheap,
 	 i.e. when arg00's type is not a partial mode.  */
 	  && TYPE_PRECISION (TREE_TYPE (arg00))
-	 == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg00)))
+	 == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg00)))
 	{
 	  tree stype = signed_type_for (TREE_TYPE (arg00));
 	  return fold_build2_loc (loc, code == EQ_EXPR ? GE_EXPR : LT_EXPR,
@@ -12050,7 +12050,7 @@ fold_binary_loc (location_t loc,
 	    zerobits = ((((unsigned HOST_WIDE_INT) 1) << shiftc) - 1);
 	  else if (TREE_CODE (arg0) == RSHIFT_EXPR
 		   && TYPE_PRECISION (TREE_TYPE (arg0))
-		  == GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (arg0)))
+		  == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (arg0)))
 	{
 	  prec = TYPE_PRECISION (TREE_TYPE (arg0));
 	  tree arg00 = TREE_OPERAND (arg0, 0);
@@ -12061,7 +12061,7 @@ fold_binary_loc (location_t loc,
 		{
 		  tree inner_type = TREE_TYPE (TREE_OPERAND (arg00, 0));
 		  if (TYPE_PRECISION (inner_type)
-		  == GET_MODE_BITSIZE (TYPE_MODE (inner_type))
+		  == GET_MODE_PRECISION (TYPE_MODE (inner_type))
 		  && TYPE_PRECISION (inner_type) < prec)
 		{
 		  prec = TYPE_PRECISION (inner_type);
@@ -13816,7 +13816,7 @@ fold_binary_loc (location_t loc,
 			associated with the mode of arg1, so the sign bit is
 			specified by this mode.  Check that arg1 is the signed
 			max associated with this sign bit.  */
-		 && width == GET_MODE_BITSIZE (TYPE_MODE (arg1_type))
+		 && width == GET_MODE_PRECISION (TYPE_MODE (arg1_type))
 		 /* signed_type does not work on pointer types.  */
 		 && INTEGRAL_TYPE_P (arg1_type))
 	  {
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	(revision 204842)
+++ gcc/tree.c	(working copy)
@@ -8614,7 +8614,7 @@ retry:
   /* Third, unsigned integers with top bit set never fit signed types.  */
   if (! TYPE_UNSIGNED (type) && unsc)
 {
-  int prec = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (c))) - 1;
+  int prec = GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (c))) - 1;
+  int prec = GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (c))) - 1;

Thank you!

2013-11-15 Thread Mark Mitchell
Folks --

It's been a long time since I've posted to the GCC mailing list because (as is 
rather obvious) I haven't been directly involved in GCC development for quite 
some time.  As of today, I'm no longer at Mentor Graphics (the company that 
acquired CodeSourcery), so I no longer even have a management role in a company 
involved in GCC development.  And, as I don't have plans to be involved in GCC 
development in the foreseeable future, it seems best to admit that I'm no 
longer a maintainer of GCC.  I've also tendered my resignation to the GCC 
Steering Committee.  David Edelsohn has kindly agreed to make the requisite 
changes to the MAINTAINERS file on my behalf.  

GCC has been an interest of mine for a very long time, beginning with the point 
at which I convinced a previous employer to deploy it as a cross-platform 
compiler solution because we were having so many problems with incompatibility 
between the various proprietary compilers we were using.  Of course, GCC itself 
still had a few bugs left at that point, so I fixed one or two, and, later, 
when I should have been writing papers in graduate school, I implemented some 
C++ template features (with much help from Jason Merrill and others) instead, 
and, eventually became very involved in the development of GCC.  I'll of course 
remain interested in GCC, even if more as an observer than as a participant!

I'd very much like to thank all who are, have been, or will be developers and 
maintainers of GCC.  Of course, I'm particularly grateful to those who reviewed 
my patches, fixed the bugs I introduced, endured my nit-picking reviews of 
their patches, and so forth.  But, there are literally hundreds of you -- 
perhaps thousands -- who have contributed, and I'd like to thank all of you; 
your contributions and your community gave me the opportunity to have a ton of 
fun.

Thank you,

--
Mark Mitchell



Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Tim Prince

On 11/15/2013 2:26 PM, Ondřej Bílka wrote:

On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:

Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). Bottom line is that it's much better to
peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
line boundaries otherwise. The solution is to either actually always
peel for alignment, or insert an additional check for cache line
boundaries (for high trip count loops).

That is quite bold claim do you have a benchmark to support that?

Since nehalem there is no overhead of unaligned sse loads except of fetching
cache lines. As haswell avx2 loads behave in similar way.
Where gcc or gfortran chooses to split sse2 or sse4 loads, I found a 
marked advantage in that choice on my Westmere (which I seldom power on 
nowadays). You are correct that this finding is in disagreement with 
Intel documentation, and it has the effect that the Intel option -xHost is 
not the optimum one. I suspect the Westmere performed less well than 
Nehalem on unaligned loads. Another poorly documented feature of 
Nehalem and Westmere was a preference for 32-byte aligned data, more so 
than Sandy Bridge.

Intel documentation encourages use of unaligned AVX-256 loads on Ivy 
Bridge and Haswell, but Intel compilers don't implement them (except for 
intrinsics) until AVX2. Still, in my own Haswell tests, the splitting of 
unaligned loads by use of the AVX compile option comes out ahead. 
Supposedly, the preference of Windows intrinsics programmers for the 
relative simplicity of unaligned moves was taken into account in the 
more recent hardware designs, as it was disastrous for Sandy Bridge.

I have only remote access to Haswell, although I plan to buy a laptop 
soon. I'm skeptical about whether useful findings on these points can be 
obtained on a Windows laptop.

In case you didn't notice it, Intel compilers introduced #pragma vector 
unaligned as a means to specify handling of unaligned access without 
peeling. I guess it is expected to be useful on Ivy Bridge or Haswell 
for cases where the loop count is moderate but expected to match 
unrolled AVX-256, or if the case where peeling can improve alignment is 
rare.

In addition, Intel compilers learned from gcc the trick of using AVX-128 
for situations where frequent unaligned accesses are expected and 
peeling is clearly undesirable. The new facility for vectorizing OpenMP 
parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, 
consistent with the fact that OpenMP chunks are more frequently 
unaligned. In fact, parallel for simd seems to perform nearly the same 
with gcc-4.9 as with icc.

Many decisions on compiler defaults are still based on an unscientific 
choice of benchmarks, with gcc evidently more responsive to input from 
the community.


--
Tim Prince



Re: Thank you!

2013-11-15 Thread Jeff Law

On 11/15/13 18:33, Mark Mitchell wrote:


I'd very much like to thank all who are, have been, or will be
developers and maintainers of GCC.  Of course, I'm particularly
grateful to those who reviewed my patches, fixed the bugs I
introduced, endured my nit-picking reviews of their patches, and so
forth.  But, there are literally hundreds of you -- perhaps thousands
-- who have contributed, and I'd like to thank all of you; your
contributions and your community gave me the opportunity to have a
ton of fun.

And on behalf of the entire GCC community I'd like to thank you.

You played so many roles for GCC through the years.  C++ co-maintainer, 
release manager, steering committee member, global reviewer, project 
leader, etc and excelled at each one.


In particular I'd like to call out your time as release manager.  I 
suspect some of the newer folks aren't aware of how desperately we 
needed someone to step into that role.  I had completely burnt out and 
GCC was floundering with not even a plan for how the next release was 
going to get out the door.


You volunteered for an often thankless job and did far more as a release 
manager than I could ever have envisioned.  You managed to herd a 
diverse group of developers, often with conflicting agendas, &  without 
any true power over how they spent their time.


During your time as release manager you really became a leader for the 
whole project and with your guidance GCC found continued success.



Even though I've been expecting this for a while, I'm still terribly sad 
to see you go.  As I'm sure you know, you're always welcome here in any 
capacity you choose to engage!


Thanks again,

jeff