Re: Vectorization: Loop peeling with misaligned support.

2013-11-16 Thread Richard Biener
"Ondřej Bílka"  wrote:
>On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>> Also keep in mind that costs usually go up significantly if
>> misalignment causes cache line splits (the processor will fetch 2
>> lines). There are non-linear costs of filling up the store queue in
>> modern out-of-order processors (x86). The bottom line is that it's
>> much better to peel, e.g. for AVX2/AVX3, if the loop would otherwise
>> cause loads that cross cache line boundaries. The solution is to
>> either actually always peel for alignment, or insert an additional
>> check for cache line boundaries (for high trip count loops).
>
>That is quite a bold claim; do you have a benchmark to support it?
>
>Since Nehalem there has been no overhead for unaligned SSE loads except
>for fetching the cache lines. On Haswell, AVX2 loads behave in a
>similar way.
>
>You are forgetting that the loop needs both cache lines when it issues
>an unaligned load. This will generally take the maximum of the times
>needed to access these lines. With peeling, you access the first cache
>line, and after that, in the loop, the second, effectively doubling the
>running time when both lines were in main memory.
>
>You also need to weigh all the factors, not just show that one factor
>is expensive. There are several factors in play; the cost of branch
>misprediction is the main argument against doing peeling, so you need
>to show that the cost of unaligned loads is bigger than the cost of
>branch misprediction in a peeled implementation.
>
>As a quick example of why peeling is generally a bad idea, I ran a
>simple benchmark. Could somebody with a Haswell also test the attached
>code generated by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?
>
>For the test we repeatedly call a function set() with a pointer
>randomly picked from 262144 bytes to stress the L2 cache; the relevant
>tester is the following (file test.c):
>
>for (i = 0; i < 1; i++) {
>  set (ptr + 64 * (p % (SIZE / 64) + 60),
>       ptr2 + 64 * (q % (SIZE / 64) + 60));
>}
>
>First we vectorize the following function. The vectorizer does peeling
>here (the assembly is a bit long; see file set1.s):
>
>void set(int *p, int *q){
>  int i;
>  for (i=0; i<128; i++)
> p[i] = 42 * p[i];
>}
>
>When I ran it I got:
>
>$ gcc -O3 -DSIZE= test.c
>$ gcc test.o set1.s
>$ time ./a.out
>
>real   0m3.724s
>user   0m3.724s
>sys    0m0.000s
>
>Now what happens if we use separate input and output arrays? The GCC
>vectorizer fortunately does not peel in this case (file set2.s), which
>gives better performance:
>
>void set(int *p, int *q){
>  int i;
>  for (i=0; i<128; i++)
> p[i] = 42 * q[i];
>}
>
>$ gcc test.o set2.s
>$ time ./a.out
>
>real   0m3.169s
>user   0m3.170s
>sys    0m0.000s
>
>
>The speedup here can be partially explained by the fact that in-place
>modifications run slower. To eliminate this possibility we change the
>assembly to make the input the same as the output (file set3.s):
>
>   jb  .L15
> .L7:
>   xorl%eax, %eax
>+  movq%rdi, %rsi
>   .p2align 4,,10
>   .p2align 3
> .L5:
>
>$ gcc test.o set3.s
>$ time ./a.out
>
>real   0m3.169s
>user   0m3.170s
>sys    0m0.000s
>
>This is still faster than what the peeling vectorizer generated.
>
>Also note that in this test the alignment is constant, so branch
>misprediction is not an issue.

IIRC what can still be seen are store-buffer related slowdowns when you have a 
big unaligned store/load in your loop.  Thus aligning stores still paid off the 
last time I measured this.
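For reference, the peeling transformation under discussion can be sketched in plain C (a hand-written illustration of the idea, not GCC's actual output; the 32-byte alignment target is an assumption matching the AVX vector width):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of loop peeling for store alignment: run scalar iterations
   until p is 32-byte aligned, after which the main loop can use
   aligned vector stores.  The main loop is written as scalar code
   here; a vectorizer would emit aligned AVX stores for it.  */
void
set_peeled (int *p, int n)
{
  int i = 0;
  /* Prologue: peel iterations until the store address is aligned.  */
  while (i < n && ((uintptr_t) (p + i) & 31) != 0)
    {
      p[i] = 42 * p[i];
      i++;
    }
  /* Main loop: p + i is now 32-byte aligned (or i == n).  */
  for (; i < n; i++)
    p[i] = 42 * p[i];
}
```

The branch-misprediction cost Ondřej mentions comes from the prologue's data-dependent trip count (0 to 7 iterations for 4-byte ints here) when the incoming alignment varies between calls.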

Richard.




Re: proposal to make SIZE_TYPE more flexible

2013-11-16 Thread Richard Biener
DJ Delorie  wrote:
>
>> Everything handling __int128 would be updated to work with a 
>> target-determined set of types instead.
>> 
>> Preferably, the number of such keywords would be arbitrary (so I
>suppose 
>> there would be a single RID_INTN for them) - that seems cleaner than
>the 
>> system for address space keywords with a fixed block from
>RID_ADDR_SPACE_0 
>> to RID_ADDR_SPACE_15.
>
>I did a scan through the gcc source tree trying to track down all the
>implications of this, and there were a lot of them, and not just the
>RID_* stuff.  There's also the integer_types[] array (indexed by
>itk_*, which is its own mess)

I don't think we need this. It should be split into frontend parts and what we 
consider part of the C ABI of the target.

>and c_common_reswords[] array, for example.
>
>I think it might not be possible to have one RID_* map to multiple
>actual keywords, as there are few cases that need to know *which* intN
>is used *and* have access to the original string of the token, and
>many cases where code assumes a 1:1 relation between RID_*, a type,
>and a keyword string.
>
>IMHO the key design choices come down to:
>
>* Do we change a few global const arrays to be dynamic arrays?
>
>* We need to consider that "position in array" is no longer a suitable
>  sort key for these arrays.  itk_* comes to mind here, but RID_* are
>  abused sometimes too.  (note: I've seen this before, where PSImode
>  isn't included in "find smallest mode" logic, for example, because
>  it's not in the array in the same place as SImode)
>
>* Need to dynamically map keywords/bitsizes/tokens to types in all the
>  cases where we explicitly check for int128.  Some of these places
>  have explicit "check types in the right order" logic hard-coded that
>  may need to be changed to a data-search logic.
>
>* The C++ mangler needs to know what to do with these new types.
>
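The "data-search logic" DJ mentions (replacing position-in-array assumptions like those in narrowest_unsigned_type) can be sketched as follows. This is a standalone illustration with a made-up type table, not GCC internals; the struct and the __int20 entry are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-target table of unsigned integer types.  Nothing
   below assumes the entries are sorted by bit size or sit at fixed
   positions, unlike the itk_* indexing scheme.  */
struct int_type_info
{
  const char *name;
  unsigned precision;  /* bits */
};

static const struct int_type_info unsigned_types[] = {
  { "unsigned __int20", 20 },  /* a target-specific __intN */
  { "unsigned char",     8 },
  { "unsigned int",     32 },
  { "unsigned short",   16 },
  { "unsigned long",    64 },
};

/* Data-search replacement for narrowest_unsigned_type: scan the whole
   table for the narrowest type with at least BITS bits of precision,
   instead of walking an array assumed to be in bit-size order.  */
static const struct int_type_info *
narrowest_unsigned (unsigned bits)
{
  const struct int_type_info *best = NULL;
  size_t n = sizeof unsigned_types / sizeof unsigned_types[0];
  for (size_t i = 0; i < n; i++)
    if (unsigned_types[i].precision >= bits
        && (!best || unsigned_types[i].precision < best->precision))
      best = &unsigned_types[i];
  return best;  /* NULL if no type is wide enough.  */
}
```

With a table like this, adding a target __intN is just another entry; no fixed ordering has to be maintained.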
>I'll attach my notes from the scan for reference...
>
>
>Search for int128 ...
>Search for c_common_reswords ...
>Search for itk_ ...
>
>--- . ---
>
>tree-core.h
>
>   enum integer_type_kind is used to map all integer types "in
>   order" so we need an alternate way to map them.  Currently hard-codes
>   the itk_int128 types.
>
>tree.h
>
>   defines int128_unsigned_type_node and int128_integer_type_node
>
>   uses itk_int128 and itk_unsigned_int128 - int128_*_type_node
>   is an [itk_*] array reference.
>
>builtin-types.def
>
>   defines BT_INT128 but nothing uses it yet.
>
>gimple.c
>
>   gimple_signed_or_unsigned_type maps types to their signed or
>   unsigned variant.  Two cases: one checks for int128
>   explicitly, the other checks for compatibility with int128.
>
>tree.c
>
>   make_or_reuse_type maps size/signed to a
>   int128_integer_type_node etc.
>
>   build_common_tree_nodes makes int128_*_type_node if the target
>   supports TImode.
>
>tree-streamer.c
>
>   preload_common_nodes() records one node per itk_*
>
>--- LTO ---
>
>lto.c
>
>   read_cgraph_and_symbols() reads one node per integer_types[itk_*]
>
>--- C-FAMILY ---
>
>c-lex.c
>
>   interpret_integer scans itk_* to find the best (smallest) type
>   for integers.
>
>   narrowest_unsigned_type assumes integer_types[itk_*] in
>   bit-size order, and assumes [N*2] is signed/unsigned pairs.
>
>   narrowest_signed_type: same.
>
>c-cppbuiltin.c
>
>   __SIZEOF_INTn__ for each intN
>
>c-pretty-print.c
>
>   prints I128 suffix for int128-sized integer literals.
>
>c-common.c
>
>   int128_* has an entry in c_global_trees[]
>
>   c_common_reswords[] has an entry for __int128 -> RID_INT128
>
>   c_common_type_for_size maps int:128 to  int128_*_type_node
>
>   c_common_type_for_mode: same.
>
>   c_common_signed_or_unsigned_type - checks for int128 types.
>   same as gimple_signed_or_unsigned_type()?
>
>   c_build_bitfield_integer_type assigns int128_*_type_node for
>   :128 fields.
>
>   c_common_nodes_and_builtins maps int128_*_type_node to
>   RID_INT128 and "__int128".  Also maps to decl __int128_t
>
>   keyword_begins_type_specifier() checks for RID_INT128
>
>--- C ---
>
>c-tree.h
>
>   adds cts_int128 to c_typespec_keyword[]
>
>c-parser.c
>
>   c_parse_init() reads c_common_reswords[] which has __int128,
>   maps one id to each RID_* code.
>
>   c_token_starts_typename() checks for RID_INT128
>
>   c_token_starts_declspecs() checks for RID_INT128
>
>   c_parser_declspecs() checks for RID_INT128
>
>   c_parser_attribute_any_word() checks for RID_INT128
>
>   c_parser_objc_selector() checks for RID_INT128
>
>c-decl.c
>
>   error for "long __int128" etc throughout
>
>   declspecs_add_type() checks for RID_INT128
>
>   finish_declspecs() checks for cts_int128
>
>--- FORTRAN ---
>
>iso-c-binding.def
>
>   maps int128_t to c_int128_t via get_int_kind_from_width(
>
>--- C++ ---
>
>clas

Re: Vectorization: Loop peeling with misaligned support.

2013-11-16 Thread Ondřej Bílka
On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> "Ondřej Bílka"  wrote:
> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> 
> IIRC what can still be seen is store-buffer related slowdowns when you have a 
> big unaligned store load in your loop.  Thus aligning stores still pays back 
> last time I measured this.

Then send your benchmark. What I did is a loop that stores 512 bytes. Unaligned 
stores there are faster than aligned ones, so tell me where aligning stores pays 
for itself. Note that when filling the store buffer you must take into account 
the extra stores needed to make the loop aligned.

Also, what do you do with loops that contain no store? If I modify the test to

int set(int *p, int *q){
  int i;
  int sum = 0;
  for (i=0; i < 128; i++)
 sum += 42 * p[i];
  return sum;
}

then it still does aligning.

There may be a threshold after which aligning the buffer makes sense; then you
need to show that loops spend most of their time on sizes above that threshold.

Also, do you have data on how common store-buffer slowdowns are? Without
knowing that, you risk making a few loops faster at the expense of the
majority, which could well slow the whole application down. It would not
surprise me, as these loops can run mostly on L1 cache data (which is about
as plausible as assuming that the increased code size fits into the
instruction cache).


Actually, these questions could be answered by a test: first compile
SPEC2006 with vanilla gcc -O3, and then with a gcc that contains a patch to
use unaligned loads. The results will tell whether peeling is also good in
practice or not.


Re: proposal to make SIZE_TYPE more flexible

2013-11-16 Thread Joseph S. Myers
On Sat, 16 Nov 2013, Richard Biener wrote:

> >I did a scan through the gcc source tree trying to track down all the
> >implications of this, and there were a lot of them, and not just the
> >RID_* stuff.  There's also the integer_types[] array (indexed by
> >itk_*, which is its own mess)
> 
> I don't think we need this. It should be split into frontend parts and 
> what we consider part of the C ABI of the target.

Indeed, most middle-end references to particular C types are suspect (and 
likewise references to particular modes such as SImode, other than QImode, 
which is always the target byte of BITS_PER_UNIT bits).  As regards the 
ABI, I've previously suggested a hook taking a subset of itk_* values as 
a replacement for target macros such as INT_TYPE_SIZE (cf. Joern's patches, 
Nov/Dec 2010, and note that in various places it may be better to use 
TYPE_PRECISION (integer_type_node) etc. if the type nodes have been 
initialized by then).  But such a hook doesn't need to handle __intN types, 
since it's part of the definition of __intN that it takes N bits.

I suppose global type nodes may be needed outside front ends for dealing 
with built-in function interfaces if nothing else, but for __intN you 
might get an interface "return the signed / unsigned __intN type" rather 
than macros like int128_integer_type_node.

(Looking at some examples of middle-end code inappropriately using 
particular C ABI types: the code in tree-ssa-loop-ivopts.c needs some way 
to iterate over "integer types that are efficient on the target"; what 
types the ABI says are int / long / long long should be irrelevant.  Code 
in convert.c is an example of a trickier case, where the enums for 
built-in functions embed information about the integer and floating-point 
types involved and a better way would be needed to extract the underlying 
operation for a built-in function and identify the function, if any, for 
an arbitrary type or pair of types.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Spamming the gcc-testresults mailing list (or not).

2013-11-16 Thread Toon Moene

I have now got two of these:

- - - - - - - - - 8< - - - - - - - - - 8< - - - - - - - - -

Hi. This is the qmail-send program at sourceware.org.
I'm afraid I wasn't able to deliver your message to the following addresses.
This is a permanent error; I've given up. Sorry it didn't work out.

:
In an effort to cut down on our spam intake, we block email that is
detected as spam by the SpamAssassin program.  Your email was flagged as
spam by that program.  See: http://spamassassin.apache.org/ for more
details.
See http://sourceware.org/lists.html#sourceware-list-info for more 
information.


If you are not a "spammer", we apologize for the inconvenience.
You can add yourself to the gcc.gnu.org "global allow list"
by sending email *from*the*blocked*email*address* to:

global-allow-subscribe-toon=moene@gcc.gnu.org

For certain types of blocks, this will enable you to send email without
being subjected to further spam blocking.  This will not allow you to
post to a list if you have been explicitly blocked, if you are posting
an off-topic message, if you are sending an attachment that looks like a
virus, etc.

Contact gcc-testresults-ow...@gcc.gnu.org if you have questions about 
this. (#5.7.2)


--- Below this line is a copy of the message.

Return-Path: 
Received: (qmail 30906 invoked by uid 89); 14 Nov 2013 13:20:43 -
Authentication-Results: sourceware.org; auth=none
X-Virus-Checked: by ClamAV 0.98 on sourceware.org
X-Virus-Found: No
X-Spam-Flag: YES
X-Spam-SWARE-Status: Yes, score=5.7 required=5.0 
tests=AWL,BAYES_99,KAM_STOCKTIP,RDNS_NONE,URIBL_BLOCKED autolearn=no 
version=3.3.2
X-Spam-Status: Yes, score=5.7 required=5.0 
tests=AWL,BAYES_99,KAM_STOCKTIP,RDNS_NONE,URIBL_BLOCKED autolearn=no 
version=3.3.2

X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on sourceware.org
X-Spam-Level: *
X-HELO: moene.org
Received: from Unknown (HELO moene.org) (80.101.130.238)
 by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA 
encrypted) ESMTPS; Thu, 14 Nov 2013 13:20:16 +

Received: from toon by moene.org with local (Exim 4.80)
(envelope-from )
id 1Vgwq6-00047o-5E
for gcc-testresu...@gcc.gnu.org; Thu, 14 Nov 2013 14:20:06 +0100
To: gcc-testresu...@gcc.gnu.org
Subject: FAILED: Bootstrap (build config: ubsan; languages: fortran; 
trunk revision 204790) on x86_64-unknown-linux-gnu

Message-Id: 
From: Toon Moene 
Date: Thu, 14 Nov 2013 14:20:06 +0100

none needed
if [ x"-fpic" != x ]; then \
	  /scratch/toon/bd4979/./prev-gcc/xgcc 
-B/scratch/toon/bd4979/./prev-gcc/ 
-B/home/toon/compilers/install/x86_64-unknown-linux-


- - - - - - - - - 8< - - - - - - - - - 8< - - - - - - - - -

whereas the one of today succeeded in getting through:

http://gcc.gnu.org/ml/gcc-testresults/2013-11/msg01098.html

Do I need to worry ?

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


OpenACC in GCC - how does it not violate the license?

2013-11-16 Thread Alec Teal

Hey all,

I got linked this by a friend today:
http://www.mentor.com/embedded-software/blog/post/we-are-bringing-openacc-to-the-gnu-compiler-suite--8d06289f-c4e9-44c8-801b-7a11496e7300

It seems to suggest that GCC can target Nvidia GPUs. To quote:
or OpenACC 2.0 in GCC, , and generating assembly level instructions 
for an NVIDIA GPU. Let’s not underestimate the effort involved, which 
includes teaching GCC how to parse OpenACC directives, how to 
translate the directives into appropriate blocks of code and data 
migration, and how to generate instructions for the target device itself.
Now while great, is this true!? Nvidia (IIRC, this was like a year ago 
though) don't even give out the instruction set for their GPUs; can we 
have GCC targeting closed things? Also, there (must be and is) a CUDA 
runtime; wouldn't we need an open runtime to link against?


To quote again:
Duncan Poole is responsible for strategic partnerships for NVIDIA’s 
Accelerated Computing Division. His responsibilities reach across the 
developer tool chain

(the stuff after that quote is just guff)

This is by no means an accusation; I'm sure he's doing fine work. But 
he's doing something I didn't think the GPLv3 allowed (so I want to be 
corrected): he seems to have added something that requires a closed 
runtime for a target with a closed instruction set, probably for Nvidia 
(as he is responsible for "strategic partnerships" with them).


I do try and live my life entirely within free software, it means I 
never have to care about these things. Sorry for my ignorance.


Also a search for OpenACC produced very little.

Alec





Re: Enable -Wreturn-type by default ?

2013-11-16 Thread Alec Teal

Who isn't compiling with -Wall and -Wextra?

I do hope Clang (though I don't use it) doesn't make it an error, 
because not all functions have to return in C, IIRC.


Alec

On 13/11/13 16:42, Sylvestre Ledru wrote:

Hello,

I would like to propose activating -Wreturn-type by default.

The main objective would be to provide a warning for code such as:

int foo() {
return;
}

For now, it is only enabled when we have -Wall:
$ gcc -c foo.c
$ gcc -Wall -c foo.c
foo.c: In function ‘foo’:
foo.c:2:2: warning: ‘return’ with no value, in function returning
non-void [-Wreturn-type]


I have already written the patch, but a lot of tests need to be updated to pass
with this change.
That is why, before starting to update all of them, I would like to know if
there is a consensus here.

This bug discusses it:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55189 The idea seems to be
accepted.
Clang treats this kind of mistake as an error.

Sylvestre
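The two situations -Wreturn-type diagnoses, and their minimal fixes, can be illustrated as follows (a hand-written example, not taken from the patch):

```c
#include <assert.h>

/* Case 1: `return;` in a function returning int (the example above).
   The fix is to return an actual value.  */
int
foo (void)
{
  return 0;
}

/* Case 2: -Wreturn-type also warns when control can fall off the end
   of a non-void function.  The fix is to cover every path.  */
int
sign (int x)
{
  if (x < 0)
    return -1;
  return 1;
}
```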





Re: OpenACC in GCC - how does it not violate the license?

2013-11-16 Thread Jeff Law

On 11/16/13 21:58, Alec Teal wrote:

Now while great, is this true!? Nvidia (IIRC, this was like a year ago
though) don't even give out the instruction set for their GPUs, can we
have GCC targeting closed things? Also there (must be and is) a Cuda
runtime, wouldn't we need an open runtime to link against?
The various projects looking at supporting OpenACC are, to the best of 
my knowledge, targeting PTX, a published virtual ISA from NVidia.


Going from PTX to the actual instructions for the particular GPU is the 
job of a runtime system which would be provided by NVidia.


However, there's no reason why OpenACC couldn't target the host CPU or 
another GPU.  In fact, that's what I'd initially do if I were working on 
this.




This is by no means an accusation, I'm sure he's doing fine work; but
he's doing something I didn't think the GPLv3 allowed (so I want to be
corrected) he seems to have added something that requires a closed
runtime for a target with a closed instruction set - probably for Nvidia
(as he is responsible for "strategic partnerships" with them)

To answer that question you'd need to talk to your lawyer.

jeff


Re: OpenACC in GCC - how does it not violate the license?

2013-11-16 Thread Dmitry Mikushin
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Alec,

> Nvidia (IIRC, this was like a year ago though) don't even give out
the instruction set for their GPUs

I understand you don't want to be bound to the PTX virtual assembler, as
its conversion to GPU native assembler relies on a proprietary component.
Neither do we.

According to our experience [1], complete decoding of the NVIDIA ISA could
be performed in reasonable time using NVIDIA's disassembler,
automation tools and some binary-level hooking. Similar things should be
done at PathScale and Nouveau. If a disassembler is also provided
for Maxwell, then an OpenACC implementation would have enough to support
the native ISAs of all known GPU families from 2008 to 2014.

The GPU driver/runtime is trickier, but there is the Gdev project [2].

[1]
http://hpcforge.org/scm/viewvc.php/*checkout*/doc/opennvisa/opennvisa.pdf?root=kernelgen
[2] https://github.com/shinpei0208/gdev

- - D.

On 11/17/2013 05:58 AM, Alec Teal wrote:
> Hey all,
> 
> I got linked this by a friend today: 
> http://www.mentor.com/embedded-software/blog/post/we-are-bringing-openacc-to-the-gnu-compiler-suite--8d06289f-c4e9-44c8-801b-7a11496e7300
>
> 
> 
> It seems to suggest that GCC can target Nvidia GPUs To quote:
>> or OpenACC 2.0 in GCC, , and generating assembly level
>> instructions for an NVIDIA GPU. Let’s not underestimate the
>> effort involved, which includes teaching GCC how to parse OpenACC
>> directives, how to translate the directives into appropriate
>> blocks of code and data migration, and how to generate
>> instructions for the target device itself.
> Now while great, is this true!? Nvidia (IIRC, this was like a year
> ago though) don't even give out the instruction set for their GPUs,
> can we have GCC targeting closed things? Also there (must be and
> is) a Cuda runtime, wouldn't we need an open runtime to link
> against?
> 
> To quote again:
>> Duncan Poole is responsible for strategic partnerships for
>> NVIDIA’s Accelerated Computing Division. His responsibilities
>> reach across the developer tool chain
> (the stuff after that quote is just guff)
> 
> This is by no means an accusation, I'm sure he's doing fine work;
> but he's doing something I didn't think the GPLv3 allowed (so I
> want to be corrected) he seems to have added something that
> requires a closed runtime for a target with a closed instruction
> set - probably for Nvidia (as he is responsible for "strategic
> partnerships" with them)
> 
> I do try and live my life entirely within free software, it means
> I never have to care about these things. Sorry for my ignorance.
> 
> Also a search for OpenACC produced very little.
> 
> Alec
> 
> 
> 

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJSiFSoAAoJENwm3+sbf/pMZHoH+wZjlFIo8U+jbAtPj5enuo4F
bFOBm/oJAikbB5UkHapG8UDwnpdnkaD7yba6Si4bT7SI/mtUQVMjctL3rmSur8r7
qIikj/S0R/DUj9RBq2li6w+SEYIN4nu/fqIbNNrj8KonR1ROeLwuiv3F1MQsOyZ/
e/kW0mN/0PioX1kB0jwvPFwxjDQPhHmHY1LZvdU1ZaQoCwSlujO+efJQ9ass8T8a
Z6SmFlKKdlT8JotnWrdTBOV2wVsRVyc8hER3z8izbE81DFOE+RwAjuZHsJLxwF/1
esmwKQgoOv7Lu3p+dH7i6jgEh+FzmI5wCPJ51SupFGH+6r3VAKNGmGxcpK4wODM=
=1lgD
-END PGP SIGNATURE-