Re: GCC 4.7.0 Status Report (2011-10-27), Stage 1 will end Nov 7th

2011-11-15 Thread Michael Zolotukhin
Hello!

The x86-specific part of this patch was committed to trunk recently.
There is also a target-independent part, which covers memset/memcpy for
the smallest sizes (from 1 to ~256 bytes). In contrast to the existing
implementation, it has a cost model to choose the fastest move mode
(which could be a vector move mode). This helps to increase
performance for small sizes - these cases are especially important,
because libcalls can't be used efficiently here due to call overhead.

Could one of the middle-end maintainers review it once I have updated
it to the latest trunk changes?

Thanks!

On 27 October 2011 17:24, Uros Bizjak  wrote:
> Hello!
>
>> The GCC trunk is still in stage1.  Stage1 will last until
>> Nov 7th (including, use your timezone to your advantage) after
>> which we will have been in stage1 for nearly 8 months.
>> In stage3 the trunk will be open for general bugfixing; no
>> new features will be accepted.
>
> There is a patch that implements usage of vector instructions in
> memmove/memset expansion [1]. The patch was not reviewed for quite some
> time, but IIRC, we said that patches that were submitted before Stage
> 1 closes are still eligible for later stages (after a review of
> course).
>
> I think that this feature certainly improves gcc (also taking into
> account recent glibc changes in this area), and IMO implements an
> important feature for recent processors. I would like to motivate
> middle-end and target maintainers to consider the patch for a review
> before stage 1 closes, and ultimately ask Release Managers to decide
> how to proceed with this patch.
>
> [1] http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02392.html
>
> Thanks,
> Uros.
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


Re: [RFC] Offloading Support in libgomp

2013-09-13 Thread Michael Zolotukhin
Hi Jakub et al.,
We have prepared a draft design document for offloading support in GCC - could
you please take a look?  It is intended to give a common understanding of what
is going on in this part of the compiler.

We might publish it on the GCC wiki, if that is OK.  And later we could fill it
in with more details if needed.

Here it is:
--
CONTENTS

1.  High level view on the compilation process with openmp plugins
1.1.  Compilation
1.2.  Linking
1.3.  Execution
2.  Linker plugins infrastructure
2.1.  Overview
2.2.  Multi-target support
3.  OpenMP pragma target handling in middle-end
4.  Runtime support in libGOMP
4.1.  General interface for offloading
4.2.  Maintaining info about mapped regions
4.3.  Preparing arguments for offloaded calls
4.4.  Plugins, usage of device-specific interfaces

1. HIGH LEVEL VIEW ON THE COMPILATION PROCESS WITH OPENMP PLUGINS

1.1.  Compilation

When the host version of GCC compiles a file, the following stages happen:
  * After OpenMP pragma lowering and expansion, a new outlined function with the
'target' attribute emerges - it will later be compiled by both the host and
target GCC to produce two versions (or N+1 versions in the case of N different
targets).
  * Expansion replaces the pragmas with corresponding calls to the runtime
library (libgomp).  These calls are preceded by initialization of special
structures containing the arguments for the outlined routines - this is done
similarly to 'pragma omp parallel' processing.
  * GIMPLE for routines with the 'target' attribute is streamed into a special
section of the object file (similar to the LTO sections).
  * Usual compilation continues, producing host-side assembly.
  * The assembler generates a FAT object, containing the host-side code and the
GIMPLE IR for the outlined functions (the ones marked with the 'target'
attribute).

TODO: add something about routines and variables inside 'pragma declare target'.

1.2.  Linking

When all source files are compiled, the linker is invoked.  The linker is
passed a special option to invoke the openmp plugin.  The plugin is responsible
for producing the target-side executables - for each target it calls the
corresponding target compiler and linker.
The target-side GCC is invoked to load the GIMPLE IR from the .gnu.target_lto
sections of the FAT object and compile it to target-side objects, which will
later be used by the target-side linker.

The host-side linker needs libgomp alongside the standard libraries like
libc/libm to successfully resolve the symbols generated by the host compiler.
The target-side linker needs a CRT object, containing the main routine for the
target-side executable, and target-specific versions of the standard libraries.

As a result of this work, the plugin produces N target executables and exits,
allowing the host linker to continue its work and produce the host-side
executable.

TBD: Should the main routine always contain a message-waiting loop (like in the
COI implementation), or are other options also possible?
TBD: It is probably better to have a separate plugin for each target rather
than a single openmp plugin.

1.3.  Execution

The host-side executable contains calls to the libgomp library, which mediates
all interactions with the target devices.
On loading, the executable calls GOMP_target_init from libgomp.so, which loads
the target executables onto the target devices and starts them.  From this
moment on, the devices are ready to execute the requested code and to interact
with the main host process.

When a host-side program calls libgomp functions related to offloading, libgomp
decides whether it is profitable to offload, and which device to choose for
that.  To do so, libgomp calls the available plugins and checks which devices
are ready to execute the offloaded code.  The available plugins should be
located in a specified folder and should implement a certain interface.

Another important function of libgomp is host-target memory mapping - keeping
information about the mapped regions and their types.

TBD: It is probably better to call GOMP_target_init on the first attempt to
offload something to the given device.
TBD: It is probably better to hard-code the available plugins during the build
of libgomp (e.g., at the configure step).


2.  LINKER PLUGINS INFRASTRUCTURE

2.1.  Overview

When the -flto or -fopenmp option is given to the GCC driver, linker plugin
invocation is triggered.  The plugin claims the input files containing
.gnu.lto* or .gnu.target_lto* sections for further processing and creates a
resolution file.
After this preliminary work, the LTO wrapper is called.  It is responsible for
the subsequent invocations of GCC.

The first call runs WPA, which performs the usual LTO partitioning as well as
partitioning of the OpenMP target sections.  WPA reads the bytecode of:
  1) all functions and variables with the "omp declare target" attribute;
  2) the outlined bodies of #pragma omp target, turned into '*.ompfn' functions;
  3) all the types, symtab entries, etc. needed for that;
from the .gnu.target_lto* sections and stores them into an extra partition.

The second call invokes GCC on the partitioned l

Re: inlined memcpy/memset degradation in gcc 4.6 or later

2012-10-02 Thread Michael Zolotukhin
Hi Walter,
I faced a similar problem when I worked on optimizing memcpy expansion
for x86.
The x86-specific expander also needed alignment info, and it was likewise
incorrect (i.e. too conservative).  The routine get_mem_align_offset () is
used there to determine the alignment, but at some point it started to
return 1-byte instead of the 16-byte (or whatever) alignment I expected.
I made a small fix for it, and it seemed to work well again:
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 9565c61..9108022 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -1516,6 +1516,15 @@ get_mem_align_offset (rtx mem, unsigned int align)
       if (TYPE_ALIGN (TREE_TYPE (expr)) < (unsigned int) align)
 	return -1;
     }
+  else if (TREE_CODE (expr) == MEM_REF)
+    {
+      unsigned int al;
+      unsigned HOST_WIDE_INT off;
+      get_object_alignment_1 (expr, &al, &off);
+      off /= BITS_PER_UNIT;
+      if (al < align)
+	return -1;
+    }
   else if (TREE_CODE (expr) == COMPONENT_REF)

So, returning to your problem: probably the routines you mentioned also
don't handle MEM_REF (and before some commit they didn't have to).
Also, you could look into the routine I mentioned - you might find
something useful for your case there.

---
Thanks, Michael

On 2 October 2012 18:19, Walter Lee  wrote:
>
> On TILE-Gx, I'm observing a degradation in inlined memcpy/memset in
> gcc 4.6 and later versus gcc 4.4.  Though I find the problem on
> TILE-Gx, I think this is a problem for any architecture with
> SLOW_UNALIGNED_ACCESS set to 1.
>
> Consider the following program:
>
> struct foo {
>   int x;
> };
>
> void copy(struct foo* f0, struct foo* f1)
> {
>   memcpy (f0, f1, sizeof(struct foo));
> }
>
> In gcc 4.4, I get the desired inline memcpy:
>
> copy:
> ld4sr1, r1
> st4 r0, r1
> jrp lr
>
> In gcc 4.7, however, I get inlined byte-by-byte copies:
>
> copy:
> ld1u_add r10, r1, 1
> st1_add  r0, r10, 1
> ld1u_add r10, r1, 1
> st1_add  r0, r10, 1
> ld1u_add r10, r1, 1
> st1_add  r0, r10, 1
> ld1u r10, r1
> st1  r0, r10
> jrp  lr
>
> The inlining of memcpy is done in expand_builtin_memcpy in builtins.c.
> Tracing through that, I see that the alignment of src_align and
> dest_align, which is computed by get_pointer_alignment, has degraded:
> in gcc 4.4 they are 32 bits, but in gcc 4.7 they are 8 bits.  This
> causes the loads generated by the inlined memcpy to be per-byte
> instead of per-4-byte.
>
> Looking further, gcc 4.7 uses the "align" field in "struct
> ptr_info_def" to compute the alignment.  This field appears to be
> initialized in get_ptr_info in tree-ssanames.c but it is always
> initialized to 1 byte and does not appear to change.  gcc 4.4 computes
> its alignment information differently.
>
> I get the same byte-copies with gcc 4.8 and gcc 4.6.
>
> I see a couple related open PRs: 50417, 53535, but no suggested fixes
> for them yet.  Can anyone advise on how this can be fixed?  Should I
> file a new bug, or add this info to one of the existing PRs?
>
> Thanks,
>
> Walter
>



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


Re: Adding Rounding Mode to Operations Opcodes in Gimple and RTL

2013-01-10 Thread Michael Zolotukhin
Thanks for the responses!
I'll think about your warnings and decide whether I can afford such an
effort, but in any case I wanted to start with GCC, not glibc.
Am I right that, before any other work, we need to fix PR 34678 (that's
the correct number, thanks Mark!), making all passes take into account
that calls could change rounding modes or raise exceptions, i.e. make
all calls optimization barriers?  At least when no 'aggressive' options
were passed to the compiler.

That work seems to be quite big, and it really could cause a huge
testsuite fallout, but once it is done, adding new subcodes for
operations with rounding shouldn't cause any new failures (as we won't
change the behavior of the 'default' operations).

Michael

On 19 December 2012 03:18, Joseph S. Myers  wrote:
> On Fri, 14 Dec 2012, Michael Zolotukhin wrote:
>
>> Currently, I think the problem could be tackled in the following way:
>> In gimple we'll need to add a pass that would a) find regions with
>> constant, compile-time known rounding mode, b) replace operations with
>> subcodes like plus/minus/div/etc. with the corresponding operations
>> with rounding (plus_ru, plus_rd etc.), c) remove fesetround calls if
>> the region doesn't contain instructions that could depend on rounding
>> mode.
>
> I'd say constant rounding mode optimization is pretty much a corner case -
> yes, constant rounding modes for particular code are useful in practice,
> but the bigger problem is making -frounding-math work reliably - stopping
> optimizations that are invalid when the rounding mode might change
> dynamically (and any call to an external function might happen to call
> fesetround) and, similarly, optimizations that are invalid when exceptions
> might be tested (you can't optimize away ((void) (a + b)) for
> floating-point when exceptions might be tested, for example, as it might
> raise an exception flag - again, any external function might test
> exceptions, or they might be tested after return from the function
> containing that expression).  Then there are probably bugs with libgcc
> functions not raising the right exceptions / handling rounding modes
> correctly, and lots of other issues of detail to address to get these
> things right (including a lot of testsuite work).
>
> Although constant rounding modes are probably more often useful in
> practice than dynamic modes, processor support for them is much more
> limited (although I think IA64 may have support for multiple rounding
> direction registers and specifying which is used in an instruction, which
> is the sort of thing that would help for constant modes).  And C99 / C11
> don't have C bindings for constant rounding modes - proposed bindings can
> be found in WG14 N1664, the current draft of the first part of a five-part
> Technical Specification for C bindings to IEEE 754-2008.
>
> As suggested above, GCC doesn't really have support for even the IEEE
> 754-1985 bindings in C99 / C11 Annex F - no support for the FENV_ACCESS
> pragma, and the -frounding-math -ftrapping-math options don't have all the
> desired effects.  When I've thought about implementation approaches I've
> largely thought about them from the initial correctness standpoint - how
> to add thorough testcases for all the various cases that need to be
> covered, and disabling optimizations fairly crudely for -frounding-math /
> -ftrapping-math as needed, before later trying to optimize.  There's the
> open question of whether the default set of options (which includes
> -ftrapping-math) would need to change to avoid default-options performance
> being unduly affected by making -ftrapping-math actually do everything it
> should for code testing exceptions.
>
> Although the 754-2008 draft bindings include constant rounding directions,
> most of those bindings are new library functions and macros.  I've thought
> a bit about what would be involved in implementing them properly in glibc
> (where you have the usual issues of everything needing implementing for
> five different floating-point types, and thorough testing for all those
> different types) - but given the size of such a project, have no immediate
> plans to work on it - there is a lot to be done on glibc libm as-is just
> to improve correctness for the existing functions (and a lot already done
> for glibc 2.16 and 2.17 in that regard).  (I'd guess each of (proper Annex
> F support in GCC; fixing the remaining correctness issues in glibc libm
> for return values and exceptions; and implementing the N1664 bindings)
> would likely be months of work.)
>
> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


Re: Adding Rounding Mode to Operations Opcodes in Gimple and RTL

2013-01-11 Thread Michael Zolotukhin
> Yes, doing much related to rounding modes really requires making the
> compiler respect them properly for -frounding-math.  That's not quite
> calls being optimization barriers in general, just for floating point.
>
> * General calls may set, clear or test exceptions, or manipulate the
> rounding mode (as may asms, depending on their inputs / outputs /
> clobbers).
>
> * Floating-point operations have the rounding mode as input.  They may set
> (but not clear or test) floating-point exception flags.
>
> * Thus in general floating-point operations may not be moved across most
> calls (or relevant asms), or values from one side of a call reused for the
> same operation with the same inputs appearing on the other side of the
> call.
>
> * Statements such as "(void) (a * b);" can't be eliminated because they
> may raise exceptions.  (That's purely about exceptions, not rounding
> modes.)

I think we may need some fake variables to reflect the current rounding
mode and exception flags.  These variables would then be updated at the
statements you listed above - i.e. we'll need to build def-use links
for them based on those statements.
I was also thinking of another approach: adding an attribute to the
call statement itself, but that could be difficult to take into account
in optimizations that work on def-use chains and do not iterate over
every statement.  As far as I understand, CCP is an example of such an
optimization.  Is that correct, or am I missing something?

> Personally I'd think a natural starting point on the compiler side would
> be to write a reasonably thorough and systematic testsuite for such
> issues.  That would cover all operations, for all floating-point types
> (including ones such as __float128 and __float80), and conversions between
> all pairs of floating-point types and either way between each
> floating-point type and each integer type (including __int128 / unsigned
> __int128), with operands being any of (constants, non-volatile variables
> initialized with constants, volatile variables, vectors) and results being
> (discarded, stored in non-volatile variables, stored in volatile
> variables), in all the rounding modes, testing both results and exceptions
> and confirming proper results when an operation is repeated after changes
> of rounding mode or clearing exceptions.

We mostly have problems when there is an 'interaction' between
different rounding modes - so a ton of tests that check the correctness
of a single operation in a specific rounding mode won't catch them.  We
could place all such tests in one file/function, so that the compiler
would transform it as it does now and we would catch the failure - but
in that case we don't need many tests.

So, generally I like the idea of having tests covering all the cases
and then fixing them one by one, but I didn't catch what these tests
would be beyond the ones from the trackers - it seems useless to have
a bunch of tests, each of which contains a single operation and
compares the result, even if we have a version of such a test for all
data types and rounding modes.

For now I see one general problem (that calls aren't treated as
something that could change the result of FP operations), and it
definitely needs a test, but I don't see the need for many tests here.

---
Thanks, Michael



On 10 January 2013 22:04, Joseph S. Myers  wrote:
> On Thu, 10 Jan 2013, Michael Zolotukhin wrote:
>
>> Thanks for the responses!
>> I'll think about your warnings and decide whether I could afford such
>> effort or not, but anyway, I wanted to start from GCC, not glibc.
>> Am I getting it right, that before any other works we need to fix PR
>> 34678 (that's correct number, thanks Mark!), making all passes take
>> into account that calls could change rounding-modes/raise exceptions,
>> i.e. make all calls optimization barriers? At least, when no
>> 'aggressive' options were passed to the compiler.
>
> There are various overlapping bugs in Bugzilla for issues where
> -frounding-math -ftrapping-math fail to implement all of FENV_ACCESS (I
> think of exceptions and rounding modes support together, since they have
> many of the same issues and both are covered by FENV_ACCESS, although it
> may be possible to fix only a subset of the issues, e.g. just rounding
> modes without exceptions).  To what extent the issues really duplicate
> each other isn't entirely clear; I'd advise looking at all the testcases
> in all relevant PRs (both open (576 20785 27682 29186 30568 34678), and
> others marked as duplicates of open ones), even if you then end up only
> working on a subset of the problems.
>
> Yes, doing much related to rounding modes really requires making the
> compiler respect them properly for

Re: Adding Rounding Mode to Operations Opcodes in Gimple and RTL

2013-01-18 Thread Michael Zolotukhin
Sure, the tests are of utmost importance here.  By the way, in which
test suite should they be placed?

As for the changes in the compiler itself: what do you think about
introducing a fake variable reflecting the rounding mode (similar
variables could be introduced for exception flags and other
properties)?  How difficult would it be to do that (we'll need to add
implicit dependencies from those variables to all FP operations and
kill their values after each call)?

Thanks, Michael

PS: I'll return to this discussion in two weeks, after my vacations.

On 17 January 2013 20:54, Joseph S. Myers  wrote:
> I should add that the recent observation of bugs on some platforms with
> unordered comparisons being wrongly used instead of ordered ones
> illustrates my point about the value of having proper test coverage for
> each individual operation, even though some bugs will only show in more
> complicated code.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.