Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Doug Gilmore writes:
> On 02/24/2014 10:42 AM, Richard Sandiford wrote:
>> ...
>>> AIUI the old form never really worked reliably due to things like
>>> newlib's setjmp not preserving the odd-numbered registers, so it
>>> doesn't seem worth keeping around. Also, the old form is identified
>>> by the GNU attribute (4, 4) so it'd be easy for the linker to reject
>>> links between the old and the new form.
>>>
>>> That is true. You will have noticed a number of changes over recent
>>> months to start fixing fp64 as currently defined, but having found
>>> this new solution such fixes are no longer important. The lack of
>>> support for gp32 fp64 in Linux is a further reason to permit
>>> redefining it. Would you be happy to retain the same builtin defines
>>> for FP64 if changing its behaviour (i.e. __mips_fpr=64)?
>>
>> I think that should be OK. I suppose a natural follow-on question
>> is what __mips_fpr should be for -mfpxx. Maybe just 0?
>
> I think we should think carefully about just making -mfp64 disappear.
> The support has existed for bare iron for quite a while, and we do
> internal testing of MSA using -mfp64. I'd rather avoid a flag day.
> It would be good to continue recognizing that object files with
> attribute (4, 4) (-mfp64) are not compatible with other objects.

Right, that was the idea. (4, 4) would always mean the current form of
-mfp64 and the linker would reject links between (4, 4) and the new
-mfp64 form.

The flag day was more on the GCC and GAS side. I don't see the point in
supporting both forms there at the same time, since it significantly
complicates the interface and since AIUI the old form was never really
suitable for production use.

Thanks,
Richard
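As a quick illustration of the builtin define being discussed, here is
how source code might test __mips_fpr under the proposal, assuming the
value 64 is retained for the new -mfp64 form and the suggested value 0
is used for -mfpxx (the -mfpxx value is only a suggestion at this point):

    /* Dispatch on the FP register model at compile time.
       __mips_fpr == 32 -> FR0 (32-bit FP registers)
       __mips_fpr == 64 -> FR1 (64-bit FP registers, new -mfp64 form)
       __mips_fpr == 0  -> proposed value for modeless -mfpxx code  */
    #if !defined(__mips_fpr) || __mips_fpr == 32
      /* Traditional FR0 code: odd-numbered singles are usable.  */
    #elif __mips_fpr == 64
      /* FR1 code: all 32 double-precision registers are available.  */
    #elif __mips_fpr == 0
      /* Modeless code: stay within the FR0/FR1 common subset.  */
    #endif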
RE: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Richard Sandiford writes:
> Doug Gilmore writes:
>> On 02/24/2014 10:42 AM, Richard Sandiford wrote:
>>> ...
>>>> AIUI the old form never really worked reliably due to things like
>>>> newlib's setjmp not preserving the odd-numbered registers, so it
>>>> doesn't seem worth keeping around. Also, the old form is identified
>>>> by the GNU attribute (4, 4) so it'd be easy for the linker to reject
>>>> links between the old and the new form.
>>>>
>>>> That is true. You will have noticed a number of changes over recent
>>>> months to start fixing fp64 as currently defined, but having found
>>>> this new solution such fixes are no longer important. The lack of
>>>> support for gp32 fp64 in Linux is a further reason to permit
>>>> redefining it. Would you be happy to retain the same builtin defines
>>>> for FP64 if changing its behaviour (i.e. __mips_fpr=64)?
>>>
>>> I think that should be OK. I suppose a natural follow-on question is
>>> what __mips_fpr should be for -mfpxx. Maybe just 0?
>>
>> I think we should think carefully about just making -mfp64 disappear.
>> The support has existed for bare iron for quite a while, and we do
>> internal testing of MSA using -mfp64. I'd rather avoid a flag day.
>> It would be good to continue recognizing that object files with
>> attribute (4, 4) (-mfp64) are not compatible with other objects.
>
> Right, that was the idea. (4, 4) would always mean the current form of
> -mfp64 and the linker would reject links between (4, 4) and the new
> -mfp64 form.
>
> The flag day was more on the GCC and GAS side. I don't see the point
> in supporting both forms there at the same time, since it significantly
> complicates the interface and since AIUI the old form was never really
> suitable for production use.

That sounds OK to me.

I'm aiming to have an experimental implementation of the calling
convention changes as soon as possible, although I am having
difficulties getting the frx calling convention working correctly.

The problem is that frx needs to treat registers as 64-bit sometimes and
32-bit at other times:

a) I need the aliasing that 32-bit registers give me (use of an
even-numbered double clobbers the corresponding odd-numbered single).
This is to prevent both the double and the odd-numbered single being
used simultaneously.

b) I need the 64-bit register layout to ensure that 64-bit values in
caller-saved registers are saved as 64-bit (rather than 2x32-bit) and
that 32-bit registers are saved as 32-bit and never combined into a
64-bit save. caller-save.c flattens the caller-save problem down to
look at only hard registers, not modes, which is frustrating.

It looks like caller-save.c would need a lot of work to achieve b) with
32-bit hard registers, but I equally don't know how I could achieve a)
for 64-bit registers. I suspect a) is marginally easier to solve in the
end, but I would have to find a way to say that using register x as
64-bit prevents allocation of x+1 as 32-bit despite registers being
64-bit. The easy option is to go for 64-bit registers and never use
odd-numbered registers for single precision or double precision, but I
don't really want frx to be limited to that if at all possible. Any
suggestions?

The special handling for callee-saved registers is not a problem (I
think) as it is all backend code (assuming a or b is resolved).

Regards,
Matthew
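For readers unfamiliar with the FR0 register model, the aliasing in
point a) can be sketched in plain C: an even/odd pair of 32-bit FPRs
overlays the storage of one double, so writing the double clobbers the
odd single. This models only the hardware register layout, not the GCC
allocation problem itself:

    #include <stdio.h>

    /* FR0 layout: the even single, the odd single and the
       even-numbered double all share one 64-bit storage cell.  */
    union fr0_pair {
      double d;     /* even-numbered double                  */
      float  s[2];  /* s[0] = even single, s[1] = odd single */
    };

    int main(void)
    {
      union fr0_pair p;
      p.s[1] = 1.0f;           /* value live in the odd single   */
      p.d    = 2.0;            /* writing the double clobbers it */
      printf("%f\n", p.s[1]);  /* no longer prints 1.000000      */
      return 0;
    }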
About gsoc 2014 OpenMP 4.0 Projects
Hello,

I'm a master's student in high-performance computing at the Barcelona
Supercomputing Center, and I'm working on my thesis on implementing the
OpenMP accelerator model in our compiler (OmpSs). I have almost finished
implementing all of the new directives to generate CUDA code, and the
corresponding OpenCL implementation shouldn't take much longer given my
design. But I haven't yet tried Intel MIC, APUs, or other hardware
accelerators :)

I'm now benchmarking the kernel code generated by my compiler. Although
the generated kernels are fairly naive, the speedups are not bad: when I
compare results against the HMPP OpenACC 3.2.x compiler, the speedups
are almost the same, and in some cases my results are slightly better.
That's why this term I am going to work on compiler-level and
runtime-level optimizations for GPUs.

When I looked at the GCC OpenMP 4.0 project ideas, I couldn't see
anything about code generation. Are you going to announce that later?
Or should I apply to GSoC with my own idea about code generation and
device code optimizations?

Güray Özen
~grypp
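For context, here is a minimal example of the OpenMP 4.0
accelerator-model directives under discussion; this is standard OpenMP
4.0 syntax, not code from either compiler, and the loop body is the kind
of region a compiler would lower to a CUDA or OpenCL kernel:

    /* Offload a vector addition to the default accelerator device.  */
    void vadd(int n, const float *a, const float *b, float *c)
    {
      #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
      #pragma omp teams distribute parallel for
      for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }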
Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Matthew Fortune writes:
>>>> If we do end up using ELF flags then maybe adding two new EF_MIPS_ABI
>>>> enums would be better. It's more likely to be trapped by old loaders
>>>> and avoids eating up those precious remaining bits.
>>>
>>> Sounds reasonable, but I'm still trying to determine how this
>>> information can be propagated from loader to dynamic loader.
>>
>> The dynamic loader has access to the ELF headers so I didn't think it
>> would need any help.
>
> As I understand it the dynamic loader only has specific access to the
> program headers of the executable, not the ELF headers. There is no
> question that the dynamic loader has access to DSO ELF headers, but we
> need the start point too.

Sorry, forgot about that. In that case maybe program headers would be
best, like you say. I.e. we could use a combination of GNU attributes
and a new program header, with the program header hopefully being more
general than for just this case. I suppose this comes back to the
thread from binutils@ last year about how to manage the dwindling number
of free flags:

https://www.sourceware.org/ml/binutils/2013-09/msg00039.html
to https://www.sourceware.org/ml/binutils/2013-09/msg00099.html

>>>> You didn't say specifically how a static program's crt code would
>>>> know whether it was linked as modeless or in a specific FR mode.
>>>> Maybe the linker could define a special hidden symbol?
>>>
>>> Why do you say crt rather than dlopen? The mode requirement should
>>> only matter if you want to change it, and dlopen should be able to
>>> access information in the same way that a dynamic linker would. It
>>> may seem redundant, but perhaps we end up having to mark an executable
>>> with mode requirements in two ways: the primary one being the ELF flag
>>> and the secondary one being a processor-specific program header. The
>>> ELF flags are easy to use/already used for the program loader and when
>>> scanning the needs of an object being loaded, but the program header
>>> is something that is easy to inspect for an already-loaded object.
>>> Overall though, a new program header would be sufficient in all cases,
>>> with a few modifications here and there.
>>
>> Sorry, what I meant was: how would an executable built with -static be
>> handled? And I was assuming it would be up to the executable's startup
>> code to set the FR mode. That startup code (from glibc) would normally
>> be modeless itself but would need to know whether any FR0 or FR1
>> objects were linked in. (FWIW ifuncs have a similar problem: without
>> the loader to help, the startup code has to resolve the ifuncs itself.
>> The static linker defines special symbols around a block of IRELATIVE
>> relocs and then the startup code applies those relocs in a similar way
>> to the dynamic linker. I was thinking a linker-defined symbol could be
>> used to record the FR mode too.)
>>
>> But perhaps you were thinking of getting the kernel to set the FR mode
>> instead?
>
> I was thinking the kernel would set an initial FR mode that was at
> least compatible with the ELF flags. Do you feel all this should be
> done in user space? We only get user-mode FR control in MIPS r5 so
> this would make it more challenging to get into FR1 mode for MIPS32r2.
> I'd prefer not to be able to load an FR1 program than crash in the crt
> while trying to turn it on. There is however some expectation that the
> kernel would trap and emulate UFR on MIPS32r2 for the dynamic loader
> case anyway.

Right -- the kernel needs to let userspace change FR if the dynamic
loader case is going to work. And I think if it's handled by userspace
for dynamic executables then it should be handled by userspace for
static ones too. Especially since the mechanism used for static
executables would then be the same as for bare metal, meaning that we
only really have two cases rather than three.

> Is it OK to continue these library-related discussions here or should I
> split the bare metal handling to newlib and the Linux libraries to
> glibc? There is value in keeping things together but equally it is
> perhaps off topic.

Not sure TBH, but no one's complained so far :-)

Thanks,
Richard
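For reference, here is a sketch of the static-link ifunc mechanism
Richard describes, as it works on a RELA target; the
__rela_iplt_start/__rela_iplt_end symbols are the ones glibc's static
startup code actually uses, while __mips_fr_mode is purely hypothetical,
standing in for the proposed linker-defined FR-mode marker:

    #include <link.h>  /* ElfW() */

    /* Linker-defined bounds of the IRELATIVE relocations.  */
    extern const ElfW(Rela) __rela_iplt_start[], __rela_iplt_end[];

    /* Hypothetical linker-defined symbol recording the link-time FR
       requirement; illustration only, not an existing ABI symbol.  */
    extern const int __mips_fr_mode;

    static void apply_irel(void)
    {
      const ElfW(Rela) *r;
      for (r = __rela_iplt_start; r < __rela_iplt_end; r++)
        {
          /* The addend is the address of the ifunc resolver: call it
             and store the resolved address at the relocation target. */
          ElfW(Addr) *reloc_addr = (ElfW(Addr) *) r->r_offset;
          *reloc_addr = ((ElfW(Addr) (*)(void)) r->r_addend)();
        }
    }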
Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Matthew Fortune writes:
> That sounds OK to me.
>
> I'm aiming to have an experimental implementation of the calling
> convention changes as soon as possible, although I am having
> difficulties getting the frx calling convention working correctly.
>
> The problem is that frx needs to treat registers as 64-bit sometimes
> and 32-bit at other times:
>
> a) I need the aliasing that 32-bit registers give me (use of an
> even-numbered double clobbers the corresponding odd-numbered single).
> This is to prevent both the double and the odd-numbered single being
> used simultaneously.
>
> b) I need the 64-bit register layout to ensure that 64-bit values in
> caller-saved registers are saved as 64-bit (rather than 2x32-bit) and
> that 32-bit registers are saved as 32-bit and never combined into a
> 64-bit save. caller-save.c flattens the caller-save problem down to
> look at only hard registers, not modes, which is frustrating.
>
> It looks like caller-save.c would need a lot of work to achieve b) with
> 32-bit hard registers, but I equally don't know how I could achieve a)
> for 64-bit registers. I suspect a) is marginally easier to solve in
> the end, but I would have to find a way to say that using register x as
> 64-bit prevents allocation of x+1 as 32-bit despite registers being
> 64-bit. The easy option is to go for 64-bit registers and never use
> odd-numbered registers for single precision or double precision, but I
> don't really want frx to be limited to that if at all possible. Any
> suggestions?

Treating it as a limited form of FR0 mode seems best. I don't think
there's any practical way of doing (a) without making HARD_REGNO_NREGS
be 2 for a DFmode FPR, at which point any wrong assumptions about paired
registers in caller-save.c would kick in.

We'd only be making this change in the next release cycle, and we really
should look to move to LRA for that cycle too. caller-save.c is specific
to reload and so wouldn't be a problem. Of course, you might need to do
stuff in LRA instead.

Thanks,
Richard
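A sketch of what that might look like as a target macro; the
TARGET_FLOATXX name is invented for illustration, and the real MIPS
definition goes through a helper function with more cases:

    /* Number of consecutive hard registers needed for a value of MODE
       starting at REGNO.  Treating FPRs as 32 bits wide makes a DFmode
       value occupy an even/odd pair, so the allocator sees the
       clobbering overlap described in point a).  */
    #define HARD_REGNO_NREGS(REGNO, MODE)                          \
      (FP_REG_P (REGNO) && (TARGET_FLOATXX || !TARGET_FLOAT64)     \
       ? (GET_MODE_SIZE (MODE) + 3) / 4                            \
       : (GET_MODE_SIZE (MODE) + UNITS_PER_WORD - 1) / UNITS_PER_WORD)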
[GSoC] GCC has been accepted to GSoC 2014
Hi All,

GCC has been accepted as a mentoring organization for Google Summer of
Code 2014, and we are off to the races!

If you want to be a GCC GSoC student, check out the project ideas page
at http://gcc.gnu.org/wiki/SummerOfCode . Feel free to ask questions on
IRC [1] and get in touch with your potential mentors. If you are not
sure who to contact, send me an email at maxim.kuvyr...@linaro.org.

If you are a GCC developer, create a profile at
http://www.google-melange.com/gsoc/homepage/google/gsoc2014 to be able
to rank student applications. Once registered, connect with the "GCC -
GNU Compiler Collection" organization. If you actively want to mentor a
student project, note so in your GSoC connection request.

If you have any questions or comments, please contact your friendly GSoC
admin via IRC (maximk), email (maxim.kuvyr...@linaro.org) or
Skype/Hangouts.

Thank you,

[1] irc://irc.oftc.net/#gcc

--
Maxim Kuvyrkov
www.linaro.org
RE: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
> Matthew Fortune writes:
>>>>>> If we do end up using ELF flags then maybe adding two new
>>>>>> EF_MIPS_ABI enums would be better. It's more likely to be trapped
>>>>>> by old loaders and avoids eating up those precious remaining bits.
>>>>>
>>>>> Sounds reasonable, but I'm still trying to determine how this
>>>>> information can be propagated from loader to dynamic loader.
>>>>
>>>> The dynamic loader has access to the ELF headers so I didn't think
>>>> it would need any help.
>>>
>>> As I understand it the dynamic loader only has specific access to the
>>> program headers of the executable, not the ELF headers. There is no
>>> question that the dynamic loader has access to DSO ELF headers, but
>>> we need the start point too.
>>
>> Sorry, forgot about that. In that case maybe program headers would be
>> best, like you say. I.e. we could use a combination of GNU attributes
>> and a new program header, with the program header hopefully being more
>> general than for just this case. I suppose this comes back to the
>> thread from binutils@ last year about how to manage the dwindling
>> number of free flags:
>>
>> https://www.sourceware.org/ml/binutils/2013-09/msg00039.html
>> to https://www.sourceware.org/ml/binutils/2013-09/msg00099.html
>>
>>>>>> You didn't say specifically how a static program's crt code would
>>>>>> know whether it was linked as modeless or in a specific FR mode.
>>>>>> Maybe the linker could define a special hidden symbol?
>>>>>
>>>>> Why do you say crt rather than dlopen? The mode requirement should
>>>>> only matter if you want to change it, and dlopen should be able to
>>>>> access information in the same way that a dynamic linker would. It
>>>>> may seem redundant, but perhaps we end up having to mark an
>>>>> executable with mode requirements in two ways: the primary one
>>>>> being the ELF flag and the secondary one being a processor-specific
>>>>> program header. The ELF flags are easy to use/already used for the
>>>>> program loader and when scanning the needs of an object being
>>>>> loaded, but the program header is something that is easy to inspect
>>>>> for an already-loaded object. Overall though, a new program header
>>>>> would be sufficient in all cases, with a few modifications here and
>>>>> there.
>>>>
>>>> Sorry, what I meant was: how would an executable built with -static
>>>> be handled? And I was assuming it would be up to the executable's
>>>> startup code to set the FR mode. That startup code (from glibc)
>>>> would normally be modeless itself but would need to know whether any
>>>> FR0 or FR1 objects were linked in. (FWIW ifuncs have a similar
>>>> problem: without the loader to help, the startup code has to resolve
>>>> the ifuncs itself. The static linker defines special symbols around
>>>> a block of IRELATIVE relocs and then the startup code applies those
>>>> relocs in a similar way to the dynamic linker. I was thinking a
>>>> linker-defined symbol could be used to record the FR mode too.)
>>>>
>>>> But perhaps you were thinking of getting the kernel to set the FR
>>>> mode instead?
>>>
>>> I was thinking the kernel would set an initial FR mode that was at
>>> least compatible with the ELF flags. Do you feel all this should be
>>> done in user space? We only get user-mode FR control in MIPS r5 so
>>> this would make it more challenging to get into FR1 mode for
>>> MIPS32r2. I'd prefer not to be able to load an FR1 program than
>>> crash in the crt while trying to turn it on. There is however some
>>> expectation that the kernel would trap and emulate UFR on MIPS32r2
>>> for the dynamic loader case anyway.
>>
>> Right -- the kernel needs to let userspace change FR if the dynamic
>> loader case is going to work. And I think if it's handled by
>> userspace for dynamic executables then it should be handled by
>> userspace for static ones too. Especially since the mechanism used
>> for static executables would then be the same as for bare metal,
>> meaning that we only really have two cases rather than three.

Although the dynamic case does mean mode switching must be possible at
user level, I do think it is important for the OS and the bare metal crt
to prepare an environment that is suitable for the original program,
including setting an appropriate FR mode. I would use the existing
support in Linux and bare metal for getting the FR mode correct for O32
vs N[32|64] as a basis for this. This initial guarantee would be quite
helpful, especially for a statically linked Linux userland, which simply
wouldn't need to worry.

I can understand the desire to keep the number of mechanisms for setting
the FR mode to a minimum, but the fact that bare metal runs privileged
and Linux userland runs unprivileged says to me that they will naturally
take different paths on some of this. There are other aspects such as
whether the kernel informs userland that UFR is available or not, via
HWCAPs, and consideration over what point we would want to see a failure
when mode requ
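As an illustration of the HWCAP route mentioned above, userland could
test for user-mode FR control roughly as follows; HWCAP_MIPS_UFR is a
hypothetical name and bit position, since no such flag had been
allocated at the time:

    #include <sys/auxv.h>

    /* Hypothetical HWCAP bit for user-mode FR (UFR) support.  */
    #define HWCAP_MIPS_UFR (1UL << 0)

    /* Nonzero if the kernel advertises user-mode FR control.  */
    static int ufr_available(void)
    {
      return (getauxval(AT_HWCAP) & HWCAP_MIPS_UFR) != 0;
    }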
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds wrote:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered
>
> Btw, don't get me wrong. I don't _like_ it not being ordered, and I
> actually did spend some time thinking about my earlier proposal on
> strengthening the 'consume' ordering.

Understood.

> I have for the last several years been 100% convinced that the Intel
> memory ordering is the right thing, and that people who like weak
> memory ordering are wrong and should try to avoid reproducing if at
> all possible. But given that we have memory orderings like Power and
> ARM, I don't actually see a sane way to get a good strong ordering.
> You can teach compilers about cases like the above when they actually
> see all the code and they could poison the value chain etc. But it
> would be fairly painful, and once you cross object files (or even just
> functions in the same compilation unit, for that matter), it goes from
> painful to just "ridiculously not worth it".

And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years. ;-)

> So I think the C semantics should mirror what the hardware gives us -
> and do so even in the face of reasonable optimizations - not try to do
> something else that requires compilers to treat "consume" very
> differently.

I am sure that a great many people would jump for joy at the chance to
drop any and all RCU-related verbiage from the C11 and C++11 standards.
(I know, you aren't necessarily advocating this, but given what you say
above, I cannot think what verbiage would remain.)

The thing that makes me very nervous is how much the definition of
"reasonable optimization" has changed. For example, before the 2.6.10
Linux kernel, we didn't even apply volatile semantics to fetches of
RCU-protected pointers -- and as far as I know, never needed to. But
since then, there have been several cases where the compiler happily
hoisted a normal load out of a surprisingly large loop.

Hardware advances can come into play as well. For example, my very
first RCU work back in the early '90s was on a parallel system whose
CPUs had no branch-prediction hardware (80386 or 80486, I don't remember
which). Now people talk about compilers using branch-prediction
hardware to implement value-speculation optimizations. Five or ten
years from now, who knows what crazy optimizations might be considered
completely reasonable?

Are ARM and Power really the bad boys here? Or are they instead playing
the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering.
> You can re-order as much as you want in hardware with speculation etc.,
> but you should always *check* your speculation and make it *look* like
> you did everything in order. Which is pretty much the Intel memory
> ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when
first presenting RCU to the Digital UNIX folks long ago, I do have some
sympathy with this line of thought. But as you say, it is not the world
we currently live in.

Of course, in the final analysis, your kernel, your call.

Thanx, Paul
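A concrete instance of the load-hoisting problem Paul describes, using
the kernel's ACCESS_ONCE() macro to force a volatile access:

    /* From include/linux/compiler.h: force exactly one access.  */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    extern int need_to_stop;

    /* Broken: the compiler may see that the loop never writes
       need_to_stop, hoist the load, and spin on a stale value.  */
    void wait_broken(void)
    {
      while (!need_to_stop)
        ; /* do work */
    }

    /* Fixed: the volatile access forces a fresh load per iteration.  */
    void wait_fixed(void)
    {
      while (!ACCESS_ONCE(need_to_stop))
        ; /* do work */
    }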
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
>
> So let me see if I understand your reasoning. My best guess is that it
> goes something like this:
>
> 1. The Linux kernel contains code that passes pointers from
>    rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well to
what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it used
to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming. Ignore C++ for a while (Jesus Xavier
Christ, I've had to do C++ programming for subsurface), and just think
about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation is,
and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are bad
abstractions.

And ok, so this is obviously Linux-specific in that it's generally only
Linux where I really care about the code generation, but I do think it's
a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I
> can think of is a big reason for my insistence on the
> carries-a-dependency crap. My lack of optimization omniscience makes
> me very nervous about relying on there never ever being a reasonable
> way of computing a given result without preserving the ordering.

But if I can give two clear examples that are basically identical from a
syntactic standpoint, and one clearly can be trivially optimized to the
point where the ordering guarantee goes away, and the other cannot, and
you cannot describe the difference, then I think your description is
seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal is
some kind of completeness of the language that is about the language,
not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless the
> compiler is told about it, which is of course the whole point of
> [[carries_dependency]]. Understood, though, the Linux kernel currently
> does not have anything that could reasonably automatically generate
> those [[carries_dependency]] attributes. (Or are there other reasons
> why you believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not
   matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually meant
   to work so that the current carries-a-dependency works even for
   insane expressions like "a-a".

In practice, it's one of those things where I guess nobody actually
would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
>     static struct foo variable1;
>     static struct foo variable2;
>     static struct foo *pp = &variable1;
>
>     T1: initialize_foo(&variable2);
>         atomic_store_explicit(&pp, &variable2, memory_order_release);
>         /* The above is the only store to pp in this translation unit,
>          * and the address of pp is not exported in any way.
>          */
>
>     T2: if (p == &variable1)
>             return p->val1; /* Must be variable1.val1. */
>         else
>             return p->val2; /* Must be variable2.val2. */
>
> My guess is that you
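For reference, litmus test 1 spelled out in real C11 syntax (the
thread's atomic_read(pp, consume) is shorthand); any C11 compiler with
<stdatomic.h> accepts this:

    #include <stdatomic.h>

    struct foo { int val; };
    struct foo variable;
    _Atomic(struct foo *) pp;

    int reader(void)
    {
      struct foo *p = atomic_load_explicit(&pp, memory_order_consume);
      if (p == &variable)
        /* Linus's point: a compiler may rewrite this as
           "return variable.val", loading through &variable instead of
           p and thereby breaking the address dependency.  */
        return p->val;
      return -1;
    }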
Re: [RFC][PATCH 0/5] arch: atomic rework
Paul E. McKenney wrote:
> Linus Torvalds wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible.
>
> Are ARM and Power really the bad boys here? Or are they instead
> playing the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that weak
memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but it ends up being no help to a more
aggressive implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but once
you go superscalar and add branch prediction, they stop helping, and
once you go full out-of-order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.

Another thing that requires all the strong-coherency machinery is a
high-performance implementation of the various memory barrier and
synchronization operations. Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency doesn't actually cost you any time
outside of critical synchronization code, and it both simplifies and
speeds up the tricky synchronization software.

So PPC and ARM's weak ordering are not the direction the future is
going. Rather, weak ordering is something that's only useful in a
limited technology window, which is rapidly passing.

If you can find someone at IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights. The big question is whether strong ordering, once you've
accepted the implementation complexity and area, actually costs anything
in execution time. If there's an unavoidable cost which weak ordering
saves, that's significant.
Re: [RFC][PATCH 0/5] arch: atomic rework
On 02/25/14 17:15, Paul E. McKenney wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible. But given that we have memory orderings like Power and
>> ARM, I don't actually see a sane way to get a good strong ordering.
>> You can teach compilers about cases like the above when they actually
>> see all the code and they could poison the value chain etc. But it
>> would be fairly painful, and once you cross object files (or even
>> just functions in the same compilation unit, for that matter), it
>> goes from painful to just "ridiculously not worth it".
>
> And I have indeed seen a post or two from you favoring stronger memory
> ordering over the past few years. ;-)

I couldn't agree more.

> Are ARM and Power really the bad boys here? Or are they instead
> playing the role of the canary in the coal mine?

That's a question I've been struggling with recently as well. I suspect
they (ARM, Power) are going to be the outliers rather than the canary.
While the weaker model may give them some advantages WRT scalability, I
don't think it'll ultimately be enough to overcome the difficulty in
writing correct low-level code for them.

Regardless, they're here and we have to deal with them.

Jeff
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
>>
>> So let me see if I understand your reasoning. My best guess is that
>> it goes something like this:
>>
>> 1. The Linux kernel contains code that passes pointers from
>>    rcu_dereference() through external functions.
>
> No, actually, it's not so much Linux-specific at all.
>
> I'm actually thinking about what I'd do as a compiler writer, and as a
> defender of the "C is a high-level assembler" concept.
>
> I love C. I'm a huge fan. I think it's a great language, and I think
> it's a great language not because of some theoretical issues, but
> because it is the only language around that actually maps fairly well
> to what machines really do.
>
> And it's a *simple* language. Sure, it's not quite as simple as it
> used to be, but look at how thin the "K&R book" is. Which pretty much
> describes it - still.
>
> That's the real strength of C, and why it's the only language serious
> people use for system programming. Ignore C++ for a while (Jesus
> Xavier Christ, I've had to do C++ programming for subsurface), and
> just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990. It was a lot
smaller then.

> I can look at C code, and I can understand what the code generation
> is, and what it will really *do*. And I think that's important.
> Abstractions that hide what the compiler will actually generate are
> bad abstractions.
>
> And ok, so this is obviously Linux-specific in that it's generally
> only Linux where I really care about the code generation, but I do
> think it's a bigger issue too.
>
> So I want C features to *map* to the hardware features they implement.
> The abstractions should match each other, not fight each other.

OK...

>> Actually, the fact that there are more potential optimizations than I
>> can think of is a big reason for my insistence on the
>> carries-a-dependency crap. My lack of optimization omniscience makes
>> me very nervous about relying on there never ever being a reasonable
>> way of computing a given result without preserving the ordering.
>
> But if I can give two clear examples that are basically identical from
> a syntactic standpoint, and one clearly can be trivially optimized to
> the point where the ordering guarantee goes away, and the other
> cannot, and you cannot describe the difference, then I think your
> description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case. Yes, I did notice that you find that
unacceptable.

> And I do *not* think the C language should be defined by how it can be
> described. Leave that to things like Haskell or LISP, where the goal
> is some kind of completeness of the language that is about the
> language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in. I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

>>> So the code sequence I already mentioned is *not* ordered:
>>>
>>> Litmus test 1:
>>>
>>>     p = atomic_read(pp, consume);
>>>     if (p == &variable)
>>>         return p->val;
>>>
>>> is *NOT* ordered, because the compiler can trivially turn this into
>>> "return variable.val", and break the data dependency.
>>
>> Right, given your model, the compiler is free to produce code that
>> doesn't order the load from pp against the load from p->val.
>
> Yes. Note also that that is what existing compilers would actually do.
>
> And they'd do it "by mistake": they'd load the address of the variable
> into a register, and then compare the two registers, and then end up
> using _one_ of the registers as the base pointer for the "p->val"
> access, but I can almost *guarantee* that there are going to be
> sequences where some compiler will choose one register over the other
> based on some random detail.
>
> So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality. I believe that it is useful
to constrain reality from time to time, but understand that you
vehemently disagree.

>> Indeed, it won't work across different compilation units unless the
>> compiler is told about it, which is of course the whole point of
>> [[carries_dependency]]. Understood, though, the Linux kernel
>> currently does not have anything that could reasonably automatically
>> generate those [[carries_dependency]] attributes. (Or are there
>> other reasons why you believe [[carries_dependency]] is problematic?)
>
> So I think carries_dependency is problematic because:
>
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of the standard not
>    matching the hardware.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
> Paul E. McKenney wrote:
>> Linus Torvalds wrote:
>>> I have for the last several years been 100% convinced that the Intel
>>> memory ordering is the right thing, and that people who like weak
>>> memory ordering are wrong and should try to avoid reproducing if at
>>> all possible.
>>
>> Are ARM and Power really the bad boys here? Or are they instead
>> playing the role of the canary in the coal mine?
>
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a
> simple implementation simpler, but it ends up being no help to a more
> aggressive implementation.
>
> Branch delay slots give a one-cycle bonus to in-order cores, but once
> you go superscalar and add branch prediction, they stop helping, and
> once you go full out-of-order, they're just an annoyance.
>
> Likewise, I can see the point that weak ordering can help make a
> simple cache interface simpler, but once you start doing speculative
> loads, you've already bought and paid for all the hardware you need
> to do stronger coherency.
>
> Another thing that requires all the strong-coherency machinery is a
> high-performance implementation of the various memory barrier and
> synchronization operations. Yes, a low-performance (drain the
> pipeline) implementation is tolerable if the instructions aren't used
> frequently, but once you're really trying, it doesn't save complexity.
>
> Once you're there, strong coherency doesn't actually cost you any time
> outside of critical synchronization code, and it both simplifies and
> speeds up the tricky synchronization software.
>
> So PPC and ARM's weak ordering are not the direction the future is
> going. Rather, weak ordering is something that's only useful in a
> limited technology window, which is rapidly passing.

That does indeed appear to be Intel's story. Might well be correct.
Time will tell.

> If you can find someone at IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights. The big question is whether strong ordering, once you've
> accepted the implementation complexity and area, actually costs
> anything in execution time. If there's an unavoidable cost which weak
> ordering saves, that's significant.

There has been a lot of ink spilled on this argument. ;-) PPC has much
larger CPU counts than does the mainframe. On the other hand, there are
large x86 systems. Some claim that there are differences in latency due
to the different approaches, and there could be a long argument about
whether all this is inherent in the memory ordering or whether it is due
to implementation issues. I don't claim to know the answer. I do know
that ARM and PPC are here now, and that I need to deal with them.

Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
> On 02/25/14 17:15, Paul E. McKenney wrote:
>>> I have for the last several years been 100% convinced that the Intel
>>> memory ordering is the right thing, and that people who like weak
>>> memory ordering are wrong and should try to avoid reproducing if at
>>> all possible. But given that we have memory orderings like Power
>>> and ARM, I don't actually see a sane way to get a good strong
>>> ordering. You can teach compilers about cases like the above when
>>> they actually see all the code and they could poison the value chain
>>> etc. But it would be fairly painful, and once you cross object files
>>> (or even just functions in the same compilation unit, for that
>>> matter), it goes from painful to just "ridiculously not worth it".
>>
>> And I have indeed seen a post or two from you favoring stronger
>> memory ordering over the past few years. ;-)
>
> I couldn't agree more.
>
>> Are ARM and Power really the bad boys here? Or are they instead
>> playing the role of the canary in the coal mine?
>
> That's a question I've been struggling with recently as well. I
> suspect they (ARM, Power) are going to be the outliers rather than the
> canary. While the weaker model may give them some advantages WRT
> scalability, I don't think it'll ultimately be enough to overcome the
> difficulty in writing correct low-level code for them.
>
> Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul
RE: About gsoc 2014 OpenMP 4.0 Projects
Hi Güray,

There were two announcements: the PTX back end and OpenCL code
generation. The initial PTX patches can be found on the mailing list,
and the OpenCL experiments are on the openacc_1-0_branch.

Regarding GSoC, it would be nice if you applied with your proposal on
code generation. I think projects aimed at improving OpenCL code
generation or at implementing a SPIR back end are going to be useful for
GCC.

-
Thanks,
Evgeny.