Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Doug Gilmore writes:
> On 02/24/2014 10:42 AM, Richard Sandiford wrote:
>> ...
>>> AIUI the old form never really worked reliably due to things like
>>> newlib's setjmp not preserving the odd-numbered registers, so it
>>> doesn't seem worth keeping around. Also, the old form is identified
>>> by the GNU attribute (4, 4) so it'd be easy for the linker to reject
>>> links between the old and the new form.
>>>
>>> That is true. You will have noticed a number of changes over recent
>>> months to start fixing fp64 as currently defined, but having found
>>> this new solution such fixes are no longer important. The lack of
>>> support for gp32 fp64 in Linux is a further reason to permit
>>> redefining it. Would you be happy to retain the same builtin defines
>>> for FP64 if changing its behaviour (i.e. __mips_fpr=64)?
>>
>> I think that should be OK. I suppose a natural follow-on question
>> is what __mips_fpr should be for -mfpxx. Maybe just 0?
>
> I think we should think carefully about just making -mfp64 disappear.
> The support has existed for bare iron for quite a while, and we do
> internal testing of MSA using -mfp64. I'd rather avoid a flag day.
> It would be good to continue recognizing that object files with
> attribute (4, 4) (-mfp64) are not compatible with other objects.

Right, that was the idea. (4, 4) would always mean the current form of
-mfp64 and the linker would reject links between (4, 4) and the new
-mfp64 form.

The flag day was more on the GCC and GAS side. I don't see the point in
supporting both forms there at the same time, since it significantly
complicates the interface and since AIUI the old form was never really
suitable for production use.

Thanks,
Richard
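As a quick illustration of the builtin define being discussed, here is
how source code might test __mips_fpr under the proposal, assuming the
value 64 is retained for the new -mfp64 form and the suggested value 0
is used for -mfpxx (the -mfpxx value is only a suggestion at this point):

    /* Dispatch on the FP register model at compile time.
       __mips_fpr == 32 -> FR0 (32-bit FP registers)
       __mips_fpr == 64 -> FR1 (64-bit FP registers, new -mfp64 form)
       __mips_fpr == 0  -> proposed value for modeless -mfpxx code  */
    #if !defined(__mips_fpr) || __mips_fpr == 32
      /* Traditional FR0 code: odd-numbered singles are usable.  */
    #elif __mips_fpr == 64
      /* FR1 code: all 32 double-precision registers are available.  */
    #elif __mips_fpr == 0
      /* Modeless code: stay within the FR0/FR1 common subset.  */
    #endif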
RE: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Richard Sandiford writes:
> Doug Gilmore writes:
>> On 02/24/2014 10:42 AM, Richard Sandiford wrote:
>>> ...
>>>> AIUI the old form never really worked reliably due to things like
>>>> newlib's setjmp not preserving the odd-numbered registers, so it
>>>> doesn't seem worth keeping around. Also, the old form is identified
>>>> by the GNU attribute (4, 4) so it'd be easy for the linker to reject
>>>> links between the old and the new form.
>>>>
>>>> That is true. You will have noticed a number of changes over recent
>>>> months to start fixing fp64 as currently defined, but having found
>>>> this new solution such fixes are no longer important. The lack of
>>>> support for gp32 fp64 in Linux is a further reason to permit
>>>> redefining it. Would you be happy to retain the same builtin defines
>>>> for FP64 if changing its behaviour (i.e. __mips_fpr=64)?
>>>
>>> I think that should be OK. I suppose a natural follow-on question is
>>> what __mips_fpr should be for -mfpxx. Maybe just 0?
>>
>> I think we should think carefully about just making -mfp64 disappear.
>> The support has existed for bare iron for quite a while, and we do
>> internal testing of MSA using -mfp64. I'd rather avoid a flag day.
>> It would be good to continue recognizing that object files with
>> attribute (4, 4) (-mfp64) are not compatible with other objects.
>
> Right, that was the idea. (4, 4) would always mean the current form of
> -mfp64 and the linker would reject links between (4, 4) and the new
> -mfp64 form.
>
> The flag day was more on the GCC and GAS side. I don't see the point
> in supporting both forms there at the same time, since it significantly
> complicates the interface and since AIUI the old form was never really
> suitable for production use.

That sounds OK to me.

I'm aiming to have an experimental implementation of the calling
convention changes as soon as possible, although I am having
difficulties getting the frx calling convention working correctly.

The problem is that frx needs to treat registers as 64-bit sometimes and
32-bit at other times:

a) I need the aliasing that 32-bit registers give me (use of an
even-numbered double clobbers the corresponding odd-numbered single).
This is to prevent both the double and the odd-numbered single being
used simultaneously.

b) I need the 64-bit register layout to ensure that 64-bit values in
caller-saved registers are saved as 64-bit (rather than 2x32-bit) and
that 32-bit registers are saved as 32-bit and never combined into a
64-bit save. caller-save.c flattens the caller-save problem down to
look at only hard registers, not modes, which is frustrating.

It looks like caller-save.c would need a lot of work to achieve b) with
32-bit hard registers, but I equally don't know how I could achieve a)
for 64-bit registers. I suspect a) is marginally easier to solve in the
end, but I would have to find a way to say that using register x as
64-bit prevents allocation of x+1 as 32-bit despite registers being
64-bit. The easy option is to go for 64-bit registers and never use
odd-numbered registers for single precision or double precision, but I
don't really want frx to be limited to that if at all possible. Any
suggestions?

The special handling for callee-saved registers is not a problem (I
think) as it is all backend code (assuming a or b is resolved).

Regards,
Matthew
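For readers unfamiliar with the FR0 register model, the aliasing in
point a) can be sketched in plain C: an even/odd pair of 32-bit FPRs
overlays the storage of one double, so writing the double clobbers the
odd single. This models only the hardware register layout, not the GCC
allocation problem itself:

    #include <stdio.h>

    /* FR0 layout: the even single, the odd single and the
       even-numbered double all share one 64-bit storage cell.  */
    union fr0_pair {
      double d;     /* even-numbered double                  */
      float  s[2];  /* s[0] = even single, s[1] = odd single */
    };

    int main(void)
    {
      union fr0_pair p;
      p.s[1] = 1.0f;           /* value live in the odd single   */
      p.d    = 2.0;            /* writing the double clobbers it */
      printf("%f\n", p.s[1]);  /* no longer prints 1.000000      */
      return 0;
    }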
About gsoc 2014 OpenMP 4.0 Projects
Hello,

I'm a master's student in high-performance computing at the Barcelona
Supercomputing Center, and I'm working on my thesis on implementing the
OpenMP accelerator model in our compiler (OmpSs). I have almost finished
implementing all of the new directives to generate CUDA code, and the
corresponding OpenCL implementation shouldn't take much longer given my
design. But I haven't yet tried Intel MIC, APUs, or other hardware
accelerators :)

I'm now benchmarking the kernel code generated by my compiler. Although
the generated kernels are fairly naive, the speedups are not bad: when I
compare results against the HMPP OpenACC 3.2.x compiler, the speedups
are almost the same, and in some cases my results are slightly better.
That's why this term I am going to work on compiler-level and
runtime-level optimizations for GPUs.

When I looked at the GCC OpenMP 4.0 project ideas, I couldn't see
anything about code generation. Are you going to announce that later?
Or should I apply to GSoC with my own idea about code generation and
device code optimizations?

Güray Özen
~grypp
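For context, here is a minimal example of the OpenMP 4.0
accelerator-model directives under discussion; this is standard OpenMP
4.0 syntax, not code from either compiler, and the loop body is the kind
of region a compiler would lower to a CUDA or OpenCL kernel:

    /* Offload a vector addition to the default accelerator device.  */
    void vadd(int n, const float *a, const float *b, float *c)
    {
      #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
      #pragma omp teams distribute parallel for
      for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }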
Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Matthew Fortune writes:
>>>> If we do end up using ELF flags then maybe adding two new EF_MIPS_ABI
>>>> enums would be better. It's more likely to be trapped by old loaders
>>>> and avoids eating up those precious remaining bits.
>>>
>>> Sounds reasonable, but I'm still trying to determine how this
>>> information can be propagated from loader to dynamic loader.
>>
>> The dynamic loader has access to the ELF headers so I didn't think it
>> would need any help.
>
> As I understand it the dynamic loader only has specific access to the
> program headers of the executable, not the ELF headers. There is no
> question that the dynamic loader has access to DSO ELF headers, but we
> need the start point too.

Sorry, forgot about that. In that case maybe program headers would be
best, like you say. I.e. we could use a combination of GNU attributes
and a new program header, with the program header hopefully being more
general than for just this case. I suppose this comes back to the
thread from binutils@ last year about how to manage the dwindling number
of free flags:

https://www.sourceware.org/ml/binutils/2013-09/msg00039.html
to https://www.sourceware.org/ml/binutils/2013-09/msg00099.html

>>>> You didn't say specifically how a static program's crt code would
>>>> know whether it was linked as modeless or in a specific FR mode.
>>>> Maybe the linker could define a special hidden symbol?
>>>
>>> Why do you say crt rather than dlopen? The mode requirement should
>>> only matter if you want to change it, and dlopen should be able to
>>> access information in the same way that a dynamic linker would. It
>>> may seem redundant, but perhaps we end up having to mark an executable
>>> with mode requirements in two ways: the primary one being the ELF flag
>>> and the secondary one being a processor-specific program header. The
>>> ELF flags are easy to use/already used for the program loader and when
>>> scanning the needs of an object being loaded, but the program header
>>> is something that is easy to inspect for an already-loaded object.
>>> Overall though, a new program header would be sufficient in all cases,
>>> with a few modifications here and there.
>>
>> Sorry, what I meant was: how would an executable built with -static be
>> handled? And I was assuming it would be up to the executable's startup
>> code to set the FR mode. That startup code (from glibc) would normally
>> be modeless itself but would need to know whether any FR0 or FR1
>> objects were linked in. (FWIW ifuncs have a similar problem: without
>> the loader to help, the startup code has to resolve the ifuncs itself.
>> The static linker defines special symbols around a block of IRELATIVE
>> relocs and then the startup code applies those relocs in a similar way
>> to the dynamic linker. I was thinking a linker-defined symbol could be
>> used to record the FR mode too.)
>>
>> But perhaps you were thinking of getting the kernel to set the FR mode
>> instead?
>
> I was thinking the kernel would set an initial FR mode that was at
> least compatible with the ELF flags. Do you feel all this should be
> done in user space? We only get user-mode FR control in MIPS r5 so
> this would make it more challenging to get into FR1 mode for MIPS32r2.
> I'd prefer not to be able to load an FR1 program than crash in the crt
> while trying to turn it on. There is however some expectation that the
> kernel would trap and emulate UFR on MIPS32r2 for the dynamic loader
> case anyway.

Right -- the kernel needs to let userspace change FR if the dynamic
loader case is going to work. And I think if it's handled by userspace
for dynamic executables then it should be handled by userspace for
static ones too. Especially since the mechanism used for static
executables would then be the same as for bare metal, meaning that we
only really have two cases rather than three.

> Is it OK to continue these library-related discussions here or should I
> split the bare metal handling to newlib and the Linux libraries to
> glibc? There is value in keeping things together but equally it is
> perhaps off topic.

Not sure TBH, but no one's complained so far :-)

Thanks,
Richard
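For reference, here is a sketch of the static-link ifunc mechanism
Richard describes, as it works on a RELA target; the
__rela_iplt_start/__rela_iplt_end symbols are the ones glibc's static
startup code actually uses, while __mips_fr_mode is purely hypothetical,
standing in for the proposed linker-defined FR-mode marker:

    #include <link.h>  /* ElfW() */

    /* Linker-defined bounds of the IRELATIVE relocations.  */
    extern const ElfW(Rela) __rela_iplt_start[], __rela_iplt_end[];

    /* Hypothetical linker-defined symbol recording the link-time FR
       requirement; illustration only, not an existing ABI symbol.  */
    extern const int __mips_fr_mode;

    static void apply_irel(void)
    {
      const ElfW(Rela) *r;
      for (r = __rela_iplt_start; r < __rela_iplt_end; r++)
        {
          /* The addend is the address of the ifunc resolver: call it
             and store the resolved address at the relocation target. */
          ElfW(Addr) *reloc_addr = (ElfW(Addr) *) r->r_offset;
          *reloc_addr = ((ElfW(Addr) (*)(void)) r->r_addend)();
        }
    }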
Re: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
Matthew Fortune writes:
> That sounds OK to me.
>
> I'm aiming to have an experimental implementation of the calling
> convention changes as soon as possible, although I am having
> difficulties getting the frx calling convention working correctly.
>
> The problem is that frx needs to treat registers as 64-bit sometimes
> and 32-bit at other times:
>
> a) I need the aliasing that 32-bit registers give me (use of an
> even-numbered double clobbers the corresponding odd-numbered single).
> This is to prevent both the double and the odd-numbered single being
> used simultaneously.
>
> b) I need the 64-bit register layout to ensure that 64-bit values in
> caller-saved registers are saved as 64-bit (rather than 2x32-bit) and
> that 32-bit registers are saved as 32-bit and never combined into a
> 64-bit save. caller-save.c flattens the caller-save problem down to
> look at only hard registers, not modes, which is frustrating.
>
> It looks like caller-save.c would need a lot of work to achieve b) with
> 32-bit hard registers, but I equally don't know how I could achieve a)
> for 64-bit registers. I suspect a) is marginally easier to solve in
> the end, but I would have to find a way to say that using register x as
> 64-bit prevents allocation of x+1 as 32-bit despite registers being
> 64-bit. The easy option is to go for 64-bit registers and never use
> odd-numbered registers for single precision or double precision, but I
> don't really want frx to be limited to that if at all possible. Any
> suggestions?

Treating it as a limited form of FR0 mode seems best. I don't think
there's any practical way of doing (a) without making HARD_REGNO_NREGS
be 2 for a DFmode FPR, at which point any wrong assumptions about paired
registers in caller-save.c would kick in.

We'd only be making this change in the next release cycle, and we really
should look to move to LRA for that cycle too. caller-save.c is specific
to reload and so wouldn't be a problem. Of course, you might need to do
stuff in LRA instead.

Thanks,
Richard
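A sketch of what that might look like as a target macro; the
TARGET_FLOATXX name is invented for illustration, and the real MIPS
definition goes through a helper function with more cases:

    /* Number of consecutive hard registers needed for a value of MODE
       starting at REGNO.  Treating FPRs as 32 bits wide makes a DFmode
       value occupy an even/odd pair, so the allocator sees the
       clobbering overlap described in point a).  */
    #define HARD_REGNO_NREGS(REGNO, MODE)                          \
      (FP_REG_P (REGNO) && (TARGET_FLOATXX || !TARGET_FLOAT64)     \
       ? (GET_MODE_SIZE (MODE) + 3) / 4                            \
       : (GET_MODE_SIZE (MODE) + UNITS_PER_WORD - 1) / UNITS_PER_WORD)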
[GSoC] GCC has been accepted to GSoC 2014
Hi All,

GCC has been accepted as a mentoring organization for Google Summer of
Code 2014, and we are off to the races!

If you want to be a GCC GSoC student, check out the project ideas page
at http://gcc.gnu.org/wiki/SummerOfCode . Feel free to ask questions on
IRC [1] and get in touch with your potential mentors. If you are not
sure who to contact, send me an email at maxim.kuvyr...@linaro.org.

If you are a GCC developer, create a profile at
http://www.google-melange.com/gsoc/homepage/google/gsoc2014 to be able
to rank student applications. Once registered, connect with the "GCC -
GNU Compiler Collection" organization. If you actively want to mentor a
student project, note so in your GSoC connection request.

If you have any questions or comments, please contact your friendly GSoC
admin via IRC (maximk), email (maxim.kuvyr...@linaro.org) or
Skype/Hangouts.

Thank you,

[1] irc://irc.oftc.net/#gcc

--
Maxim Kuvyrkov
www.linaro.org
RE: [RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking
> Matthew Fortune writes:
>>>>>> If we do end up using ELF flags then maybe adding two new
>>>>>> EF_MIPS_ABI enums would be better. It's more likely to be trapped
>>>>>> by old loaders and avoids eating up those precious remaining bits.
>>>>>
>>>>> Sounds reasonable, but I'm still trying to determine how this
>>>>> information can be propagated from loader to dynamic loader.
>>>>
>>>> The dynamic loader has access to the ELF headers so I didn't think
>>>> it would need any help.
>>>
>>> As I understand it the dynamic loader only has specific access to the
>>> program headers of the executable, not the ELF headers. There is no
>>> question that the dynamic loader has access to DSO ELF headers, but
>>> we need the start point too.
>>
>> Sorry, forgot about that. In that case maybe program headers would be
>> best, like you say. I.e. we could use a combination of GNU attributes
>> and a new program header, with the program header hopefully being more
>> general than for just this case. I suppose this comes back to the
>> thread from binutils@ last year about how to manage the dwindling
>> number of free flags:
>>
>> https://www.sourceware.org/ml/binutils/2013-09/msg00039.html
>> to https://www.sourceware.org/ml/binutils/2013-09/msg00099.html
>>
>>>>>> You didn't say specifically how a static program's crt code would
>>>>>> know whether it was linked as modeless or in a specific FR mode.
>>>>>> Maybe the linker could define a special hidden symbol?
>>>>>
>>>>> Why do you say crt rather than dlopen? The mode requirement should
>>>>> only matter if you want to change it, and dlopen should be able to
>>>>> access information in the same way that a dynamic linker would. It
>>>>> may seem redundant, but perhaps we end up having to mark an
>>>>> executable with mode requirements in two ways: the primary one
>>>>> being the ELF flag and the secondary one being a processor-specific
>>>>> program header. The ELF flags are easy to use/already used for the
>>>>> program loader and when scanning the needs of an object being
>>>>> loaded, but the program header is something that is easy to inspect
>>>>> for an already-loaded object. Overall though, a new program header
>>>>> would be sufficient in all cases, with a few modifications here and
>>>>> there.
>>>>
>>>> Sorry, what I meant was: how would an executable built with -static
>>>> be handled? And I was assuming it would be up to the executable's
>>>> startup code to set the FR mode. That startup code (from glibc)
>>>> would normally be modeless itself but would need to know whether any
>>>> FR0 or FR1 objects were linked in. (FWIW ifuncs have a similar
>>>> problem: without the loader to help, the startup code has to resolve
>>>> the ifuncs itself. The static linker defines special symbols around
>>>> a block of IRELATIVE relocs and then the startup code applies those
>>>> relocs in a similar way to the dynamic linker. I was thinking a
>>>> linker-defined symbol could be used to record the FR mode too.)
>>>>
>>>> But perhaps you were thinking of getting the kernel to set the FR
>>>> mode instead?
>>>
>>> I was thinking the kernel would set an initial FR mode that was at
>>> least compatible with the ELF flags. Do you feel all this should be
>>> done in user space? We only get user-mode FR control in MIPS r5 so
>>> this would make it more challenging to get into FR1 mode for
>>> MIPS32r2. I'd prefer not to be able to load an FR1 program than
>>> crash in the crt while trying to turn it on. There is however some
>>> expectation that the kernel would trap and emulate UFR on MIPS32r2
>>> for the dynamic loader case anyway.
>>
>> Right -- the kernel needs to let userspace change FR if the dynamic
>> loader case is going to work. And I think if it's handled by
>> userspace for dynamic executables then it should be handled by
>> userspace for static ones too. Especially since the mechanism used
>> for static executables would then be the same as for bare metal,
>> meaning that we only really have two cases rather than three.

Although the dynamic case does mean mode switching must be possible at
user level, I do think it is important for the OS and the bare metal crt
to prepare an environment that is suitable for the original program,
including setting an appropriate FR mode. I would use the existing
support in Linux and bare metal for getting the FR mode correct for O32
vs N[32|64] as a basis for this. This initial guarantee would be quite
helpful, especially for a statically linked Linux userland, which simply
wouldn't need to worry.

I can understand the desire to keep the number of mechanisms for setting
the FR mode to a minimum, but the fact that bare metal runs privileged
and Linux userland runs unprivileged says to me that they will naturally
take different paths on some of this. There are other aspects such as
whether the kernel informs userland that UFR is available or not, via
HWCAPs, and consideration over what point we would want to see a failure
when mode requ
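As an illustration of the HWCAP route mentioned above, userland could
test for user-mode FR control roughly as follows; HWCAP_MIPS_UFR is a
hypothetical name and bit position, since no such flag had been
allocated at the time:

    #include <sys/auxv.h>

    /* Hypothetical HWCAP bit for user-mode FR (UFR) support.  */
    #define HWCAP_MIPS_UFR (1UL << 0)

    /* Nonzero if the kernel advertises user-mode FR control.  */
    static int ufr_available(void)
    {
      return (getauxval(AT_HWCAP) & HWCAP_MIPS_UFR) != 0;
    }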
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds wrote:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered
>
> Btw, don't get me wrong. I don't _like_ it not being ordered, and I
> actually did spend some time thinking about my earlier proposal on
> strengthening the 'consume' ordering.

Understood.

> I have for the last several years been 100% convinced that the Intel
> memory ordering is the right thing, and that people who like weak
> memory ordering are wrong and should try to avoid reproducing if at
> all possible. But given that we have memory orderings like Power and
> ARM, I don't actually see a sane way to get a good strong ordering.
> You can teach compilers about cases like the above when they actually
> see all the code and they could poison the value chain etc. But it
> would be fairly painful, and once you cross object files (or even just
> functions in the same compilation unit, for that matter), it goes from
> painful to just "ridiculously not worth it".

And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years. ;-)

> So I think the C semantics should mirror what the hardware gives us -
> and do so even in the face of reasonable optimizations - not try to do
> something else that requires compilers to treat "consume" very
> differently.

I am sure that a great many people would jump for joy at the chance to
drop any and all RCU-related verbiage from the C11 and C++11 standards.
(I know, you aren't necessarily advocating this, but given what you say
above, I cannot think what verbiage would remain.)

The thing that makes me very nervous is how much the definition of
"reasonable optimization" has changed. For example, before the 2.6.10
Linux kernel, we didn't even apply volatile semantics to fetches of
RCU-protected pointers -- and as far as I know, never needed to. But
since then, there have been several cases where the compiler happily
hoisted a normal load out of a surprisingly large loop.

Hardware advances can come into play as well. For example, my very
first RCU work back in the early '90s was on a parallel system whose
CPUs had no branch-prediction hardware (80386 or 80486, I don't remember
which). Now people talk about compilers using branch-prediction
hardware to implement value-speculation optimizations. Five or ten
years from now, who knows what crazy optimizations might be considered
completely reasonable?

Are ARM and Power really the bad boys here? Or are they instead playing
the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering.
> You can re-order as much as you want in hardware with speculation etc.,
> but you should always *check* your speculation and make it *look* like
> you did everything in order. Which is pretty much the Intel memory
> ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when
first presenting RCU to the Digital UNIX folks long ago, I do have some
sympathy with this line of thought. But as you say, it is not the world
we currently live in.

Of course, in the final analysis, your kernel, your call.

Thanx, Paul
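A concrete instance of the load-hoisting problem Paul describes, using
the kernel's ACCESS_ONCE() macro to force a volatile access:

    /* From include/linux/compiler.h: force exactly one access.  */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    extern int need_to_stop;

    /* Broken: the compiler may see that the loop never writes
       need_to_stop, hoist the load, and spin on a stale value.  */
    void wait_broken(void)
    {
      while (!need_to_stop)
        ; /* do work */
    }

    /* Fixed: the volatile access forces a fresh load per iteration.  */
    void wait_fixed(void)
    {
      while (!ACCESS_ONCE(need_to_stop))
        ; /* do work */
    }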
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
>
> So let me see if I understand your reasoning. My best guess is that it
> goes something like this:
>
> 1. The Linux kernel contains code that passes pointers from
>    rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well to
what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it used
to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming. Ignore C++ for a while (Jesus Xavier
Christ, I've had to do C++ programming for subsurface), and just think
about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation is,
and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are bad
abstractions.

And ok, so this is obviously Linux-specific in that it's generally only
Linux where I really care about the code generation, but I do think it's
a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I
> can think of is a big reason for my insistence on the
> carries-a-dependency crap. My lack of optimization omniscience makes
> me very nervous about relying on there never ever being a reasonable
> way of computing a given result without preserving the ordering.

But if I can give two clear examples that are basically identical from a
syntactic standpoint, and one clearly can be trivially optimized to the
point where the ordering guarantee goes away, and the other cannot, and
you cannot describe the difference, then I think your description is
seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal is
some kind of completeness of the language that is about the language,
not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless the
> compiler is told about it, which is of course the whole point of
> [[carries_dependency]]. Understood, though, the Linux kernel currently
> does not have anything that could reasonably automatically generate
> those [[carries_dependency]] attributes. (Or are there other reasons
> why you believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not
   matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually meant
   to work so that the current carries-a-dependency works even for
   insane expressions like "a-a".

In practice, it's one of those things where I guess nobody actually
would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
>     static struct foo variable1;
>     static struct foo variable2;
>     static struct foo *pp = &variable1;
>
>     T1: initialize_foo(&variable2);
>         atomic_store_explicit(&pp, &variable2, memory_order_release);
>         /* The above is the only store to pp in this translation unit,
>          * and the address of pp is not exported in any way.
>          */
>
>     T2: if (p == &variable1)
>             return p->val1; /* Must be variable1.val1. */
>         else
>             return p->val2; /* Must be variable2.val2. */
>
> My guess is that you
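For reference, litmus test 1 spelled out in real C11 syntax (the
thread's atomic_read(pp, consume) is shorthand); any C11 compiler with
<stdatomic.h> accepts this:

    #include <stdatomic.h>

    struct foo { int val; };
    struct foo variable;
    _Atomic(struct foo *) pp;

    int reader(void)
    {
      struct foo *p = atomic_load_explicit(&pp, memory_order_consume);
      if (p == &variable)
        /* Linus's point: a compiler may rewrite this as
           "return variable.val", loading through &variable instead of
           p and thereby breaking the address dependency.  */
        return p->val;
      return -1;
    }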
Re: [RFC][PATCH 0/5] arch: atomic rework
Paul E. McKenney wrote:
> Linus Torvalds wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible.
>
> Are ARM and Power really the bad boys here? Or are they instead
> playing the role of the canary in the coal mine?

To paraphrase some older threads, I think Linus's argument is that weak
memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but it ends up being no help to a more
aggressive implementation.

Branch delay slots give a one-cycle bonus to in-order cores, but once
you go superscalar and add branch prediction, they stop helping, and
once you go full out-of-order, they're just an annoyance.

Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.

Another thing that requires all the strong-coherency machinery is a
high-performance implementation of the various memory barrier and
synchronization operations. Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.

Once you're there, strong coherency doesn't actually cost you any time
outside of critical synchronization code, and it both simplifies and
speeds up the tricky synchronization software.

So PPC and ARM's weak ordering are not the direction the future is
going. Rather, weak ordering is something that's only useful in a
limited technology window, which is rapidly passing.

If you can find someone at IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights. The big question is whether strong ordering, once you've
accepted the implementation complexity and area, actually costs anything
in execution time. If there's an unavoidable cost which weak ordering
saves, that's significant.
Re: [RFC][PATCH 0/5] arch: atomic rework
On 02/25/14 17:15, Paul E. McKenney wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible. But given that we have memory orderings like Power and
>> ARM, I don't actually see a sane way to get a good strong ordering.
>> You can teach compilers about cases like the above when they actually
>> see all the code and they could poison the value chain etc. But it
>> would be fairly painful, and once you cross object files (or even
>> just functions in the same compilation unit, for that matter), it
>> goes from painful to just "ridiculously not worth it".
>
> And I have indeed seen a post or two from you favoring stronger memory
> ordering over the past few years. ;-)

I couldn't agree more.

> Are ARM and Power really the bad boys here? Or are they instead
> playing the role of the canary in the coal mine?

That's a question I've been struggling with recently as well. I suspect
they (ARM, Power) are going to be the outliers rather than the canary.
While the weaker model may give them some advantages WRT scalability, I
don't think it'll ultimately be enough to overcome the difficulty in
writing correct low-level code for them.

Regardless, they're here and we have to deal with them.

Jeff
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney wrote:
>>
>> So let me see if I understand your reasoning. My best guess is that
>> it goes something like this:
>>
>> 1. The Linux kernel contains code that passes pointers from
>>    rcu_dereference() through external functions.
>
> No, actually, it's not so much Linux-specific at all.
>
> I'm actually thinking about what I'd do as a compiler writer, and as a
> defender of the "C is a high-level assembler" concept.
>
> I love C. I'm a huge fan. I think it's a great language, and I think
> it's a great language not because of some theoretical issues, but
> because it is the only language around that actually maps fairly well
> to what machines really do.
>
> And it's a *simple* language. Sure, it's not quite as simple as it
> used to be, but look at how thin the "K&R book" is. Which pretty much
> describes it - still.
>
> That's the real strength of C, and why it's the only language serious
> people use for system programming. Ignore C++ for a while (Jesus
> Xavier Christ, I've had to do C++ programming for subsurface), and
> just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990. It was a lot
smaller then.

> I can look at C code, and I can understand what the code generation
> is, and what it will really *do*. And I think that's important.
> Abstractions that hide what the compiler will actually generate are
> bad abstractions.
>
> And ok, so this is obviously Linux-specific in that it's generally
> only Linux where I really care about the code generation, but I do
> think it's a bigger issue too.
>
> So I want C features to *map* to the hardware features they implement.
> The abstractions should match each other, not fight each other.

OK...

>> Actually, the fact that there are more potential optimizations than I
>> can think of is a big reason for my insistence on the
>> carries-a-dependency crap. My lack of optimization omniscience makes
>> me very nervous about relying on there never ever being a reasonable
>> way of computing a given result without preserving the ordering.
>
> But if I can give two clear examples that are basically identical from
> a syntactic standpoint, and one clearly can be trivially optimized to
> the point where the ordering guarantee goes away, and the other
> cannot, and you cannot describe the difference, then I think your
> description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case. Yes, I did notice that you find that
unacceptable.

> And I do *not* think the C language should be defined by how it can be
> described. Leave that to things like Haskell or LISP, where the goal
> is some kind of completeness of the language that is about the
> language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in. I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

>>> So the code sequence I already mentioned is *not* ordered:
>>>
>>> Litmus test 1:
>>>
>>>     p = atomic_read(pp, consume);
>>>     if (p == &variable)
>>>         return p->val;
>>>
>>> is *NOT* ordered, because the compiler can trivially turn this into
>>> "return variable.val", and break the data dependency.
>>
>> Right, given your model, the compiler is free to produce code that
>> doesn't order the load from pp against the load from p->val.
>
> Yes. Note also that that is what existing compilers would actually do.
>
> And they'd do it "by mistake": they'd load the address of the variable
> into a register, and then compare the two registers, and then end up
> using _one_ of the registers as the base pointer for the "p->val"
> access, but I can almost *guarantee* that there are going to be
> sequences where some compiler will choose one register over the other
> based on some random detail.
>
> So my model isn't just a "model", it also happens to describe reality.

Sounds to me like your model -is- reality. I believe that it is useful
to constrain reality from time to time, but understand that you
vehemently disagree.

>> Indeed, it won't work across different compilation units unless the
>> compiler is told about it, which is of course the whole point of
>> [[carries_dependency]]. Understood, though, the Linux kernel
>> currently does not have anything that could reasonably automatically
>> generate those [[carries_dependency]] attributes. (Or are there
>> other reasons why you believe [[carries_dependency]] is problematic?)
>
> So I think carries_dependency is problematic because:
>
>  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

>  - it requires the programmer to solve the problem of the standard not
>    matching the hardware.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
> Paul E. McKenney wrote:
>> Linus Torvalds wrote:
>>> I have for the last several years been 100% convinced that the Intel
>>> memory ordering is the right thing, and that people who like weak
>>> memory ordering are wrong and should try to avoid reproducing if at
>>> all possible.
>>
>> Are ARM and Power really the bad boys here? Or are they instead
>> playing the role of the canary in the coal mine?
>
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a
> simple implementation simpler, but it ends up being no help to a more
> aggressive implementation.
>
> Branch delay slots give a one-cycle bonus to in-order cores, but once
> you go superscalar and add branch prediction, they stop helping, and
> once you go full out-of-order, they're just an annoyance.
>
> Likewise, I can see the point that weak ordering can help make a
> simple cache interface simpler, but once you start doing speculative
> loads, you've already bought and paid for all the hardware you need
> to do stronger coherency.
>
> Another thing that requires all the strong-coherency machinery is a
> high-performance implementation of the various memory barrier and
> synchronization operations. Yes, a low-performance (drain the
> pipeline) implementation is tolerable if the instructions aren't used
> frequently, but once you're really trying, it doesn't save complexity.
>
> Once you're there, strong coherency doesn't actually cost you any time
> outside of critical synchronization code, and it both simplifies and
> speeds up the tricky synchronization software.
>
> So PPC and ARM's weak ordering are not the direction the future is
> going. Rather, weak ordering is something that's only useful in a
> limited technology window, which is rapidly passing.

That does indeed appear to be Intel's story. Might well be correct.
Time will tell.

> If you can find someone at IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights. The big question is whether strong ordering, once you've
> accepted the implementation complexity and area, actually costs
> anything in execution time. If there's an unavoidable cost which weak
> ordering saves, that's significant.

There has been a lot of ink spilled on this argument. ;-) PPC has much
larger CPU counts than does the mainframe. On the other hand, there are
large x86 systems. Some claim that there are differences in latency due
to the different approaches, and there could be a long argument about
whether all this is inherent in the memory ordering or whether it is due
to implementation issues. I don't claim to know the answer. I do know
that ARM and PPC are here now, and that I need to deal with them.

Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
> On 02/25/14 17:15, Paul E. McKenney wrote:
>>> I have for the last several years been 100% convinced that the Intel
>>> memory ordering is the right thing, and that people who like weak
>>> memory ordering are wrong and should try to avoid reproducing if at
>>> all possible. But given that we have memory orderings like Power
>>> and ARM, I don't actually see a sane way to get a good strong
>>> ordering. You can teach compilers about cases like the above when
>>> they actually see all the code and they could poison the value chain
>>> etc. But it would be fairly painful, and once you cross object files
>>> (or even just functions in the same compilation unit, for that
>>> matter), it goes from painful to just "ridiculously not worth it".
>>
>> And I have indeed seen a post or two from you favoring stronger
>> memory ordering over the past few years. ;-)
>
> I couldn't agree more.
>
>> Are ARM and Power really the bad boys here? Or are they instead
>> playing the role of the canary in the coal mine?
>
> That's a question I've been struggling with recently as well. I
> suspect they (ARM, Power) are going to be the outliers rather than the
> canary. While the weaker model may give them some advantages WRT
> scalability, I don't think it'll ultimately be enough to overcome the
> difficulty in writing correct low-level code for them.
>
> Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul
RE: About gsoc 2014 OpenMP 4.0 Projects
Hi Güray,

There were two announcements: the PTX back end and OpenCL code
generation. The initial PTX patches can be found on the mailing list,
and the OpenCL experiments are on the openacc_1-0_branch.

Regarding GSoC, it would be nice if you applied with your proposal on
code generation. I think projects aimed at improving OpenCL code
generation or at implementing a SPIR back end are going to be useful for
GCC.

-
Thanks,
Evgeny.