RE: negative latencies
Is it the case of code speculation where the negative latencies are used?

Thanks & Regards
Ajit

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of shmeel gutl
Sent: Monday, May 19, 2014 12:23 PM
To: Andrew Pinski
Cc: gcc@gcc.gnu.org; Vladimir Makarov
Subject: Re: negative latencies

On 19-May-14 09:39 AM, Andrew Pinski wrote:
> On Sun, May 18, 2014 at 11:13 PM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other
>> words, an architecture that permits an instruction to use a result
>> from an instruction that will be issued later.
>
> Do you mean bypasses? If so there is a bypass feature which you can use:
> https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#index-data-bypass-3773
>
> Thanks,
> Andrew Pinski

Unfortunately, bypasses in the pipeline description are not enough. They
only allow you to calculate the latency of true dependencies, and they are
forced to be zero or greater. The real question is how the scheduler and
register allocator can deal with negative latencies.

Thanks
Shmeel

>> At first glance it seems that it will break a few things.
>> 1) The definition of dependencies cannot come from the simple
>>    ordering of rtl.
>> 2) The scheduling problem starts to look like "get off the train 3
>>    stops before me".
>> 3) The definition of live ranges needs to use actual instruction
>>    timing information, not just instruction sequencing.
>>
>> The hooks in the scheduler seem to be enough to stop damage but not
>> enough to take advantage of this "feature".
>>
>> Thanks
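For readers who haven't used it, the bypass feature Andrew points at looks
roughly like this in a machine description. This is only a sketch: the
automaton, unit, and reservation names here are invented, and only
define_automaton, define_cpu_unit, define_insn_reservation and
define_bypass are the real constructs documented at the link above.

   ;; Sketch only -- names are invented, not from any real port.
   (define_automaton "toy")
   (define_cpu_unit "toy_pipe" "toy")

   (define_insn_reservation "toy_alu" 3
     (eq_attr "type" "alu")
     "toy_pipe")

   (define_insn_reservation "toy_mac" 3
     (eq_attr "type" "mac")
     "toy_pipe")

   ;; The result of an alu insn is forwarded to a dependent mac insn with
   ;; latency 1 instead of the default 3.  As noted in the thread, this
   ;; bypass latency still cannot be negative.
   (define_bypass 1 "toy_alu" "toy_mac")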
Re: Using particular register class (like floating point registers) as spill register class
On 05/16/2014 05:20 PM, Ian Bolton wrote:
>> On 05/16/2014 12:05 PM, Kugan wrote:
>>> On 16/05/14 20:40, pins...@gmail.com wrote:
>>>> On May 16, 2014, at 3:23 AM, Kugan wrote:
>>>>>
>>>>> I would like to know if there is any way we can use registers from a
>>>>> particular register class just as spill registers (in places where the
>>>>> register allocator would normally spill to stack and nothing more),
>>>>> when it can be useful.
>>>>>
>>>>> In AArch64, in some cases, compiling with -mgeneral-regs-only produces
>>>>> better performance compared to not using it. The difference here is
>>>>> that when -mgeneral-regs-only is not used, floating point registers
>>>>> are also used in register allocation. Then IRA/LRA has to move them to
>>>>> core registers before performing operations, as shown below.
>>>>
>>>> Can you show the code with fp registers disabled? Does it use the
>>>> stack to spill? Normally this is due to register to register class
>>>> costs compared to register to memory move cost. Also I think it
>>>> depends on the processor rather than the target. For thunder, using
>>>> the fp registers might actually be better than using the stack,
>>>> depending on whether the stack was in L1.
>>> Not all the LDR/STR combinations match to fmov. In the testcase I have,
>>>
>>> aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S -mgeneral-regs-only
>>> grep -c "ldr" sha_dgst.s
>>> 50
>>> grep -c "str" sha_dgst.s
>>> 42
>>> grep -c "fmov" sha_dgst.s
>>> 0
>>>
>>> aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S
>>> grep -c "ldr" sha_dgst.s
>>> 42
>>> grep -c "str" sha_dgst.s
>>> 31
>>> grep -c "fmov" sha_dgst.s
>>> 105
>>>
>>> I am not saying that we shouldn't use floating point registers here.
>>> But from the above, it seems like the register allocator is using them
>>> more like core registers (even though the cost model has higher cost)
>>> and then moving the values to core registers before operations. If that
>>> is the case, my question is, how do we make this just a spill register
>>> class so that we replace ldr/str with an equal number of fmov when it
>>> is possible?
>>
>> I'm also seeing stuff like this:
>>
>> => 0x7fb72a0928 <... Thread*)+2500>:   add   x21, x4, x21, lsl #3
>> => 0x7fb72a092c <... Thread*)+2504>:   fmov  w2, s8
>> => 0x7fb72a0930 <... Thread*)+2508>:   str   w2, [x21,#88]
>>
>> I guess GCC doesn't know how to store an SImode value in an FP register
>> into memory? This is 4.8.1.
>
> Please can you try that on trunk and report back.

OK, this is trunk, and I'm no longer seeing that happen.

However, I am seeing:

   0x007fb76dc82c <+160>:  adrp  x25, 0x7fb7c8
   0x007fb76dc830 <+164>:  add   x25, x25, #0x480
   0x007fb76dc834 <+168>:  fmov  d8, x0
   0x007fb76dc838 <+172>:  add   x0, x29, #0x160
   0x007fb76dc83c <+176>:  fmov  d9, x0
   0x007fb76dc840 <+180>:  add   x0, x29, #0xd8
   0x007fb76dc844 <+184>:  fmov  d10, x0
   0x007fb76dc848 <+188>:  add   x0, x29, #0xf8
   0x007fb76dc84c <+192>:  fmov  d11, x0

followed later by:

   0x007fb76dd224 <+2712>: fmov  x0, d9
   0x007fb76dd228 <+2716>: add   x6, x29, #0x118
   0x007fb76dd22c <+2720>: str   x20, [x0,w27,sxtw #3]
   0x007fb76dd230 <+2724>: fmov  x0, d10
   0x007fb76dd234 <+2728>: str   w28, [x0,w27,sxtw #2]
   0x007fb76dd238 <+2732>: fmov  x0, d11
   0x007fb76dd23c <+2736>: str   w19, [x0,w27,sxtw #2]

which seems a bit suboptimal, given that these double registers now have
to be saved in the prologue.
Re: Using particular register class (like floating point registers) as spill register class
On Mon, May 19, 2014 at 1:02 PM, Andrew Haley wrote:
> On 05/16/2014 05:20 PM, Ian Bolton wrote:
>> [...]
>>
>> Please can you try that on trunk and report back.
>
> OK, this is trunk, and I'm no longer seeing that happen.
>
> However, I am seeing:
>
>    0x007fb76dc82c <+160>:  adrp  x25, 0x7fb7c8
>    0x007fb76dc830 <+164>:  add   x25, x25, #0x480
>    0x007fb76dc834 <+168>:  fmov  d8, x0
>    0x007fb76dc838 <+172>:  add   x0, x29, #0x160
>    0x007fb76dc83c <+176>:  fmov  d9, x0
>    0x007fb76dc840 <+180>:  add   x0, x29, #0xd8
>    0x007fb76dc844 <+184>:  fmov  d10, x0
>    0x007fb76dc848 <+188>:  add   x0, x29, #0xf8
>    0x007fb76dc84c <+192>:  fmov  d11, x0
>
> followed later by:
>
>    0x007fb76dd224 <+2712>: fmov  x0, d9
>    0x007fb76dd228 <+2716>: add   x6, x29, #0x118
>    0x007fb76dd22c <+2720>: str   x20, [x0,w27,sxtw #3]
>    0x007fb76dd230 <+2724>: fmov  x0, d10
>    0x007fb76dd234 <+2728>: str   w28, [x0,w27,sxtw #2]
>    0x007fb76dd238 <+2732>: fmov  x0, d11
>    0x007fb76dd23c <+2736>: str   w19, [x0,w27,sxtw #2]
>
> which seems a bit suboptimal, given that these double registers now have
> to be saved in the prologue.

That looks a bit suspicious - is there a pre-processed file you can put on
bugzilla for someone to take a look at, with command line options et al?
I had a testcase that I was investigating a few days back, from a benchmark
that was doing SHA2 calculations. From my notes, I'd been playing with
REGISTER_MOVE_COST and MEMORY_MOVE_COST, and additionally the extra moves
appeared to disappear with -fno-schedule-insns. Remember, however, that on
AArch64 we don't have sched-pressure on by default.

regards
Ramana
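For anyone wanting to poke at those two observations, both flags are
standard GCC options and the file name is the one Kugan used earlier in the
thread; the exact invocations below are only a sketch of how one might
check, not commands taken from the discussion:

   # Sketch only: compare fmov counts with the scheduling knobs Ramana mentions.
   aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S -fno-schedule-insns
   grep -c "fmov" sha_dgst.s

   # -fsched-pressure turns on register-pressure-sensitive scheduling,
   # which is not enabled by default on AArch64 at this point.
   aarch64-none-linux-gnu-gcc sha_dgst.c -O2 -S -fsched-pressure
   grep -c "fmov" sha_dgst.s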
RE: Using particular register class (like floating point registers) as spill register class
>> Please can you try that on trunk and report back.
>
> OK, this is trunk, and I'm no longer seeing that happen.
>
> However, I am seeing:
>
>    0x007fb76dc82c <+160>:  adrp  x25, 0x7fb7c8
>    0x007fb76dc830 <+164>:  add   x25, x25, #0x480
>    0x007fb76dc834 <+168>:  fmov  d8, x0
>    0x007fb76dc838 <+172>:  add   x0, x29, #0x160
>    0x007fb76dc83c <+176>:  fmov  d9, x0
>    0x007fb76dc840 <+180>:  add   x0, x29, #0xd8
>    0x007fb76dc844 <+184>:  fmov  d10, x0
>    0x007fb76dc848 <+188>:  add   x0, x29, #0xf8
>    0x007fb76dc84c <+192>:  fmov  d11, x0
>
> followed later by:
>
>    0x007fb76dd224 <+2712>: fmov  x0, d9
>    0x007fb76dd228 <+2716>: add   x6, x29, #0x118
>    0x007fb76dd22c <+2720>: str   x20, [x0,w27,sxtw #3]
>    0x007fb76dd230 <+2724>: fmov  x0, d10
>    0x007fb76dd234 <+2728>: str   w28, [x0,w27,sxtw #2]
>    0x007fb76dd238 <+2732>: fmov  x0, d11
>    0x007fb76dd23c <+2736>: str   w19, [x0,w27,sxtw #2]
>
> which seems a bit suboptimal, given that these double registers now
> have to be saved in the prologue.

Thanks for doing that. Many AArch64 improvements have gone in since 4.8
was released.

I think we'd have to see the output for the whole function to determine
whether that code is sane. I don't suppose the source code is shareable,
or you have a testcase for this you can share?

Cheers,
Ian
Re: Offload Library
Hello Ian,

On 16 May 07:07, Ian Lance Taylor wrote:
> On Fri, May 16, 2014 at 4:47 AM, Kirill Yukhin wrote:
>>
>> To support the offloading features for Intel's Xeon Phi cards
>> we need to add a foreign library (liboffload) into the gcc repository.
>> README with build instructions is attached.
>
> Can you explain why this library should be part of GCC, and how GCC
> would use it? I'm sure it's obvious to you but it's not obvious to me.

Support for the 'target' clause of OpenMP 4.0, aka 'offloading', is
expected to be part of libgomp. Every target platform that will be
supported should implement a dedicated plugin for libgomp. The plugin for
Xeon Phi is based on the liboffload functionality. This library will also
provide compatibility with binaries built with ICC.

--
Thanks, K

> Ian
Re: Offload Library
Hello, Thomas!

On 16 May 19:30, Thomas Schwinge wrote:
> On Fri, 16 May 2014 15:47:58 +0400, Kirill Yukhin wrote:
>> To support the offloading features for Intel's Xeon Phi cards
>> we need to add a foreign library (liboffload) into the gcc repository.
>
> As written in the README, this library currently is specific to Intel
> hardware (understandably, of course), and I assume also in the future is
> to remain that way (?) -- should it thus get a more specific name in GCC,
> than the generic liboffload?

Yes, this library generates calls to the Intel-specific Coprocessor Offload
Interface (COI). I think the name of the library may be changed; we'll
discuss it when I submit the patch.

>> Additionally to that sources we going to add few headers [...]
>> and couple of new sources
>
> For interfacing with GCC, presumably. You haven't stated it explicitly,
> but do I assume right that this work will be going onto the
> gomp-4_0-branch, integrated with the offloading work developed there, as
> a plugin for libgomp?

Not exactly. I was talking about a COI emulator, which will allow testing
of offloading without any external library dependency or hardware. The
libgomp <-> liboffload plug-in is also ready, but it needs no such
approval, so it will be submitted as a separate patch.

--
Thanks, K

> Grüße,
> Thomas
Re: Using particular register class (like floating point registers) as spill register class
On 05/19/2014 01:19 PM, Ramana Radhakrishnan wrote:
> On Mon, May 19, 2014 at 1:02 PM, Andrew Haley wrote:
>> [...]
>>
>> which seems a bit suboptimal, given that these double registers now have
>> to be saved in the prologue.
>
> That looks a bit suspicious - is there a pre-processed file you can put
> on bugzilla for someone to take a look at, with command line options et
> al?
I'll try, but I'm using precompiled headers so it's a bit tricky. I'll let you know. Andrew.
adding support for vxworks os variants
Hello,

Here is a quick description of changes we would like to contribute to the
VxWorks ports, with a preliminary query to maintainers on what would be the
most appropriate form for such changes to be deemed acceptable:

On a few CPU families, variants of the VxWorks OS are available. Typically,
there is the base VxWorks 6 or AE (653) kernel & environment, then also:

- a simulator (VxSim) on some targets,
- a "CERT" variant of the OS to address requirements specific to safety
  certification standards,
- a "MILS" variant of the OS to address requirements specific to security
  standards,
- an "SMP" variant of the OS for multiprocessor systems.

We (AdaCore) have been maintaining toolchains for a few of these variants
over the years, with integrated facilities allowing easier use of the
toolchain directly from the command line.

For MILS, the set of changes is significant enough to warrant a specific
triplet. I'll be posting the patches soon.

For the other variants, the need for separate triplets is less clear.
Indeed, what the changes do is essentially control link-time behavior,
typically:

- for VxSim or SMP, the crt files and libraries we need to link with are
  located in a different directory;
- for CERT, the system entry points available to the application are all in
  one big object, and we're not supposed to link in anything else by default.

The WindRiver environment drives everything through a GUI and Makefiles.
E.g. for CERT, this explicitly links with -nostdlib to remove all the
defaults, then adds what is really needed/allowed.

Working directly from the command line is often useful, and doing the
correct thing (getting rid of inappropriate defaults, figuring out the
correct set of -Ls, ...) is cumbersome.

For VxSim or SMP, having entirely separate toolchains with different
triplets for such minor differences seemed overkill and impractical for
users, so we have added "-vxsim" and "-vxsmp" command line options to our
toolchains to help.

We have done the same for the CERT variants, with a "-vxcert" command line
option, but wonder if a separate triplet wouldn't actually be better in
this case.

One small concern is that the system toolchains don't know about the new
options, and we think that it might be of interest to minimize the
interface differences.

Thoughts?

Thanks in advance for your feedback,

With Kind Regards,

Olivier
Re: adding support for vxworks os variants
On May 19, 2014, at 15:41, Olivier Hainque wrote:
> For vxsim or smp, having entirely separate toolchains with different
> triplets for so minor differences seemed overkill and impractical for
> users, so we have added "-vxsim" and "-vxsmp" command line options to
> our toolchains to help.
>
> We have done the same for the cert variants, with a "-vxcert" command
> line option, but wonder if a separate triplet wouldn't actually be
> better in this case.
>
> One small concern is that the system toolchains don't know about the
> new options, and we think that it might be of interest to minimize the
> interface differences.
>
> Thoughts ?

One point I forgot to mention: we have considered the use of external spec
files as an alternative strategy. We have started experimenting with it and
don't yet have a lot of feedback on this scheme.

Your opinion on this alternate option (how much more viable/flexible/
acceptable it would likely be) would be most appreciated. I'm of course
happy to provide extra details on what we have been doing if needed.

Olivier
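For readers unfamiliar with that mechanism: GCC can read a user-supplied
spec file via -specs= and override individual specs in it. A minimal sketch
of the kind of override under discussion might look like the following; the
file name and library path are placeholders invented for illustration, not
AdaCore's actual layout:

   # illustrative-vxsim.specs -- sketch only; the directory below is made up.
   %rename lib orig_lib

   *lib:
   -L/path/to/vxworks/vxsim/lib %(orig_lib)

It would be used as "gcc -specs=illustrative-vxsim.specs main.c", leaving
the default triplet and driver otherwise untouched.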
Re: [GSoC] writing test-case
Hi,

On Thu, 15 May 2014, Richard Biener wrote:
> To me predicate (and capture without expression or predicate) differs
> from expression in that predicate is clearly a leaf of the expression
> tree while we have to recurse into expression operands.
>
> Now, if we want to support applying predicates to the midst of an
> expression, like
>
>   (plus predicate(minus @0 @1)
>         @2)
>   (...)
>
> then this would no longer be true. At the moment you'd write
>
>   (plus (minus@3 @0 @1)
>         @2)
>   if (predicate (@3))
>   (...)
>
> which makes it clearer IMHO (with the decision tree building you'd
> apply the predicates after matching the expression tree anyway I
> suppose, so code generation would be equivalent).

Syntax-wise I had this idea for adding generic predicates to expressions:

  (plus (minus @0 @1):predicate
        @2)
  (...)

Whether it's prefix or suffix doesn't matter much, but using a different
syntax to separate expression from predicate seems to make things clearer.
Optionally adding things like and/or for predicates might also make sense:

  (plus (minus @0 @1):positive_p(@0) || positive_p(@1)
        @2)
  (...)

Ciao,
Michael.
[GSoC] first phase
Hi,

Unfortunately I shall need to take this week off, due to university exams,
which run up to 27th May. I will start working from the 28th on pattern
matching with the decision tree, and try to cover up for the first week. I
am extremely sorry about this. I thought I would be able to do both during
exam week, but the exam load has become too much -:(

In the first phase (up to 23rd June), I hope to get genmatch ready:

a) pattern matching with decision tree.
b) Add patterns to test genmatch.
c) Depending upon the patterns, extending the meta-description.
d) Other fixes:

* capturing outermost expressions. For example, this pattern does not get
  simplified:

  (match_and_simplify
    (plus@2 (negate @0) @1)
    if (!TYPE_SATURATING (TREE_TYPE (@2)))
    (minus @1 @0))

  I guess this happens because in write_nary_simplifiers:

    if (s->match->type != OP_EXPR)
      continue;

  Maybe this is not the correct way to fix this; should we also pass lhs to
  the generated gimple_match_and_simplify? I guess that would be the
  capture for the outermost expression. For the above pattern, I guess @2
  represents lhs. So for this test-case:

  int foo (int x, int y)
  {
    int t1 = -x;
    int t2 = t1 + y;
    return t2;
  }

  t2 would be @2, t1 would be @0 and y would be @1. Is that correct? This
  would create issues when lhs is NULL, for example, in a call to built-in
  functions?

* avoid using statement expressions for code gen of expressions.

* rewriting the code generator using visitor classes, and other refactoring
  (using std::string for example), etc.

I have a very rough time-line in mind for completing the tasks:

28th May - 31st May
a) Have a test-case for each pattern present (except COND_EXPR) in match.pd.
   I guess most of it is already done; a few patterns are remaining.
b) Small fixes (for example, those mentioned above).
c) Have an initial idea/prototype for implementing the decision tree.

1st June - 15th June
a) Implementing the decision tree.
b) Adding patterns in match.pd to test the decision tree, and accompanying
   test-cases in tree-ssa/match-*.c.

16th June - 23rd June
a) Support for GENERIC code generation.
b) Refactoring and backup time for backlog.

GENERIC code generation: I am a bit confused about this. Currently, pattern
matching is implemented for GENERIC. However, I believe simplification is
done on GIMPLE. For example:

  (match_and_simplify
    (plus (negate @0) @1)
    (minus @0 @1))

If the given input is GENERIC, it would do matching on GENERIC, but shall
transform (minus @0 @1) to its GIMPLE equivalent. Is that correct?

* Should we have a separate GENERIC match-and-simplify API, like for
  gimple, instead of having GENERIC matching in gimple_match_and_simplify?

* Do we add another pattern type, something like generic_match_and_simplify,
  that will do the transform on GENERIC? For example:

  (generic_match_and_simplify
    (plus (negate @0) @1)
    (minus @0 @1))

  would produce the GENERIC equivalent of (minus @0 @1). Or maybe keep
  match_and_simplify, and tell the transform operand to produce GENERIC.
  Something like:

  (match_and_simplify
    (plus (negate @0) @1)
    GENERIC: (minus @0 @1))

Another thing I would like to do in the first phase is figure out the
dependencies of tree-ssa-forwprop on GENERIC folding (for instance,
fold_comparison patterns).

Thanks and Regards,
Prathamesh
Re: we are starting the wide int merge
Richard Sandiford writes:
> Gerald Pfeifer writes:
>> On Sat, 17 May 2014, Richard Sandiford wrote:
>>> To rule out one possibility: which GCC are you using for stage1?
>>
>> I think that may be the smoking gun. When I use GCC 4.7 to bootstrap,
>> FreeBSD 8, 9 and 10 all build fine on i386 (= i486) and amd64.
>>
>> When I use the system compiler, which is GCC 4.2 on FreeBSD 8 and 9
>> and clang on FreeBSD 10, things fail on FreeBSD 10...
>>
>> ...with a bootstrap comparison failure of stages 2 and 3 on i386:
>> https://redports.org/~gerald/20140518230801-31619-208277/gcc410-4.10.0.s20140518.log
>
> Do you get exactly the same comparison failures using clang and GCC 4.2
> as the stage1 compiler? That would rule out the system compiler
> miscompiling stage1.

I couldn't reproduce this with GCC 4.2 but I could with clang. The problem
is that the C++ frontend's template instantiation code has several
instances of foo (..., bar (...), bar (...), ...), where bar (...) can
create new decls. The numbering of the decls can then depend on which order
the compiler chooses to evaluate the function arguments. This later causes
code differences if the decl uids are used as tie-breakers to get a stable
sort.

I was just unlucky that this happened to trigger for the new wi:: code. :-)

I'm testing a patch now. It might need more than one iteration, but
hopefully I'll have something to submit tomorrow.

Thanks,
Richard
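To make the mechanism concrete, here is a small self-contained illustration
(my own example, not GCC code): when both argument expressions allocate an
id, C++ leaves their evaluation order unspecified, so two conforming
compilers can hand the ids out in opposite orders, and anything later
sorted with those ids as tie-breakers can come out differently.

   // Illustration only -- not GCC code.
   #include <cstdio>

   static int counter;

   // Stands in for code that creates a decl and assigns it a uid.
   static int new_decl_uid () { return counter++; }

   static void record_pair (int a, int b) { std::printf ("%d %d\n", a, b); }

   int main ()
   {
     // The two calls below may be evaluated in either order, so whether the
     // first argument receives uid 0 or uid 1 is compiler-dependent.
     record_pair (new_decl_uid (), new_decl_uid ());
     return 0;
   }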
dynamic_cast of a reference and -fno-exceptions
Hi,

should gcc warn, at least, if a dynamic_cast of a reference is used when
-fno-exceptions is specified? At least 4.8.2 doesn't complain. If so, I can
implement the fix.

Example:

struct Base
{
    virtual void f(){}
};

struct Der : Base {};

int main()
{
    Der d;
    Base& b = d;

    dynamic_cast<Der&>(b);
}

Daniel.

--
Daniel F. Gutson
Chief Engineering Officer, SPD

San Lorenzo 47, 3rd Floor, Office 5
Córdoba, Argentina

Phone: +54 351 4217888 / +54 351 4218211
Skype: dgutson
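Some context on why a warning might be wanted (my note, not from the mail):
unlike the pointer form, a dynamic_cast to a reference type has no way to
report failure other than throwing std::bad_cast, so with -fno-exceptions
the failure path cannot behave as the language requires. A failing cast
looks like this (illustrative code, not from the report):

   #include <typeinfo>

   struct B { virtual ~B() {} };
   struct D : B {};

   void g ()
   {
     B base_only;
     B& br = base_only;
     // br does not refer to a D, so with exceptions enabled this throws
     // std::bad_cast; with -fno-exceptions there is no conforming way to
     // report the failure.
     D& dr = dynamic_cast<D&>(br);
     (void) dr;
   }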
Re: negative latencies
On 19-May-14 01:02 PM, Ajit Kumar Agarwal wrote:
> Is it the case of code speculation where the negative latencies are used?

No. It is an exposed pipeline where instructions read registers during the
required cycle. So if one instruction produces its results in the third
pipeline stage and a second instruction reads the register in the sixth
pipeline stage, the second instruction can read the results of the first
instruction even if it is issued three cycles earlier.
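A cycle-by-cycle sketch of that example (my own illustration, counting
stages from 1 at issue and assuming one stage per cycle):

   cycle:           0     1     2     3     4     5
   consumer (I2):   st1   st2   st3   st4   st5   st6   <- reads operand in stage 6
   producer (I1):                     st1   st2   st3   <- result ready in stage 3

   I2 issues at cycle 0 and reads its operand at cycle 5; I1 issues at
   cycle 3 and produces its result at cycle 5.  The value arrives just in
   time even though I2 was issued three cycles earlier, i.e. the effective
   latency of the I1 -> I2 dependence is -3.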
Zero/Sign extension elimination using value ranges
This is based on my earlier patch
https://gcc.gnu.org/ml/gcc-patches/2013-10/msg00452.html. Before I post the
new set of patches, I would like to make sure that I have understood the
review comments and that my idea makes sense and is acceptable. Please let
me know if I am missing anything or my assumptions are wrong.

To recap the basic idea: when GIMPLE_ASSIGN stmts are expanded to RTL, if
we can prove that the zero/sign extension to fit the type is redundant, we
can generate RTL without it. For example, when an expression is evaluated
and its value is assigned to a variable of type short, the generated RTL
currently looks similar to

  (set (reg:SI 110) (zero_extend:SI (subreg:HI (reg:SI 117) 0)))

Using value ranges, if we can show that the value of the expression which
is present in register 117 is within the limits of short and there is no
sign conversion, we do not need to perform the zero_extend.

Cases to handle here are:

1. Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they are
required for type correctness. We have two cases here:

A) Mode is smaller than word_mode. This is usually where the zero/sign
extensions are showing up in the final assembly. For example:

  int = (int) short

usually expands to

  (set (reg:SI ...) (sign_extend:SI (subreg:HI (reg:SI ...) 0)))

We can expand this as

  (set (reg:SI ...) (reg:SI ...))

if the following is true:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value. I.e., in cases like char = (char) short, we
   check that the value in the short is representable in char type (i.e.,
   look at the value range of the RHS SSA_NAME and see if it can be
   represented in the type of the LHS without overflowing).

The subreg here is not a paradoxical subreg. We are removing the subreg and
zero/sign extend here.

I am assuming here that QI/HI registers are represented in SImode
(basically word_mode) and that zero/sign extend is used as in
(zero_extend:SI (subreg:HI (reg:SI 117))).

B) Mode is larger than word_mode:

  long = (long) int

usually expands to

  (set:DI (sign_extend:DI (reg:SI)))

We would have to expand this as a paradoxical subreg

  (set:DI (subreg:DI (reg:SI)))

I am not sure that these cases result in actual zero/sign extensions being
generated, therefore I think we should skip this case altogether.

2. Second are promotions required by the target (PROMOTE_MODE) that do
arithmetic on wider registers, like:

  char = char + char

In this case we will have the value ranges of the RHS char1 and char2. We
will have to compute the value range of (char1 + char2) in the promoted
mode (from the value ranges stored in the char1 and char2 SSA_NAMEs) and
see if that value range can be represented in the LHS type. Once again, if
the following is true, we can remove the subreg and zero/sign extension in
the assignment:
1. The values stored in RHS and LHS are of the same signedness.
2. The type can hold the value.

Also, when the LHS is promoted and thus the target is (subreg:XX N), the
RHS has been expanded in XXmode. Depending on the value range, when mode XX
is bigger than word_mode, set this to a paradoxical subreg of the expanded
result. However, since we are only interested in XXmode smaller than
word_mode (that is where most of the final zero/sign extension asm comes
from), we don't have to consider paradoxical subregs here.

Does this make sense?

Thanks,
Kugan
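As a concrete illustration of case 2 (my own example, not taken from the
patch): once VRP has bounded the operands, the promoted-mode result
provably fits the destination type, so the extension emitted for the
narrowing assignment is redundant.

   /* Illustration only -- not from the patch.  With PROMOTE_MODE the
      addition below is performed in the wider promoted mode.  The masks
      give a and b the range [0, 0x32], so a + b is in [0, 0x64]; that
      fits in unsigned char without a signedness change, so the
      zero-extension normally emitted when the promoted result is
      assigned back to c is redundant.  */
   unsigned char
   add_masked (unsigned char a, unsigned char b)
   {
     a &= 0x32;
     b &= 0x32;
     unsigned char c = a + b;
     return c;
   }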
Re: Zero/Sign extension elimination using value ranges
On Tue, May 20, 2014 at 12:27:31PM +1000, Kugan wrote:
> 1. Handling NOP_EXPR or CONVERT_EXPR that are in the IL because they are
> required for type correctness. We have two cases here:
>
> A) Mode is smaller than word_mode. This is usually where the zero/sign
> extensions are showing up in the final assembly. For example:
>
>   int = (int) short
>
> usually expands to
>
>   (set (reg:SI ...) (sign_extend:SI (subreg:HI (reg:SI ...) 0)))
>
> We can expand this as
>
>   (set (reg:SI ...) (reg:SI ...))
>
> if the following is true:
> 1. The values stored in RHS and LHS are of the same signedness.
> 2. The type can hold the value. I.e., in cases like char = (char) short,
>    we check that the value in the short is representable in char type
>    (i.e., look at the value range of the RHS SSA_NAME and see if it can
>    be represented in the type of the LHS without overflowing).
>
> The subreg here is not a paradoxical subreg. We are removing the subreg
> and zero/sign extend here.
>
> I am assuming here that QI/HI registers are represented in SImode
> (basically word_mode) and that zero/sign extend is used as in
> (zero_extend:SI (subreg:HI (reg:SI 117))).

Wouldn't it be better to just set the proper flags on the SUBREG based on
the value range info (SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_P)?
Then not only could the optimizers eliminate the zext/sext when possible,
but all other optimizations could benefit from that.

	Jakub
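For readers unfamiliar with those flags, here is a minimal sketch of what
setting them during expansion could look like. This is my own sketch, not a
posted patch: the helper name and its parameters are invented; only
gen_rtx_SUBREG, SUBREG_PROMOTED_VAR_P and SUBREG_PROMOTED_UNSIGNED_SET are
real rtl.h interfaces. It also assumes the value-range check discussed
above has already been done and that the lowpart lives at byte offset 0.

   /* Sketch only -- not from a posted patch.  */
   static rtx
   make_promoted_lowpart (enum machine_mode outer_mode, rtx promoted_reg,
                          bool value_fits, bool unsignedp)
   {
     rtx sub = gen_rtx_SUBREG (outer_mode, promoted_reg, 0);
     if (value_fits)
       {
         /* Record that the inner register already holds the value
            correctly extended, so later passes may drop the redundant
            zero/sign extension instead of re-extending.  */
         SUBREG_PROMOTED_VAR_P (sub) = 1;
         SUBREG_PROMOTED_UNSIGNED_SET (sub, unsignedp);
       }
     return sub;
   }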