Re: dom1 prevents vectorization via partial loop peeling?
On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>> <bb 2>:
>>
>> <bb 3>:
>> # i_11 = PHI <0(2), i_9(7)>
>> _5 = a[i_11];
>> _6 = i_11 & _5;
>> if (_6 != 0)
>>   goto <bb 4>;
>> else
>>   goto <bb 5>;
>>
>> <bb 4>:
>>
>> <bb 5>:
>> # m_2 = PHI <5(4), 4(3)>
>> _7 = m_2 * _5;
>> b[i_11] = _7;
>> i_9 = i_11 + 1;
>> if (i_9 != 32)
>>   goto <bb 7>;
>> else
>>   goto <bb 6>;
>>
>> <bb 7>:
>> goto <bb 3>;
>>
>> <bb 6>:
>> return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge 2->3,
> then we'll always traverse 3->5.  The net effect is like peeling the first
> iteration because we copy bb3.  The original will be reached via 7->3 (ie,
> loop iterating), the copy via 2->3', and 3' will have its conditional
> removed and will unconditionally transfer control to bb5.
>
> This is a known problem, but we really don't have any good heuristics for
> when to do this vs when not to do it.
>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>   for (int i = 0; i < N; i++)
>>     {
>>       int cond = (a[i] & i) ? -1 : 0;  // extra variable here
>>       int m = (cond) ? 5 : 4;
>>       b[i] = a[i] * m;
>>     }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>> <bb 2>:
>>
>> <bb 3>:
>> # i_15 = PHI <0(2), i_10(7)>
>> _6 = a[i_15];
>> _7 = i_15 & _6;
>> if (_7 != 0)
>>   goto <bb 4>;
>> else
>>   goto <bb 6>;
>>
>> <bb 4>:
>> # m_3 = PHI <4(6), 5(3)>
>> _8 = m_3 * _6;
>> b[i_15] = _8;
>> i_10 = i_15 + 1;
>> if (i_10 != 32)
>>   goto <bb 7>;
>> else
>>   goto <bb 5>;
>>
>> <bb 7>:
>> goto <bb 3>;
>>
>> <bb 5>:
>> return;
>>
>> <bb 6>:
>> goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than bb4
>> falling through from just above bb5.  However, this prevents dom1 from
>> doing the partial peeling, and dom1 only changes the "goto bb7" into a
>> "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4.
>
>> (1) dom1 should really, in the second case, perform the same partial
>> peeling that it does in the first testcase, if that is what it thinks is
>> desirable.  (Of course, we might want to fix that only later, as ATM
>> that'd take us backwards.)
>
> Please file a BZ.  It could be something simple, or we might be hitting
> one of Zdenek's heuristics around keeping overall loop structure.
>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?),
>> but I'm inclined to think that if it's doing such optimizations, it's
>> for a good reason ;).  I guess there'll be other times where we *cannot*
>> do partial peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when it
> was last discussed with Zdenek.  I think the fundamental issue is we can't
> really predict when threading the loop is going to interfere with later
> optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first iteration
> back into the loop in the vectorizer.

Well.  In this case we hit

  /* If one of the loop header's edge is an exit edge then do not
     apply if-conversion.  */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
    if (loop_exit_edge_p (loop, e))
      return false;

which is simply because even after if-conversion we'll at least end up
with a non-empty latch block, which is what the vectorizer doesn't
support.

DOM rotated the loop into this non-canonical form.  Running loop header
copying again would probably undo this.

Richard.

> Jeff
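(Alan's first testcase is not quoted above.  Judging from the GIMPLE dump
and the "extra variable here" comment marking the only difference in the
second testcase, it was presumably the same loop without the intermediate
"cond" variable -- a reconstruction for reference, not the original
source:)

  #define N 32

  int a[N];
  int b[N];

  int foo ()
  {
    for (int i = 0; i < N; i++)
      {
        int m = (a[i] & i) ? 5 : 4;
        b[i] = a[i] * m;
      }
  }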
Re: missing explanation of Stage 4 in GCC Development Plan document
On Tue, Apr 28, 2015 at 7:01 AM, Thomas Preud'homme wrote:
>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
>> Behalf Of James Greenhalgh
>
> Hi James,
>
>> The stages, timings, and exact rules for which patches are acceptable
>> and when, seem to have drifted quite substantially from that page.
>> Stage 2 has been missing for 7 years now, Stages 3 and 4 seem to blur
>> together, the "regression only" rule is more like "non-invasive fixes
>> only" (likewise for the support branches).
>
> Don't stage3 and stage4 differ in that substantial changes are still
> allowed for backends in stage3?

stage3 is for _general_ bugfixing while stage4 is for _regression_
bugfixing.

Richard.

>> So, why not try to reflect practice and form a two-stage model (and
>> name the stages in a descriptive fashion)?
>>
>> Development:
>>
>> Expected to last for around 70% of a release cycle.  During this
>> period, changes of any nature may be made to the compiler.  In
>> particular, major changes may be merged from branches.  In order to
>> avoid chaos, the Release Managers will ask for a list of major
>> projects proposed for the coming release cycle before the start of
>> this stage.  They will attempt to sequence the projects in such a way
>> as to cause minimal disruption.  The Release Managers will not reject
>> projects that will be ready for inclusion before the end of the
>> development phase.  Similarly, the Release Managers have no special
>> power to accept a particular patch or branch beyond what their status
>> as maintainers affords.  The role of the Release Managers is merely to
>> attempt to order the inclusion of major features in an organized
>> manner.
>>
>> Stabilization:
>>
>> Expected to last for around 30% of a release cycle.  New functionality
>> may not be introduced during this period.  Changes during this phase
>> of the release cycle should focus on preparing the trunk for a high
>> quality release, free of major regression and code generation issues.
>> As we near the end of a release cycle, changes will only be accepted
>> where they fix a regression, or are sufficiently non-intrusive as to
>> not introduce a risk of affecting the quality of the release.
>
> If we keep referring to stages in our communication it would be nice to
> document them.  I'm not saying this rewording is wrong, I just think we
> should add 1-2 sentences to explain the stages (I know it confused me
> at first because stage4 was not listed).  Alternatively we could just
> refer to these 2 names only (development and stabilization).
>
> Best regards,
>
> Thomas
RE: dom1 prevents vectorization via partial loop peeling?
-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?

[snip: full quote of Richard's mail above]

>> Well.  In this case we hit
>>
>>   /* If one of the loop header's edge is an exit edge then do not
>>      apply if-conversion.  */
>>   FOR_EACH_EDGE (e, ei, loop->header->succs)
>>     if (loop_exit_edge_p (loop, e))
>>       return false;
>>
>> which is simply because even after if-conversion we'll at least end up
>> with a non-empty latch block, which is what the vectorizer doesn't
>> support.
>>
>> DOM rotated the loop into this non-canonical form.  Running loop
>> header copying again would probably undo this.

With the path-splitting approach I have proposed, the back-edge node can
be copied into its predecessors, creating an empty latch.  This would
enable vectorization in the above scenario.

Thanks & Regards
Ajit

Richard.

> Jeff
Re: dom1 prevents vectorization via partial loop peeling?
Ajit Kumar Agarwal wrote:
> -----Original Message-----
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Biener
> Sent: Tuesday, April 28, 2015 4:12 PM
> To: Jeff Law
> Cc: Alan Lawrence; gcc@gcc.gnu.org
> Subject: Re: dom1 prevents vectorization via partial loop peeling?
>
> On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law wrote:
>> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>> [snip]
>>> dom1 then peeled part of the first loop iteration, producing:
>>
>> Yup.  The jump threading code realized that if we traverse the edge
>> 2->3, then we'll always traverse 3->5.  The net effect is like peeling
>> the first iteration because we copy bb3.  The original will be reached
>> via 7->3 (ie, loop iterating), the copy via 2->3', and 3' will have
>> its conditional removed and will unconditionally transfer control to
>> bb5.
>>
>> This is a known problem, but we really don't have any good heuristics
>> for when to do this vs when not to do it.

Ah, yes, I'd not realized this was connected to the jump-threading issue,
but I see that now.  As you say, the best heuristics are unclear, and I'm
not keen on trying *too hard* to predict what later phases will/won't do
or do/don't want...maybe if there are simple heuristics that work, but I
would aim more at making later phases work with what(ever) they might
get???  One (horrible) possibility that I will just throw out (and then
duck) is to do something akin to tree-if-conversion's
"gimple_build_call_internal (IFN_LOOP_VECTORIZED, " ...

>>> In contrast, a slightly-different testcase:
>>> [snip]
>>
>> I would have still expected it to thread 2->3, 3->6->4.

Ok, I'll look into that.

>>> (1) dom1 should really, in the second case, perform the same partial
>>> peeling that it does in the first testcase, if that is what it thinks
>>> is desirable.  (Of course, we might want to fix that only later, as
>>> ATM that'd take us backwards.)
>>
>> Please file a BZ.  It could be something simple, or we might be
>> hitting one of Zdenek's heuristics around keeping overall loop
>> structure.
>>
>>> Alternatively, maybe we don't want dom1 doing that sort of thing (?),
>>> but I'm inclined to think that if it's doing such optimizations, it's
>>> for a good reason ;).  I guess there'll be other times where we
>>> *cannot* do partial peeling of later iterations...
>>
>> It's an open question -- we never reached any kind of conclusion when
>> it was last discussed with Zdenek.  I think the fundamental issue is
>> we can't really predict when threading the loop is going to interfere
>> with later optimizations or not.  The heuristics we have are marginal
>> at best.
>>
>> The one thought we've never explored was re-rolling that first
>> iteration back into the loop in the vectorizer.

Yeah, there is that ;).  So besides trying to partially-peel the next N
iterations, the other approach - that strikes me as sanest - is to finish
(fully-)peeling off the first iteration, and then to vectorize from then
on.  In fact the ideal (I confess I have no knowledge of the GCC
representation/infrastructure here) would probably be for the vectorizer
(in vect_analyze_scalar_cycles) to look for a point in the loop, or
rather a 'cut' across the loop, that avoids breaking any non-cyclic
use-def chains, and to use that as the loop header.  That analysis could
be quite complex tho ;)...and I can see that having peeled the first 1/2
iteration, we may then end up having to peel the next (vectorization
factor - 1/2) iterations too to restore alignment!  Whereas with
rerolling ;)...is there perhaps some reasonable way to keep markers
around to make the rerolling approach more feasible???

> Well.  In this case we hit
>
>   /* If one of the loop header's edge is an exit edge then do not
>      apply if-conversion.  */
>   FOR_EACH_EDGE (e, ei, loop->header->succs)
>     if (loop_exit_edge_p (loop, e))
>       return false;
>
> which is simply because even after if-conversion we'll at least end up
> with a non-empty latch block, which is what the vectorizer doesn't
> support.
>
> DOM rotated the loop into this non-canonical form.  Running loop header
> copying again would probably undo this.

So I've just posted
https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html
which fixes this limitation of if-conversion.  As I first wrote though,
the vectorizer still fails, because the PHI nodes incoming to the loop
header are neither reductions nor inductions.  I'll see if I can run loop
header copying again, as you suggest...

> With the path-splitting approach I have proposed, the back-edge node
> can be copied into its predecessors, creating an empty latch.  This
> would enable vectorization in the above scenario.
5.1.0/4.9.2 native mingw64 lto-wrapper.exe issues (PR 65559 and 65582)
I was told I should repost this on this ML rather than the gcc-help list
I originally posted this under.  Here was my original thread:
https://gcc.gnu.org/ml/gcc-help/2015-04/msg00167.html

I came across PR 65559 and PR 65582 while investigating why I was getting
the "lto1.exe: internal compiler error: in read_cgraph_and_symbols, at
lto/lto.c:2947" error during a native MINGW64 LTO build.  This also seems
to be present when enabling bootstrap-lto within 5.1.0, presenting an
error message akin to what is listed in PR 65582.

1. Under:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/lto-wrapper.c;h=404cb68e0d1f800628ff69b7672385b88450a3d5;hb=HEAD#l927
lto-wrapper processes command-line params for filenames matching (in my
case) "./.libs/libspeexdsp.a@0x44e26" and separates the filename from the
offset into separate variables.  Since the following check to see if that
file exists by opening it doesn't use the parsed filename variable, and
instead continues to use the argv parameter, the attempt to open it
always fails and that file is never specifically parsed for LTO options.

2. One other issue I've noticed in my build happens as a result of the
open call when trying to parse the options using libiberty.  Under native
mingw64, the open call opens the object file in text mode and then
eventually passes the fd to libiberty's simple_object_internal_read
within simple-object.c.  The issue springs up when trying to perform a
read: it hits a CTRL+Z (0x1A) within the object, at which point the next
read returns 0 bytes, triggering the break out of the loop and a
subsequent "file too short" error message which gets silently ignored.
In my testing, changing the 0x1A within the object file to something else
returns the full read (or more data, until another CTRL+Z is hit).  Ref:
https://msdn.microsoft.com/en-us/library/wyssk1bs.aspx

This still happens within 4.9.2 and 4.9 trunk; however, in 4.9 the object
file being checked for LTO sections is still passed along in the command
line, whereas in 5.1.0 it gets skipped but is still listed within the res
file, most likely leading to the ICE in PR 65559.  This would also
explain Kai's comment on why this issue only occurs on native builds.
The ICE in 5.1.0 can also be avoided by using an lto-wrapper from 4.9 or
prior, allowing the link to complete, though no LTO options will get
processed due to #1.

This is my first report, so I wouldn't mind some guidance.  I'm familiar
enough with debugging to gather whatever other level of detail is
requested.  Most of this was found using gdb.

--
Matt Breedlove
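(To illustrate issue 2, here is a minimal standalone demonstration of
the text-mode CTRL+Z behaviour on Windows -- an illustrative sketch, not
the actual lto-wrapper code; "foo.o" is a placeholder for any object
file containing a 0x1A byte:)

  /* Compile natively with mingw64.  In text mode, _read() treats the
     first CTRL+Z (0x1A) byte as end-of-file, so the file appears
     truncated; _O_BINARY reads the whole file.  */
  #include <fcntl.h>
  #include <io.h>
  #include <stdio.h>

  int main (void)
  {
    static char buf[1 << 20];

    int fd_text = _open ("foo.o", _O_RDONLY | _O_TEXT);
    int fd_bin  = _open ("foo.o", _O_RDONLY | _O_BINARY);

    /* Text mode stops at the first 0x1A; binary mode does not.  */
    printf ("text mode:   %d bytes\n", _read (fd_text, buf, sizeof buf));
    printf ("binary mode: %d bytes\n", _read (fd_bin, buf, sizeof buf));
    return 0;
  }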
Re: PR65416, alloca on xtensa
On 03/13/2015 06:04 PM, Marc Gauthier wrote:
> Other than the required 16-byte stack alignment, there's nothing in
> the ABI that requires these extra 16 bytes.  Perhaps there was a bad
> implementation of the alloca exception handler at some point a long
> time ago that prompted the extra 16 bytes?

What's the alignment of max_align_t on this architecture?  Although it
should be possible to get a 16-byte-aligned 1-byte object in any
(however misaligned) 16-byte window on the stack …

--
Florian Weimer / Red Hat Product Security
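(For reference, a quick way to check that -- max_align_t is C11, so this
assumes -std=c11 or later:)

  #include <stdio.h>
  #include <stddef.h>

  int main (void)
  {
    /* Prints the ABI's maximum fundamental alignment, e.g. 4, 8 or 16.  */
    printf ("%zu\n", _Alignof (max_align_t));
    return 0;
  }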
Running GCC testsuite with --param option (requires space in argument)
Has anyone run the GCC testsuite using a --param option?  I am trying to
do something like:

  export RUNTESTFLAGS='--target_board=multi-sim/--param foo=1'
  make check

But the space in the '--param foo=1' option is causing dejagnu to fail.
Perhaps there is a way to specify a param value without a space in the
option?  If there is, I could not find it.  I tried:

  export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
  export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'

But neither of those worked either.

Steve Ellcey
sell...@imgtec.com
Re: Running GCC testsuite with --param option (requires space in argument)
On Tue, Apr 28, 2015 at 01:55:42PM -0700, Steve Ellcey wrote:
> Has anyone run the GCC testsuite using a --param option?  I am trying
> to do something like:
>
>   export RUNTESTFLAGS='--target_board=multi-sim/--param foo=1'
>   make check
>
> But the space in the '--param foo=1' option is causing dejagnu to fail.
> Perhaps there is a way to specify a param value without a space in the
> option?  If there is, I could not find it.
>
> I tried:
>
>   export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
>   export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'

Have you tried

  export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'

?

Jakub
Re: Running GCC testsuite with --param option (requires space in argument)
On Tue, 2015-04-28 at 22:58 +0200, Jakub Jelinek wrote:
>> I tried:
>>
>>   export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
>>   export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'
>
> Have you tried
>   export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'
> ?
>
> Jakub

Nope, but it seems to work.  That syntax is not documented in
invoke.texi.  I will see about submitting a patch (or at least a
documentation bug report).

Steve Ellcey
gcc-5-20150428 is now available
Snapshot gcc-5-20150428 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20150428/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch revision 222550

You'll find:

 gcc-5-20150428.tar.bz2               Complete GCC

  MD5=6068bb8e23caa1172a127026e05ed311
  SHA1=7a274cf30fdf3aa1bc347be68787212e1c90ac7d

Diffs from 5-20150421 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.
avr-gcc generating really dumb code
I wrote a small function to convert u8 to hex:

  // converts 4-bit nibble to ascii hex
  uint8_t nibbletohex(uint8_t value)
  {
      if ( value > 9 )
          value += 'A' - '0';
      return value + '0';
  }

  // returns value as 2 ascii characters in a 16-bit int
  uint16_t u8tohex(uint8_t value)
  {
      uint16_t hexdigits;
      uint8_t hidigit = (value >> 4);
      hexdigits = (nibbletohex(hidigit) << 8);
      uint8_t lodigit = (value & 0x0F);
      hexdigits |= nibbletohex(lodigit);
      return hexdigits;
  }

I compiled it with avr-gcc -Os using 4.8 and 5.1 and got the same code:

  0000007a <u8tohex>:
    7a: 28 2f   mov  r18, r24
    7c: 22 95   swap r18
    7e: 2f 70   andi r18, 0x0F ; 15
    80: 2a 30   cpi  r18, 0x0A ; 10
    82: 08 f0   brcs .+2       ; 0x86
    84: 2f 5e   subi r18, 0xEF ; 239
    86: 20 5d   subi r18, 0xD0 ; 208
    88: 30 e0   ldi  r19, 0x00 ; 0
    8a: 32 2f   mov  r19, r18
    8c: 22 27   eor  r18, r18
    8e: 8f 70   andi r24, 0x0F ; 15
    90: 8a 30   cpi  r24, 0x0A ; 10
    92: 08 f0   brcs .+2       ; 0x96
    94: 8f 5e   subi r24, 0xEF ; 239
    96: 80 5d   subi r24, 0xD0 ; 208
    98: a9 01   movw r20, r18
    9a: 48 2b   or   r20, r24
    9c: ca 01   movw r24, r20
    9e: 08 95   ret

There's some completely pointless code there, like loading 0 into r19
and immediately overwriting it with the contents of r18 (88, 8a).  Other
register use is convoluted.  The compiler should at least be able to
generate the following code (5 fewer instructions):

  28 2f   mov  r18, r24
  22 95   swap r18
  2f 70   andi r18, 0x0F ; 15
  2a 30   cpi  r18, 0x0A ; 10
  08 f0   brcs .+2       ; 0x86
  2f 5e   subi r18, 0xEF ; 239
  20 5d   subi r18, 0xD0 ; 208
  32 2f   mov  r25, r18
  8f 70   andi r24, 0x0F ; 15
  8a 30   cpi  r24, 0x0A ; 10
  08 f0   brcs .+2       ; 0x96
  8f 5e   subi r24, 0xEF ; 239
  80 5d   subi r24, 0xD0 ; 208
  08 95   ret

Hand-optimized for size, I was able to write it with 3 fewer
instructions:

  .macro addi Rd, K
      subi \Rd, -(\K)
  .endm

  .global u8tohex
  u8tohex:
      mov r0, r24
      swap r24
      rcall nibbletohex    ; convert hi digit
      mov r25, r24
      mov r24, r0
      ; fall into nibbletohex to convert lo digit

  ; convert lower nibble of byte to ascii hex char
  nibbletohex:
      andi r24, 0x0F
      cpi r24, 10
      brlo under10
      addi r24, 'A'-'0'
  under10:
      addi r24, '0'
      ret

I have no intention of learning Gimple and the internals of the gcc
back-end, but maybe someone from Atmel can fix this?