Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Abe wrote: Hi, pleased to meet you :) As some of you already know, at SARC we are working on a new "if converter" to help convert simple "if"-based blocks of code that appear inside loops into an autovectorizer-friendly form that closely resembles the C ternary operator ["c ? x : y"]. GCC already has such a converter, but it is off by default, in part because it is unsafe: if enabled, it can cause certain code to be transformed in such a way that it malfunctions even though the non-converted code worked just fine with the same inputs. The new converter, originally by my teammate Sebastian Pop, is safer [almost-always safe *]; we are working on getting it into good-enough shape that the always-safe transformations can be turned on by default whenever the autovectorizer is on.

* Always safe for stores, sometimes a little risky for loads: speculative loads might cause multithreaded programs with insufficient locking to fail due to writes by another thread being "lost"/"missed", even though the same program works OK "by luck" when compiled without if-conversion of loads. This risk comes mainly/only from what the relevant literature calls a "half hammock": an "if" with a "then" section but no "else" section [or effectively vice-versa, e.g. an empty "then" and a non-empty "else"]. In this case, e.g. "if (c) X[x] = Y[y];" with no attached "else" section is risky to fully if-convert in the event that the compiled code runs multithreaded and was not written with all the locking it really needs. Respectively, e.g. "if (c) ; /* empty ''then'' */ else X[x] = Y[y];".

For the unenlightened, can you outline the problem with this code sequence? (i.e. the expected transformation that makes it unsafe!?) I would hope your scratchpad patch would turn this into something like

  a1 = c ? &Y[y] : &scratch;
  temp = *a1;
  a2 = c ? &X[x] : &scratch;
  *a2 = temp;

which seems OK to me - so is the scratchpad approach going away?
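For readers following along, the scratchpad rewrite sketched above can be written out as plain C. This is only a hand-written, single-threaded sketch (the `scratch` variable and the two helper names are hypothetical); it shows the equivalence the transformation relies on, not the multithreaded hazard under discussion:

```c
#include <assert.h>

/* A dummy location the compiler would provide; harmless to read or
   write, never observed by the program. (Hypothetical.) */
static int scratch;

/* The original half hammock: load and store happen only when c. */
static void guarded(int c, int *X, int x, int *Y, int y)
{
    if (c)
        X[x] = Y[y];
}

/* The scratchpad form: both the load and the store become
   unconditional, but the not-taken side targets the scratchpad. */
static void scratchpad_version(int c, int *X, int x, int *Y, int y)
{
    int *a1 = c ? &Y[y] : &scratch;   /* select load address   */
    int temp = *a1;                   /* load is unconditional */
    int *a2 = c ? &X[x] : &scratch;   /* select store address  */
    *a2 = temp;                       /* store is unconditional */
}
```

Single-threaded, the two functions are observably equivalent; the residual risk Abe describes is that the now-unconditional load can race with another thread's store in under-locked code.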
(The problem that things might be read in a different order *across* the elements of a vector, I can see, but that belongs in the domain of the vectorizer itself, not if-conversion, I would think?) One of the reasons the new if converter has not yet been submitted for incorporation into GCC's trunk is that it still has some performance regressions WRT the old converter, and most of those are "true regressions", i.e. not just because the old converter was less safe and the additional safety is what is causing the loss, but rather because there is more work to do before the patch is ready. As of this writing, the new if converter sometimes tries to "convert" something that is already vectorizer-friendly, and in doing so it renders that code now-NOT-vectorizer-friendly. Can you give an example? My understanding was that the existing vectorizer bailed out pretty much straightaway if the number of basic blocks in the loop was not exactly 2 (for inner loops) or 5 (for outermost loops, i.e. containing exactly one inner loop)...that seems to rule out vectorization of *any* kind of conditional execution that the if-converter might convert? Thanks, Alan
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Abe wrote: In other words, the problem about which I was concerned is not going to be triggered by e.g. "if (c) x = ..." which lacks an attached "else x = ..." in a multithreaded program without enough locking just because 'x' is global/static. The only remaining case to consider is if some code being compiled takes the address of something thread-local and then "gives" that pointer to another thread. Even for _that_ extreme case, Sebastian says that the gimplifier will detect this "address has been taken" situation and do the right thing such that the new if converter also does the right thing. Great :). I don't understand much/anything about how gcc deals with thread-locals, but everything before that all sounds good... [Alan wrote:] Can you give an example? The test cases in the GCC tree at "gcc.dg/vect/pr61194.c" and "gcc.dg/vect/vect-mask-load-1.c" currently show the new if-converter "converting" something that's already vectorizer-friendly... > [snip] However, TTBOMK the vectorizer already "understands" that in cases where its input looks like: x = c ? y : z; ... and 'y' and 'z' are both pure [side-effect-free] -- including, but not limited to, they must be non-"volatile" -- it may vectorize a loop containing code like the preceding, ignoring for this particular instance the C mandate that only one of {y, z} be evaluated... My understanding is that any decision as to whether one or both of y or z is evaluated (when 'evaluation' involves doing any work, e.g. a load) has already been encoded into the gimple/tree IR. Thus, if we are to only evaluate one of 'y' or 'z' in your example, the IR will (prior to if-conversion) contain basic blocks and control flow that make us jump around the one that's not evaluated.
This appears to be the case in pr61194.c: prior to if-conversion, the IR for the loop in barX is:

  :
  # i_16 = PHI
  # ivtmp_21 = PHI
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  if (_9 != 0) goto ; else goto ;

  :
  iftmp.0_10 = z[i_16];
  goto ;

  :
  iftmp.0_11 = y[i_16];

  :
  # iftmp.0_2 = PHI
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0) goto ; else goto ;

  :
  goto ;

which clearly contains (unvectorizable!) control flow. Without -ftree-loop-if-convert-stores, if-conversion leaves this alone, and vectorization fails (i.e. the vectorizer bails out because the loop has >2 basic blocks). With -ftree-loop-if-convert-stores, if-conversion produces:

  :
  # i_16 = PHI
  # ivtmp_21 = PHI
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  iftmp.0_10 = z[i_16];  // <== here
  iftmp.0_11 = y[i_16];  // <== here
  iftmp.0_2 = _9 ? iftmp.0_10 : iftmp.0_11;
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0) goto ; else goto ;

  :
  goto ;

where I have commented the conditional loads that have become unconditional. (Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how the if-conversion phase converts loads, too - please correct me if I misunderstand (Richard?) ?!) This contains no control flow, and so is vectorizable. (This is all without your scratchpad patch, of course.) IOW this being vectorized, or not, relies upon the preceding if-conversion phase removing the control flow. HTH Alan
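For reference, the barX loop that produces the dump above looks roughly like the first function below (names and exact types are illustrative, not copied from the testcase); the second function hand-writes what -ftree-loop-if-convert-stores effectively does, making both loads unconditional and then selecting:

```c
#include <assert.h>
#define N 8

/* Branchy form, as in the source: only one of the two loads of
   z[i] / y[i] happens on each iteration. */
static void branchy(float *z, const float *x, const float *y,
                    const float *w)
{
    for (int i = 0; i < N; i++)
        z[i] = (x[i] > 0.0f && w[i] < 0.0f) ? z[i] : y[i];
}

/* If-converted form: both loads made unconditional, then a select;
   the store to z[i] was unconditional all along. */
static void if_converted(float *z, const float *x, const float *y,
                         const float *w)
{
    for (int i = 0; i < N; i++) {
        float t = z[i];   /* <== now unconditional */
        float e = y[i];   /* <== now unconditional */
        z[i] = (x[i] > 0.0f && w[i] < 0.0f) ? t : e;
    }
}
```

Both compute the same values; the difference is purely that the second has no control flow inside the loop body, which is what lets the vectorizer proceed.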
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Abe wrote: [snip] Predication of instructions can help to remove the burden of the conditional branch, but is not available on all relevant architectures. In some architectures that are typically implemented in modern high-speed processors -- i.e. with high core frequency, caches, speculative execution, etc. -- there is not full support for predication [as there is, for example, in 32-bit ARM] but there _is_ support for "conditional move" [hereinafter "cmove"]. If/when the ISA in question supports cmove to/from main memory, perhaps it would be profitable to use two cmoves back-to-back with opposite conditions and the same register [destination for load, source for store] to implement e.g. "temp = c ? X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]" etc. Even without cmove to/from main memory, two cmoves back-to-back with opposite conditions could still be used, e.g. [not for a real-world ISA]:

  load X[x] -> reg1
  load Y[y] -> reg2
  cmove c ? reg1 -> reg3
  cmove (!c) ? reg2 -> reg3

Or even better if the ISA can support something like:

  load X[x] -> reg1
  load Y[y] -> reg2
  cmove (c ? reg1 : reg2) -> reg3

However, this is a code-gen issue from my POV, and at the present time all the if-conversion work is occurring at the GIMPLE level. If anybody reading this knows how I could make the if converter generate GIMPLE that leads to code-gen that is better for at least one ISA...

Ramana writes that your patch for PR46029 generates more 'csel's (i.e. the second form you write) for AArch64: https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01721.html of course this says nothing about whether there is *some* other ISA that gets regressed! HTH Alan
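Abe's two-load-then-select sequence can be expressed in portable C as a mask-and-merge, which is the kind of branchless pattern targets with csel/cmov commonly lower to. This is only a sketch (the function name is hypothetical), and it assumes both loads are safe to perform:

```c
#include <assert.h>
#include <stdint.h>

/* Branchless "temp = c ? X[x] : Y[y]": do both loads, then merge
   with a mask - roughly the load/load/cmove sequence above. */
static int32_t select_nobranch(int c, const int32_t *X, int x,
                               const int32_t *Y, int y)
{
    int32_t r1 = X[x];                  /* load X[x] -> reg1 */
    int32_t r2 = Y[y];                  /* load Y[y] -> reg2 */
    int32_t mask = -(int32_t)(c != 0);  /* all-ones if c, else 0 */
    return (r1 & mask) | (r2 & ~mask);  /* cmove-like merge */
}
```

On AArch64 a compiler would typically emit this as two loads plus a csel; the point is that no branch is needed once both loads are unconditional.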
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Abe wrote: of course this says nothing about whether there is *some* other ISA that gets regressed! After finishing fixing the known regressions, I intend/plan to reg-test for AArch64; after that, I think I'm going to need some community help to reg-test for other ISAs. OK, I'm confused. When you write "making the new if-converter not mangle IR"...does "the new if-converter" mean your scratchpad fix to PR46029, or is there some other new if-conversion phase that you are still working on and haven't posted yet? If the latter, does this replace the existing tree-if-conv.c, or is it an extra stage before that? I haven't yet understood what you mean about "vectorizer-friendly" IR being mangled; is the problem that your new phase transforms IR that can currently be if-converted by the existing phase, into IR that can't? (Example?) Then I might (only "might", sorry!) be able to help... Cheers, Alan
Re: making the new if-converter not mangle IR that is already vectorizer-friendly
Abe wrote: Ideally, I/we fix the above problem -- and the rest of the regressions in the new if converter -- Ok, what are these regressions? You say you've tested on x86_64 and only ifcvt-18.c fails (hence your xfail), so how does one find them? In particular, I remember that "result = condition ? array1[index] : array2[maybe the same index, maybe not]" is being converted too early IMO. IOW, somewhere in GCC an array dereference is being considered as either impure, too-expensive, or both. "array[index]" in C [not in C++!: operator overloading] AFAIK is always pure and is always low-cost whenever the index expression is pure and low-cost. Well, any array access could introduce an extra fault (unless it's dominated by another equivalent (read/write) access)...? --Alan
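Alan's "extra fault" point can be made concrete with a pointer rather than an array (the helper name is hypothetical): the guarded form is safe for any input, while an if-converted form that hoists the load out of the guard would dereference a pointer the original never touched.

```c
#include <assert.h>
#include <stddef.h>

/* Safe, branchy form: the load happens only when p is non-NULL. */
static int guarded_load(const int *p, int fallback)
{
    return p ? *p : fallback;
}

/* An if-converted form, "int t = *p; return p ? t : fallback;",
   would segfault whenever p == NULL - a fault the original code
   could never hit. The same reasoning applies to a[index] when
   index is only in bounds under the guard: the access is pure in
   the sense of having no side effects, but not fault-free. */
```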
Re: replacing multiplication by bit shifts, etc
Hi! All compilers can replace a multiplication operation with bit shifts and the LEA x86 instruction (if the current target arch is x86). Can I ask where in GCC this happens? I can't find it in the source code. See gcc/expmed.c, search for "choose_mult_variant". HTH Alan
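To illustrate the kind of rewrite choose_mult_variant costs out (this is a hand-written sketch of the transformation, not GCC's code): a multiply by a constant such as 9 becomes a shift plus an add, which on x86 can often be folded into a single LEA.

```c
#include <assert.h>
#include <stdint.h>

/* x * 9 rewritten as shift+add: x*8 + x.  Unsigned arithmetic so
   the identity holds modulo 2^32 for every input. */
static uint32_t mul9_shift_add(uint32_t x)
{
    return (x << 3) + x;
}
```

GCC's expmed.c machinery generalizes this, searching shift/add/subtract recipes for an arbitrary constant and comparing their cost against the target's multiply cost.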
Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}
Abe wrote: Dear all, Overall, I think the WIP new if converter is holding up relatively well, but there is obviously opportunity to do better, at least if the numbers mean what they look like they mean, i.e. the old converter's code was fully OK and so is the new one's. By "fully OK" I mean e.g. no crashing bugs were introduced by the conversion. In the following, all the integers over 1000 are loops-vectorized counts. Interesting, thanks. For what kind of architecture are these - specifically: with/out masked/gathering loads/stores? Cheers, Alan
Missing dependency in Ada build?
I've been erratically seeing my Ada builds (bootstrap or even with --disable-bootstrap) fail on an arm-none-linux-gnueabihf system, with errors like this:

g++ -fno-PIE -c -DIN_GCC_FRONTEND -g -O2 -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common -DHAVE_CONFIG_H -I. -Iada -I../../gcc/gcc -I../../gcc/gcc/ada -I../../gcc/gcc/../include -I../../gcc/gcc/../libcpp/include -I/home/alalaw01/boot_f800d24b/./gmp -I/home/alalaw01/gcc/gmp -I/home/alalaw01/boot_f800d24b/./mpfr -I/home/alalaw01/gcc/mpfr -I/home/alalaw01/gcc/mpc/src -I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber -I../../gcc/gcc/../libbacktrace -I/home/alalaw01/boot_f800d24b/./isl/include -I/home/alalaw01/gcc/isl/include -o ada/trans.o -MT ada/trans.o -MMD -MP -MF ada/.deps/trans.TPo ../../gcc/gcc/ada/gcc-interface/trans.c
../../gcc/gcc/ada/gcc-interface/trans.c:67:19: fatal error: sinfo.h: No such file or directory
 #include "sinfo.h"
                   ^
compilation terminated.

(and similarly on a bunch of other files). However, other times builds under "identical" conditions succeed. If I try a single-threaded build (make -j1), I find it sometimes hangs on this line:

(cd ada/bldtools/sinfo; gnatmake -q xsinfo ; ./xsinfo sinfo.h)

which produces output:

  =
  Check for field name consistency OK
  Check for function consistency OK
  Check for missing functions OK
  Check for set procedure consistency OK
  Check for missing set procedures OK
  Check pragma Inlines are all for existing subprograms OK
  Check no pragma Inlines were omitted OK
  Check references in functions in body OK
  Check for missing functions in body OK
  Check Set procedures in body OK
  Check for missing set procedures in body OK
  All tests completed successfully, no errors detected
  =

...but then sits forever at 100% of the CPU.
Looking at build/gcc/ada/bldtools/sinfo/sinfo.h, the file has been partially written (but not flushed). In a successful build, that step terminates (no further terminal output), the file is flushed, and next comes mv -f ada/bldtools/sinfo/sinfo.h ada/sinfo.h from where the g++ steps pick it up. So it looks like there are two problems here: (1) xsinfo not terminating; (2) a missing dependency, that the g++ steps should depend upon the step producing sinfo.h (i.e. the mv that comes after problem (1)) I'm going to look at (1), but I'm hoping someone more familiar with the Ada build system might be able to help with regards to (2)? Thanks for any help, Alan
Re: Missing dependency in Ada build?
On 07/10/15 18:33, Eric Botcazou wrote: So it looks like there are two problems here: (1) xsinfo not terminating; (2) a missing dependency, that the g++ steps should depend upon the step producing sinfo.h (i.e. the mv that comes after problem (1)) I'm going to look at (1), but I'm hoping someone more familiar with the Ada build system might be able to help with regards to (2)? Try to replace $(GNAT1_ADA_OBJS) with $(GNAT1_OBJS) in the line:

  # When building from scratch we don't have dependency files, the only thing
  # we need to ensure is that the generated files are created first.
  $(GNAT1_ADA_OBJS) $(GNATBIND_OBJS): | $(ada_generated_files)

in ada/gcc-interface/Make-lang.in. Yes, that fixes the dependency problem; the build waits for the xsinfo to terminate. This at least makes it easy to Ctrl-C and try again. Are we aware of any reason why *not* to make that change? The hang in xsinfo appears to be miscompilation when the host compiler is gnat-4.8 (on ARM), whereas gnat-4.6 succeeds. I'm going to try again with later versions (4.9/5.0/etc.)... Thanks, Alan
Re: Missing dependency in Ada build?
On 12/10/15 10:58, Eric Botcazou wrote: Yes, that fixes the dependency problem; the build waits for the xsinfo to terminate. This at least makes it easy to Ctrl-C and try again. Are we aware of any reason why *not* to make that change? You may have noticed that I had made it before you posted this message. :-) Ah, right. You give me far too much credit ;) The hang in xsinfo appears to be miscompilation when the host compiler is gnat-4.8 (on ARM), whereas gnat-4.6 succeeds. I'm going to try again with later versions (4.9/5.0/etc.)... Thanks. This could be worth documenting somewhere then. I can confirm that gnat-4.9 works OK (as well as gnat-4.6). I wonder whether bugzilla is still the right place to document this, given 4.8 is now end-of-lifed...do you know of anywhere appropriate? Thanks, Alan
Tree CONSTRUCTORs and Ada array indices
Hi, I'm having fun with arrays in Ada ;) and wondering if anyone can tell me what's right here. Ada ACATS c64106a contains code with an array type whose indices run from -1 to 1. (That is, three elements.) In GCC this gets represented as an ARRAY_TYPE, whose TYPE_DOMAIN is a 32-bit signed integer type; the TYPE_MAX_VALUE of this is 1 (as a size_int), the TYPE_MIN_VALUE of this is -1 as a size_int, i.e. 4294967295. I believe using (unsigned 32-bit) size_int for min/max is correct (?) despite the domain being a signed type. An array of this type is then initialized with a CONSTRUCTOR. The CONSTRUCTOR has three elements, with INTEGER_CST indices, being in order 2^32-1, 0, 1 (all of sizetype). I substitute the CONSTRUCTOR into an ARRAY_REF [4294967295]{lb: 4294967295 sz: 2}, i.e. that picks the element with index (sizetype)2^32-1. fold() in fold-const.c then fails to fold, because it does binary search for 2^32-1 through the constructor elements, and first checks against the middle CONSTRUCTOR_ELT, which has index 0. So, where's the bug here? Should all these indices have sizetype? Should fold-const.c's binary search respect the TYPE_DOMAIN of the type of the CONSTRUCTOR being searched? (Lots of build_int_cst to reinterpret the CONSTRUCTOR_ELT indices, and/or the index being searched for, in said TYPE_DOMAIN ?) Advice appreciated! Thanks, Alan
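The fold-const.c failure mode can be sketched in miniature: the CONSTRUCTOR_ELT indices, in the order of the signed domain -1, 0, 1, read as the unsigned sizetype values 4294967295, 0, 1 - which are not sorted - so a binary search keyed on the raw unsigned values misses the first element, while a search that compares in the (signed) TYPE_DOMAIN finds it. The helper names below are hypothetical; the cast back to int32_t assumes the usual two's-complement behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Binary search treating indices as raw unsigned sizetype values.
   Over {4294967295, 0, 1} this sees an unsorted array and fails. */
static int find_unsigned(const uint32_t *idx, int n, uint32_t key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (idx[mid] == key) return mid;
        if (idx[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

/* Same search, but reinterpreting each index in the signed domain
   first, so the elements compare as {-1, 0, 1}: sorted, and found. */
static int find_signed(const uint32_t *idx, int n, uint32_t key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int32_t m = (int32_t)idx[mid], k = (int32_t)key;
        if (m == k) return mid;
        if (m < k) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}
```

This is essentially Alan's second option: making the search respect the TYPE_DOMAIN of the CONSTRUCTOR's type rather than comparing raw sizetype values.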
RS6000/PowerPC -mpaired
Can anyone give me any pointers on how to use the -mpaired option ("PowerPC 750CL paired-single instructions") ? I'm trying to get it to use the reduc_s{min,max,plus}_v2sf patterns in gcc/config/rs6000/paired.md, so far I haven't managed to find a configuration that supports loading a vector of floats but that doesn't also have superior altivec/v4sf support... Thanks, Alan
Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
On Tue, Oct 27, 2015 at 2:39 PM, Richard Biener wrote: On Mon, Oct 26, 2015 at 6:59 AM, sameera wrote: Richard, we have defined the input language for convenience in the prototype implementation. However, we will be using GIMPLE as our IR. As per the grammar of our tree, a p-tree denotes the permute order associated with the statement, whereas a c-tree is actually the GIMPLE instruction, which performs the compute operation. I tried looking at the structures used in SLP; however, they cannot be used as they are. The main difference between the current SLP implementation in GCC and our representation is that the permute order in SLP is part of the tree node in current GCC, whereas in our representation the permute order is represented as an independent tree node. Hence, I have created a new tree structure for our pass, which will create p-tree nodes for the permute order, and c-tree nodes which point to the appropriate gimple statement. Yes, that's the whole purpose - get the vectorizer (and SLP) a better data structure which is explicit about permutes. I somewhat miss an idea on how to deal with irregular SLP cases - that is, does the AST model each (current) SLP node as a single statement, or does it model individual vector elements (so you can insert no-op compensation code to make the operations match)? Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by 1. The pass structure we have for our optimization is as follows:

- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of the target-aware reordering-instruction selection algorithm)
  - Loop transformation (cost-aware gimple code generation)

We might need some optimizations to be performed on the loop body before we start our optimization for reordering-instruction selection, so that the input program can be transformed to fit into our restricted loop structure.
These transformations can be performed by the restricted loop construction subpass. E.g. multi-dimensional arrays, nested loops, reduction chains, etc. can be supported by transforming the input loop. The case that you have mentioned can be transformed at this level. As for implementing the whole thing in GCC, I recently spent some more time thinking and came to the conclusion that the path to the fastest improvements in GCC code-gen would be to build the new AST after the current analysis phase finished, do the optimization and drive code-generation off the new AST. Thus keep things like data-ref and dependence analysis as is, as well as group analysis and SLP tree construction. Build the AST from the vectorizer "IL" at the point vect_transform_loop / vect_slp_transform_bb is called. The current pass structure that we have designed does exactly that. The only difference is that we are planning to use the transform phase as a co-process of our transform pass, rather than reconstructing an AST from our representation to pass to vect_transform_loop. More and more of the rest of the vectorizer code can then be "slowly" moved to work on the AST and AST construction be moved earlier. So I think an incremental approach is much more likely to get us some benefit, and I'm wondering how much of the benefit here stems from which bit, and how many of these changes can be broken off. For example, making permutes into explicit operations on the SLP tree, and adding a pass that tries to push them up/down the tree into each other, might be done on the existing codebase, and would be a significant win in itself (bringing some/much(?) of the improvements in interleaving code of your proposal). I'm not really sure how much of the benefit stems from what, though - e.g. does reducing vector length down to fit the target machine only after the other steps bring any benefits besides raising abstraction? (Which certainly can be a benefit!) AFAICT this could be done independently of other changes?
Another limitation of the present SLP structure is it's not understanding sharing (until later CSE, hence costing wrongly); it would be good to fix this (making the SLP tree into a DAG, and potentially reusing subtrees by adding permutes). This is a step we'd like to take anyway, but might tie in with your "Bottom-up commoning of p-node subtrees". And would help with (BB as well as loop) cost calculations. Are there any really fundamental differences that we cannot overcome in steps? ( / are there any steps we shouldn't take in isolation, because we'd have to 'throw them away' later?) Alan
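Richard's irregular-SLP example earlier in the thread can be made concrete in C (a hand-written sketch, not vectorizer output): the two statements only form a uniform group once the plain copy is treated as a multiply by 1, the inserted no-op compensation.

```c
#include <assert.h>
#define N 8

/* Irregular pair: one lane multiplies, the other just copies. */
static void original(int *a, const int *b)
{
    for (int i = 0; i < N; i += 2) {
        a[i]     = b[i] * 3;
        a[i + 1] = b[i + 1];         /* no multiply here */
    }
}

/* Compensated form: both lanes do the same operation, with a
   per-lane constant vector {3, 1} - now an SLP-able group. */
static void compensated(int *a, const int *b)
{
    static const int m[2] = { 3, 1 };  /* 1 is the inserted no-op */
    for (int i = 0; i < N; i += 2)
        for (int k = 0; k < 2; k++)
            a[i + k] = b[i + k] * m[k];
}
```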
combine_simplify_rtx (doesn't) commute XOR and ASHIFTRT ???
I'm looking at git commit ea1ac559 / svn r76965, January 2004 (archive: https://gcc.gnu.org/ml/gcc-patches/2004-01/msg03406.html), which prevents (ashiftrt (xor A C1) C2) from being commuted to (xor (ashiftrt A C2) (ashiftrt C1 C2)) and wondering if anyone can explain to me what's wrong with this transformation. Having worked through all four cases of A and C1 positive and negative, it seems to me that the extra bits 'fed in' to the most-significant end of the result are the same either way (i.e. the XOR of the sign bits of A and C1). Re-enabling this transformation improves rtx simplification and codegen on AArch64, for one. (I don't have easy access to an Alpha/VMS machine on which to test Ada s-crc32.adb.) Thanks for any info, Alan
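Alan's four-case argument can be spot-checked mechanically. The sketch below assumes >> on a negative int32_t is an arithmetic shift (true of GCC and common ABIs, though implementation-defined in C): the bits shifted into the top of (a ^ c1) >> s are sign(a) ^ sign(c1), which is exactly what the XOR of the two separately-shifted values produces, so the identity holds for all sign combinations.

```c
#include <assert.h>
#include <stdint.h>

/* Check ((a ^ c1) >> s) == ((a >> s) ^ (c1 >> s)) for one triple.
   Assumes arithmetic right shift of negative values. */
static int commutes(int32_t a, int32_t c1, int s)
{
    return ((a ^ c1) >> s) == ((a >> s) ^ (c1 >> s));
}
```

This is of course only empirical evidence, not a proof that the 2004 change was unnecessary in all modes; the mode-mismatch concern raised in the follow-up is a separate question.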
Re: combine_simplify_rtx (doesn't) commute XOR and ASHIFTRT ???
Thanks for the quick response - the possibility of different modes seems plausible, as I'm not sure what would happen in that case. (Although I'm not entirely sure how that case would occur - something that may have changed since the original.) I haven't had much luck in getting a working toolchain from 2004, so I've gone with the approach of preserving the check if result_mode != shift_mode, as per patch https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02426.html . Thanks! --Alan Richard Kenner wrote: and wondering if anyone can explain to me what's wrong with this transformation. Having worked through all four cases of A and C1 positive and negative, it seems to me that the extra bits 'fed in' to the most-significant end of the result are the same either way (i.e. the XOR of the sign bits of A and C1). I haven't heard from Kenner in a while, but you could always try to contact him directly and see if he can recall the issue. It was a long time ago though... I haven't gone anywhere ... But indeed ten years is a long time and I don't have any recollection of this at all. The above analysis seems right, but this isn't the sort of thing I'd have done if there weren't some sort of bug. Perhaps it's only relevant if result_mode != shift_mode?
dom1 prevents vectorization via partial loop peeling?
Hi, I've been experimenting with some small testcases with the aim of getting the vectorizer to work on more loops. I started by compiling at -O3 the following testcase:

#define N 32
int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int m = (a[i] & i) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

After copyrename3, immediately prior to dom1, the loop body looks like:

  :
  :
  # i_11 = PHI
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0) goto ; else goto ;

  :
  :
  # m_2 = PHI <5(4), 4(3)>
  _7 = m_2 * _5;
  b[i_11] = _7;
  i_9 = i_11 + 1;
  if (i_9 != 32) goto ; else goto ;

  :
  goto ;

  :
  return;

dom1 then peeled part of the first loop iteration, producing:

  :
  goto ;

  :
  # i_11 = PHI
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0) goto ; else goto ;

  :
  :
  # m_14 = PHI <5(4), 4(3)>

  :
  # m_2 = PHI
  # _15 = PHI <_5(5), _10(8)>
  # i_16 = PHI
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  if (i_9 != 32) goto ; else goto ;

  :
  return;

  :
  # i_1 = PHI <0(2)>
  _10 = a[i_1];
  _3 = i_1 & _10;
  goto ;

which following ivcanon, had been simplified down to:

  :
  _10 = a[0];
  goto ;

  :
  _5 = a[i_9];
  _6 = _5 & i_9;
  if (_6 != 0) goto ; else goto ;

  :
  :
  # m_14 = PHI <5(3), 4(4)>

  :
  # m_2 = PHI
  # _15 = PHI <_5(5), _10(2)>
  # i_16 = PHI
  # ivtmp_13 = PHI
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  if (ivtmp_3 != 0) goto ; else goto ;

  :
  return;

tree-if-conversion wouldn't deal with such a loop, and so the control flow made the vectorizer bail out too, resulting in scalar code (on both x86_64 and AArch64). I am currently testing a patch to tree-if-conv.c that makes it work on such a CFG, producing:

  :
  _10 = a[0];
  goto ;

  :
  :
  # m_2 = PHI
  # _15 = PHI <_5(3), _10(2)>
  # i_16 = PHI
  # ivtmp_13 = PHI
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  _5 = a[i_9];
  _6 = _5 & i_9;
  m_14 = _6 != 0 ? 5 : 4;
  if (ivtmp_3 != 0) goto ; else goto ;

  :
  return;

However, this still fails to vectorize: the PHI nodes at the start of bb 4 prevent this (in vect_analyze_scalar_cycles: they look a bit like reductions, but aren't).
In contrast, a slightly-different testcase:

#define N 32
int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int cond = (a[i] & i) ? -1 : 0;  // extra variable here
      int m = (cond) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

after copyrename3, just before dom1, is only slightly different:

  :
  :
  # i_15 = PHI
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0) goto ; else goto ;

  :
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32) goto ; else goto ;

  :
  goto ;

  :
  return;

  :
  goto ;

with bb6 being out-of-line at the end of the function, rather than bb4 falling through from just above bb5. However, this prevents dom1 from doing the partial peeling, and dom1 only changes the "goto bb7" into a "goto bb3":

  :
  :
  # i_15 = PHI
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0) goto ; else goto ;

  :
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32) goto ; else goto ;

  :
  return;

  :
  goto ;

Existing tree_if_conversion deals with this just fine, feeding into the vectorizer:

  :
  :
  # i_15 = PHI
  # ivtmp_12 = PHI
  _6 = a[i_15];
  _7 = _6 & i_15;
  m_3 = _7 != 0 ? 5 : 4;
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  ivtmp_1 = ivtmp_12 - 1;
  if (ivtmp_1 != 0) goto ; else goto ;

  :
  goto ;

which is vectorized OK (the only PHI is the induction variable i_15, a true scalar cycle). So, besides the if conversion, I see two issues here: (1) dom1 should really, in the second case, perform the same partial peeling that it does in the first testcase, if that is what it thinks is desirable. (Of course, we might want to fix that only later, as ATM that'd take us backwards). (2) somehow, gcc should vectorize loops that have been partially peeled in that way.
The only way I can really see, is to partially peel the same portion of the first (vectorization factor - 1) loop iterations too, so we can do a PHI of a whole vector at a time, and so I'm wondering if anyone can give me any pointers here - am I barking up the right tree - and is it reasonable to persuade existing vectorizer loop-peeling code (e.g. for alignment) to do this for us too, or would anyone recommend a different avenue? Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm inclined to think that if it's doing such optimizations, it's for a good reason ;) I guess there'll be other times where we *cannot* do partially peeling of later iterations... Thoughts? Thanks, Alan
Re: dom1 prevents vectorization via partial loop peeling?
Ajit Kumar Agarwal wrote: -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Biener Sent: Tuesday, April 28, 2015 4:12 PM To: Jeff Law Cc: Alan Lawrence; gcc@gcc.gnu.org Subject: Re: dom1 prevents vectorization via partial loop peeling? On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law wrote: On 04/27/2015 10:12 AM, Alan Lawrence wrote: After copyrename3, immediately prior to dom1, the loop body looks like: : : # i_11 = PHI _5 = a[i_11]; _6 = i_11 & _5; if (_6 != 0) goto ; else goto ; : : # m_2 = PHI <5(4), 4(3)> _7 = m_2 * _5; b[i_11] = _7; i_9 = i_11 + 1; if (i_9 != 32) goto ; else goto ; : goto ; : return; dom1 then peeled part of the first loop iteration, producing: Yup. The jump threading code realized that if we traverse the edge 2->3, then we'll always traverse 3->5. The net effect is like peeling the first iteration because we copy bb3. The original will be reached via 7->3 (it, loop iterating), the copy via 2->3' and 3' will have its conditional removed and will unconditionally transfer control to bb5. This is a known problem, but we really don't have any good heuristics for when to do this vs when not to do it. Ah, yes, I'd not realized this was connected to the jump-threading issue, but I see that now. As you say, the best heuristics are unclear, and I'm not keen on trying *too hard* to predict what later phases will/won't do or do/don't want...maybe if there are simple heuristics that work, but I would aim more at making later phases work with what(ever) they might get??? One (horrible) possibility that I will just throw out (and then duck), is to do something akin to tree-if-conversion's "gimple_build_call_internal (IFN_LOOP_VECTORIZED, " ... In contrast, a slightly-different testcase: >>> [snip] I would have still expected it to thread 2->3, 3->6->4 Ok, I'll look into that. 
(1) dom1 should really, in the second case, perform the same partial peeling that it does in the first testcase, if that is what it thinks is desirable. (Of course, we might want to fix that only later, as ATM that'd take us backwards). Please file a BZ. It could be something simple, or we might be hitting one of Zdenek's heuristics around keeping overall loop structure. Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm inclined to think that if it's doing such optimizations, it's for a good reason ;) I guess there'll be other times where we *cannot* do partial peeling of later iterations... It's an open question -- we never reached any kind of conclusion when it was last discussed with Zdenek. I think the fundamental issue is we can't really predict when threading the loop is going to interfere with later optimizations or not. The heuristics we have are marginal at best. The one thought we've never explored was re-rolling that first iteration back into the loop in the vectorizer. Yeah, there is that ;). So besides trying to partially-peel the next N iterations, the other approach - that strikes me as sanest - is to finish (fully-)peeling off the first iteration, and then to vectorize from then on. In fact the ideal (I confess I have no knowledge of the GCC representation/infrastructure here) would probably be for the vectorizer (in vect_analyze_scalar_cycles) to look for a point in the loop, or rather a 'cut' across the loop, that avoids breaking any non-cyclic use-def chains, and to use that as the loop header. That analysis could be quite complex tho ;)...and I can see that having peeled the first 1/2 iteration, we may then end up having to peel the next (vectorization factor - 1/2) iterations too to restore alignment! whereas with rerolling ;)...is there perhaps some reasonable way to keep markers around to make the rerolling approach more feasible??? Well.
In this case we hit

>> /* If one of the loop header's edge is an exit edge then do not
>>    apply if-conversion. */
>> FOR_EACH_EDGE (e, ei, loop->header->succs)
>>   if (loop_exit_edge_p (loop, e))
>>     return false;

which is simply because even after if-conversion we'll at least end up with a non-empty latch block which is what the vectorizer doesn't support. DOM rotated the loop into this non-canonical form. Running loop header copying again would probably undo this. So I've just posted https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html which fixes this limitation of if-conversion. As I first wrote though, the vectorizer still fails, because the PHI nodes incoming to the loop header are neither reductions nor inductions. I'll see if I can run loop header copying again, as you suggest.
Re: dom1 prevents vectorization via partial loop peeling?
Richard Biener wrote: Well. In this case we hit

  /* If one of the loop header's edge is an exit edge then do not
     apply if-conversion. */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
    if (loop_exit_edge_p (loop, e))
      return false;

which is simply because even after if-conversion we'll at least end up with a non-empty latch block which is what the vectorizer doesn't support. DOM rotated the loop into this non-canonical form. Running loop header copying again would probably undo this.

Indeed - header copying causes the loop to be if-converted, and then actually vectorized, too. (Well, I had to make do_while_loop_p a bit more aggressive in saying things weren't do-while loops.) I merely moved pass_ch to immediately after pass_dominator, a few stages later; I have also tried cloning it, but do you foresee any problems with just moving it? (It passes bootstrap; I'm running the full testsuite atm.) Cheers, Alan