Re: s390: SImode pointers vs LR
On 05/30/2015 02:57 AM, DJ Delorie wrote: > In config/s390/s390.c we accept addresses that are SImode: > > if (!REG_P (base) > || (GET_MODE (base) != SImode > && GET_MODE (base) != Pmode)) > return false; > > However, there doesn't seem to be anything in the s390's opcodes that > masks the top half of address registers in 64-bit mode, the SImode > convention seems to just be a convention for addresses in the first > 4Gb. > > So... what happens if gcc uses a subreg to load the lower half of a > register (via LR), leaving the upper half with random bits in it, then > uses that register as an address? I could see no code that checked > for this, and I have a fairly large and ungainly test case that says > it breaks :-( > > My local solution was to just disallow "SImode" as an address in > s390_decompose_address, which forces gcc to do an explicit SI->DI > conversion to clear the upper bits, and it seems to work, but I wonder > if it's the ideal solution... We also have address style operands which are used as shift count operands but never as addresses. Accepting SImode there was necessary to prevent reload from messing things up: https://gcc.gnu.org/ml/gcc-patches/2006-03/msg01495.html I don't know if this is still necessary though. -Andreas-
Re: Better info for combine results in worse code generated
On Mon, Jun 01, 2015 at 11:33:18AM +0930, Alan Modra wrote: > Unifying andsi_mask with anddi_mask, and the fact that constraints for > const_int see VOIDmode rather than the operand mode is why we get > rldicr rather than rlwinm. Easily fixed by separating the si/di > patterns, and with a little more work I may even be able to keep them > together. Maybe just swapping T to be before S will do what you want, already? > In and3 expander I think you want the following since > and64_2_operand covers the extra double-rotate cases, not all DImode. > > - if ((mode == DImode && !and64_2_operand (operands[2], mode)) > - || (mode != DImode && !and_operand (operands[2], mode))) > + if (!and_operand (operands[2], mode) > + && (mode != DImode || !and64_2_operand (operands[2], > mode))) and64_2_operand includes all of and_operand. I agree it is a mess. > In and3_imm_mask_dot and and3_imm_mask_dot2. Typo? > - && any_mask_operand (operands[2], mode)" > + && !any_mask_operand (operands[2], mode)" Thinko; that whole line should just be removed. We prefer e.g. "rlwinm" over "andi.", but "andi." over "rlwinm.". I'll do a patch. > Seems to > me we should omit !logical_const_operand from those insn predicates. I tried to have no patterns overlap (where it matters at all); this helps preserve your sanity if you try to move patterns around in the .md files (first match wins). > > I don't think it is a good idea to optimise code based on assumptions > > of what SImode SETs will do to the dest seen as DImode, without making > > those assumptions explicit in the RTL. > > I agree. Ah, good to hear. > Do you intend to I don't have firm, short-term plans to remove ancient design, no. > get rid of WORD_REGISTER_OPERATIONS, rs6000 should not define it. What e.g. does it mean for mullw? Or, worse, mulhw? Pretty much anything with "w" in its name is problematic. [ What does this macro mean for rotates or right shifts at all? The docs don't say. ] > POINTERS_EXTEND_UNSIGNED, Not related to this. > PUSH_ROUNDING, I would love to see this disappear. But, not related, for rs6000 anyway. > SHORT_IMMEDIATES_SIGN_EXTEND, Not sure about this one. > and LOAD_EXTEND_OP? ;-) With this macro you can just "opt out" for the hard cases. In any case: I am working on a patch that unifies all rotate-and-mask patterns, fixes the rtx_cost for those, and gets rid of 2rld completely. If you really want to disallow most SImode ANDs, it won't be hard to add that, much easier than with the current code. Segher
Re: s390: SImode pointers vs LR
On 06/01/2015 06:23 AM, Andreas Krebbel wrote: On 05/30/2015 02:57 AM, DJ Delorie wrote: In config/s390/s390.c we accept addresses that are SImode: if (!REG_P (base) || (GET_MODE (base) != SImode && GET_MODE (base) != Pmode)) return false; However, there doesn't seem to be anything in the s390's opcodes that masks the top half of address registers in 64-bit mode, the SImode convention seems to just be a convention for addresses in the first 4Gb. So... what happens if gcc uses a subreg to load the lower half of a register (via LR), leaving the upper half with random bits in it, then uses that register as an address? I could see no code that checked for this, and I have a fairly large and ungainly test case that says it breaks :-( My local solution was to just disallow "SImode" as an address in s390_decompose_address, which forces gcc to do an explicit SI->DI conversion to clear the upper bits, and it seems to work, but I wonder if it's the ideal solution... We also have address style operands which are used as shift count operands but never as addresses. Accepting SImode there was necessary to prevent reload from messing things up: https://gcc.gnu.org/ml/gcc-patches/2006-03/msg01495.html I don't know if this is still necessary though. Looks like a hack to me. Sadly, no testcase in that BZ, though presumably since it was a bootstrap failure one could checkout a development tree from that era and see the failure. jeff
Regression: template function accessing a temporary through a lambda
Hi folks, the following C++ snippet used to compile, but doesn't anymore with current gcc trunk. Hope this helps! Thanks, Sebastien Alaiwan - // simple.cpp int getValueOfSix() { return 6; } template void f() { auto const c = getValueOfSix(); auto lambda = [&] () { auto a = c; auto b = c; }; } void g() { f<0>(); } - Here are the errors I get: $ g++ -c -std=c++11 simple.cpp simple.cpp: In instantiation of ‘f():: [with int A = 0]’: simple.cpp:12:19: required from ‘struct f() [with int A = 0]::’ simple.cpp:16:3: required from ‘void f() [with int A = 0]’ simple.cpp:21:8: required from here simple.cpp:15:14: error: redeclaration of ‘const int& c’ auto b = c; ^ simple.cpp:14:14: note: ‘const int& c’ previously declared here auto a = c; ^ simple.cpp: In instantiation of ‘void f() [with int A = 0]’: simple.cpp:21:8: required from here simple.cpp:12:17: sorry, unimplemented: non-trivial designated initializers not supported auto lambda = [&] () ^
Strange insn rtx is emitted in a custom backend
While working on a custom backend for quite a standard RISC-like architecture, I defined 'high'/'lo_sum' patterns as follows: (define_insn "mov_high" [(set (match_operand:SI 0 "register_operand" "=r") (high:SI (match_operand:SI 1 "immediate_operand" "i")))] "" "mvup #high(%1), %0" [(set_attr "insn" "alu") (set_attr "length" "4")] ) (define_insn "add_low" [(set (match_operand:SI 0 "register_operand" "=r") (lo_sum:SI (match_operand:SI 1 "register_operand" "0") (match_operand:SI 2 "symbolic_operand" "")))] (match_operand:SI 2 "immediate_operand" "i")))] "" "add3 %0, #low(%2), %0" [(set_attr "insn" "alu") (set_attr "length" "4")] ) Despite having the patters above (along with usual 'plus' patterns) I'm getting the following rtx emitted at the expansion stage: (insn 53 52 63 2 (set (reg:SI 62) (const:SI (plus:SI (symbol_ref:SI [flags 0x6]) (const_int 64 [0x40] -1 (nil)) which can not be recognized unless I define a pattern like this: (define_insn "*make_gcc_happy" [(set (match_operand:SI 0 "register_operand" "=r") (const:SI (plus:SI (match_operand:SI 1 "symbolic_operand" "") (match_operand:SI 2 "const_int_operand" ""] "" "mvup #high(%1), %0 \\n add3 %0, #low(%1+%2), %0" [(set_attr "insn" "alu") (set_attr "length" "8")] ) I don't understand why the rtx above is emitted as a whole single instruction, without using the high/lo_sum patterns. I'd very appreciate any help, hints or pointing me to a possibly missing/erroneous spots. Sincerely, Lev.
Re: Strange insn rtx is emitted in a custom backend
On 06/01/2015 12:51 PM, Lev Yudalevich wrote: While working on a custom backend for quite a standard RISC-like architecture, I defined 'high'/'lo_sum' patterns as follows: (define_insn "mov_high" [(set (match_operand:SI 0 "register_operand" "=r") (high:SI (match_operand:SI 1 "immediate_operand" "i")))] "" "mvup #high(%1), %0" [(set_attr "insn" "alu") (set_attr "length" "4")] ) (define_insn "add_low" [(set (match_operand:SI 0 "register_operand" "=r") (lo_sum:SI (match_operand:SI 1 "register_operand" "0") (match_operand:SI 2 "symbolic_operand" "")))] (match_operand:SI 2 "immediate_operand" "i")))] "" "add3 %0, #low(%2), %0" [(set_attr "insn" "alu") (set_attr "length" "4")] ) Despite having the patters above (along with usual 'plus' patterns) I'm getting the following rtx emitted at the expansion stage: (insn 53 52 63 2 (set (reg:SI 62) (const:SI (plus:SI (symbol_ref:SI [flags 0x6]) (const_int 64 [0x40] -1 (nil)) which can not be recognized unless I define a pattern like this: You should debug precisely where that came from. You're missing something in your backend to tell GCC that such constants are not valid. You should probably look at how expansion occurs for your tests on other ports that make heavy use of high/lo_sum. sparc & hppa come immediately to mind, but I'm sure there's others. jeff
Re: confirm subscribe to gcc@gcc.gnu.org
Static PIE support in GCC
A feature I've been interested in getting upstream in GCC for a while now is support for producing static-linked PIE executables for Linux. In the model I'm working with, static PIE executables are ET_DYN format with no PT_INTERP, and are intended to contain only relative type relocations (no symbol references). A custom crt1 start file named rcrt1.o is responsible for processing these relative relocations before passing execution to the libc entry point code. I have as part of musl libc an implementation of rcrt1.o (for all targets musl presently supports) that's working for this model. The way it works is completely analogous to what OpenBSD has done in their fork of GCC (see http://www.openbsd.org/papers/asiabsdcon2015-pie-slides.pdf), but aside from adopting the 'r' prefix for crt that they used, which I did for some level of consistency, my work on static PIE has been completely independent of the development of this feature in OpenBSD. While OpenBSD's motivations for static PIE seem to be purely security focused, I'm also interested in static PIE as a form of executable that can be used on NOMMU targets. My motivation for doing the relocations in the start file, rather than with an external program interpreter, is both to reduce runtime cost on very small systems, and to make deployment easier. For musl users, one of the main benefits of static linking is that the resulting binary can be run on systems without any additional runtime files installed. Unfortunately, producing static PIE binaries with GCC is not as simple as passing -static -pie when linking. The linker arguments I've been using to test this so far have been: -shared -Wl,-Bstatic -Wl,-Bsymbolic and adding the rcrt1.o to the beginning of the inputs. This looks like something of a hack, and on the GCC command line I would say it is, at least for -shared which is being used both to suppress the default crt1 file and to produce ET_DYN output without PT_INTERP. On the other hand,-Bstatic is just being used to suppress use of .so files to satisfy -l dependencies, and -Bsymbolic to produce relative relocations in the output instead of symbol references. Thankfully this gets a lot less ugly if you put it in the specs. Just replacing: #define LINK_PIE_SPEC "%{pie:-pie} " with: #define LINK_PIE_SPEC "%{pie:%{static:-shared -Bsymbolic;:-pie}} " causes -static -pie to invoke the linker in a manner which matches the desired static PIE model. Aside from this, a per-target addition to STARTFILE_SPEC is needed to make GCC choose rcrt1.o instead of Scrt1.o when -static is used with -pie, and to change the logic for crtbegin so that -pie's choice of crtbeginS.o overrides -static's crtbeginT.o, and likewise for crtend. Before proposing anything in the way of patches I'd like some feedback on whether this approach is acceptable for upstreaming in GCC. The obvious alternative to the LINK_PIE_SPEC change is making ld accept -static -pie and do "the right thing" on its side, but the startfile changes needed on the GCC side are the same either way. Rich
GNU Tools Cauldron 2015 - Call for Participation
Hello, A reminder about this year's Cauldron (http://gcc.gnu.org/wiki/cauldron2015). The webpage was just updated with a list of accepted talks and BoFs as well as a brief review of the schedule to let you plan your travel. With 19 of very relevant talks and 10 BoFs we will have a busy schedule. We are still able to accept last minute submissions to be scheduled in parallel with other presentations. To register your abstract, send e-mail to tools-cauldron-ad...@googlegroups.com. Please do so early so we can adjust schedule accordingly. If you submitted an abstract but it is not in the list, please let us know, too. If you intend to participate, but not necessarily present, please let us know as soon as possible. This will allow us to plan for space and food. Send a message to tools-cauldron-ad...@googlegroups.com stating your intent to participate. Jan
Re: Better info for combine results in worse code generated
On Mon, Jun 01, 2015 at 08:39:05AM -0500, Segher Boessenkool wrote: > On Mon, Jun 01, 2015 at 11:33:18AM +0930, Alan Modra wrote: > > Unifying andsi_mask with anddi_mask, and the fact that constraints for > > const_int see VOIDmode rather than the operand mode is why we get > > rldicr rather than rlwinm. Easily fixed by separating the si/di > > patterns, and with a little more work I may even be able to keep them > > together. > > Maybe just swapping T to be before S will do what you want, already? Nope. Index: gcc/config/rs6000/predicates.md === --- gcc/config/rs6000/predicates.md (revision 223878) +++ gcc/config/rs6000/predicates.md (working copy) @@ -764,7 +764,11 @@ if (TARGET_POWERPC64) { - /* Fail if the mask is not 32-bit. */ + /* Fail if the mask is not 32-bit. Note: If constraints are +implemented using mask_operand then they will never fail this +test. const_ints are VOIDmode, which is what is seen here +when called from a constraint. When called as a predicate, +the match_operand mode is seen. */ if (mode == DImode && (c & ~(unsigned HOST_WIDE_INT) 0x) != 0) return 0; The above, part of a patch I was writing to fix these problems, is why putting T before S doesn't work. eg. "T" matches 0x8000, which is good for SImode where you're really masking with 0x8000, but using rlwinm for the same constant in DImode would of course mask off the top 32 bits. > > In and3 expander I think you want the following since > > and64_2_operand covers the extra double-rotate cases, not all DImode. > > > > - if ((mode == DImode && !and64_2_operand (operands[2], mode)) > > - || (mode != DImode && !and_operand (operands[2], mode))) > > + if (!and_operand (operands[2], mode) > > + && (mode != DImode || !and64_2_operand (operands[2], > > mode))) > > and64_2_operand includes all of and_operand. I agree it is a mess. but and64_2_operand doesn't include all of and_operand! > > In and3_imm_mask_dot and and3_imm_mask_dot2. Typo? > > - && any_mask_operand (operands[2], mode)" > > + && !any_mask_operand (operands[2], mode)" > > Thinko; that whole line should just be removed. We prefer e.g. "rlwinm" > over "andi.", but "andi." over "rlwinm.". I'll do a patch. OK, I have a patch too.. > > get rid of WORD_REGISTER_OPERATIONS, > > rs6000 should not define it. What e.g. does it mean for mullw? Or, > worse, mulhw? Pretty much anything with "w" in its name is problematic. In many places WORD_REGISTER_OPERATIONS is used, it is saying "don't trust the high bits". At the moment we definitely do need it defined! > and gets rid of 2rld completely. That's a good idea. If we want it still, and I think we do, just implement the two rldicl/r in the expander. -- Alan Modra Australia Development Lab, IBM
Data race in PGO profile collection for multi-process program
Hi, I am trying PGO on Nginx, which has a main process and several worker processes. I find that the collected profile data files only contain information for the main process, which is probably a data race (the main process exits immediately after worker processes exit). How can I solve this problem? Should I switch to the sampling-based AutoFDO? Regards, Yuan, Pengfei
Re: Data race in PGO profile collection for multi-process program
Using AutoFDO is one way. For PGO, you may want to to try using __gcov_dump interface to explicitly control the timing and order of the profile dump --- i.e., invoke __gcov_dump in main process after work processes exit and before the main process exits. David On Mon, Jun 1, 2015 at 8:08 PM, Pengfei Yuan <0xcool...@gmail.com> wrote: > Hi, > > I am trying PGO on Nginx, which has a main process and several worker > processes. > I find that the collected profile data files only contain information > for the main process, which is probably a data race (the main process > exits immediately after worker processes exit). > How can I solve this problem? Should I switch to the sampling-based AutoFDO? > > Regards, > Yuan, Pengfei
Re: Data race in PGO profile collection for multi-process program
Thank you very much for the advice. I will try __gcov_dump first. Yuan 2015-06-02 12:14 GMT+08:00 Xinliang David Li : > Using AutoFDO is one way. For PGO, you may want to to try using > __gcov_dump interface to explicitly control the timing and order of > the profile dump --- i.e., invoke __gcov_dump in main process after > work processes exit and before the main process exits.
g++ regression: template function accessing a temporary through a lambda
Hi folks, the following C++ snippet used to compile, but doesn't anymore with current gcc trunk. Hope this helps! Thanks, Sebastien Alaiwan - // simple.cpp int getValueOfSix() { return 6; } template void f() { auto const c = getValueOfSix(); auto lambda = [&] () { auto a = c; auto b = c; }; } void g() { f<0>(); }
RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Friday, May 29, 2015 9:24 PM To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote: > I have Designed and implemented with the following design for the path > splitting of the loops with conditional IF-THEN-ELSE. > The implementation has gone through the bootstrap for Microblaze > target along DEJA GNU regressions tests and running the MIBench/EEMBC > benchmarks. There is no regression seen in Deja GNU tests and Deja C++ > tests results are better with the path splitting changes(lesser > failures). The Mibench/EEMBC benchmarks are run and no performance > degradation is seen and the performance improvement in telcom_gsm( > Mibench benchmarks) of 9.15% and rgbcmy01_lite(EEMBC > benchmarks) the performance improvement of 9.4% is observed. > > Proposal on the below were made earlier on path splitting and the reference > is attached. >>The first thing I notice is you haven't included any tests for the new >>optimization. We're trying really hard these days to have tests in the suite >>for this kind of >>new optimization/feature work. >>I don't offhand know if any of the benchmarks you cite above are free-enough >>to derive a testcase from. But one trick many of us use is to instrument the pass and compile some known free software (often gcc >>itself) to find triggering code and use that to generate tests for the new >>transformation. I will add tests in the suite. I could see many existing tests in the suite also get triggered with this optimization. > > Design changes. > > 1. The dominators of the block with conditional IF statements say BB1 > are found and the join node of the IF-THEN-ELSE inside the loops is found on > the blocks dominated by the BB1 and are not successor of BB1 are the join > node. This isn't necessarily my area of strength, but it doesn't seem right to me. I guess it would work if there aren't any more nodes in the loop after the join block, except for the loop latch, but it doesn't see right in the general case. It might work if you also verify that the IF block is the immediate dominator of potential join node. > > 2. The Join node is same as the source of the loops latch basic blocks. >>OK. This is why your initial algorithm in #1 works. Thanks. > > 3. If the above conditional in (1) and (2) are found the Join node > same as the source of the Loop latch node is moved into predecessors > and the Join node ( Source of the Loop latch node) is made empty statement > block with only the phi nodes. >>Hopefully you do this by duplicating the join node and manipulating the CFG >>so that the original predecessors of the join each go to their copy of the >>join >>node. The copied join nodes unconditionally then pass control to the >>original join node. Yes. > > 4. In the process of moving the Join node into its predecessors the > result of the phi node in the Join node propagated with the corresponding phi > arguments based on which predecessor it came from in the Join blocks and > move into its predecessors. >>Right. Note there's a phi-only cprop pass which is designed to do this >>propagation and in theory should be very fast. Thanks. > 5. The Dominator INFO is updated after performing the steps of (1) (2) (3) > and (4). >>Right. Getting the sequencing of the various updates can be tricky. >>Even more so if you try to do it as a part of jump threading because you have >>to (for example) deal with unreachable blocks in the CFG. > > 6. The Join which is same as the source of the Loop latch node is made empty > with only the phi node in the Join node. >>Right. Thanks. > > 7. The implementation is done in Jump threading phase of the machine > independent optimization on tree based > representation. The path splitting is called after the Jump threading > optimization is performed. > The following files are modifed. > > a) tree-vrp.c > b) tree-ssa-threadedge.c > c) tree-cfg.c > d) tree-ssa-threadedge.h > e) tree-cfg.h > f) cfghooks.c > > The diff is attached along with this mail and pasted below. >>I would recommend this as a separate pass. One of the goals for GCC 6 >>is to break out the threader into its own pass, if you're piggybacking >>on the threader it just makes this task harder. >>Also by implementing this as a distinct pass you have more freedom to >>put it where it really belongs in the optimization pipeline. Given the >>duplication may end up exposing partial redundancies, dead code & >>propagation opportunities, it's probably best placed before DOM. That >>also eliminates the need for running phi-only cprop because DOM will >>take care of it for you. >>When jump threading is