SoC Project: Propagating array data dependencies from Tree-SSA to RTL
Melnik: http://gcc.gnu.org/ml/gcc-patches/2005-11/msg01518.html -- Alexander Monakov
Re: SoC Project: Propagating array data dependencies from Tree-SSA to RTL
On Sun, 25 Mar 2007, Daniel Berlin wrote:
> Ayal has not signed up to be a mentor (as of yet). If he doesn't, i'd be
> happy to mentor you here, since i wrote part of tree-data-ref.c

Thanks, I'll be very glad to work with you.

On Mon, 26 Mar 2007, Ayal Zaks wrote:
> Sorry, I fear I may have too little time to devote to this; plus, it would
> be very useful to start with a good understanding of tree-data-ref from
> which to start propagating the dep info. Vladimir Yanovsky and I will be
> able to help when it comes to what/how to feed the modulo scheduler.

Thank you for your attention. I hope I will have a chance to ask you for help in the frame of the GSoC project.

-- Alexander Monakov
Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:
> I've not seen any follow-up to this version. Should we go ahead and adopt
> this?

Can we please go with 'committed' (lowercase) rather than all-caps COMMITTED? Spelling this with all-caps seems like a recent thing on gcc-patches; before, everyone used the lowercase version, which makes more sense (no need to shout about the thing that didn't need any discussion before applying the patch).

Also, while tools like 'git format-patch' will automatically put [PATCH] in the subject, for '[COMMITTED]' it will be the human typing that out, and it makes little sense to require people to meticulously type that out in caps, especially when the previous practice was the opposite.

Thanks.
Alexander
Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:
> Upper case is what glibc has, though it appears that it's a rule that is not
> strictly followed. If we change it, then it becomes another friction point
> between developer groups. Personally, I'd leave it as is, then turn a blind
> eye to such minor non-conformance.

In that case can we simply say that both 'committed' and 'COMMITTED' are okay, if we know glibc doesn't follow that rule and anticipate we will not follow it either?

Thanks.
Alexander
Re: Missed optimization with endian and alignment independent memory access on x64
On Thu, 6 Feb 2020, Moritz Strübe wrote:
> Why is this so hard optimize? As it's quite a common pattern I'd expect that
> there would be at least some hand-coded special case optimizer. (This isn't
> criticism - I'm honestly curious.) Or is there a reason gcc shouldn't optimize
> this / Why it doesn't matter that this is missed?

The compiler would need to exploit the fact that signed overflow is undefined, or deduce that it cannot happen. Imagine what happens in a more general case if i is INT_MAX (so without undefined overflow i+1 would be INT_MIN):

int f(unsigned char *ptr, int i)
{
  return ptr[i] | ptr[i+1] << 8;
}

With a 64-bit address space this might access two bytes 4GB apart. But you're right that it's a missed optimization in GCC, so you can file it in the GCC Bugzilla.

> Is there a way to write such code that gcc optimizes?

Simply write a function that accepts one pointer:

int load_16be(unsigned char *ptr)
{
  return ptr[0] << 8 | ptr[1];
}

and use it as load_16be(data+i) or load_16be(&data[i]).

> From a performance point of view: If I actually need two consecutive bytes,
> wouldn't it be better to load them as word and split them at the register
> level?

The question is not entirely clear to me, but usually the answer is that it depends on the microarchitecture and the details of the computations that need to be done with the loaded values. Often you'd need more than one instruction to "split" the wide load, so it wouldn't be profitable.

Alexander
Re: Branch instructions that depend on target distance
On Mon, 24 Feb 2020, Andreas Schwab wrote:
> On Feb 24 2020, Petr Tesarik wrote:
> > On Mon, 24 Feb 2020 12:29:40 +0100
> > Andreas Schwab wrote:
> >
> >> On Feb 24 2020, Petr Tesarik wrote:
> >>
> >> > This works great ... until there's some inline asm() statement, for
> >> > which gcc cannot keep track of the length attribute, so it is probably
> >> > taken as zero.
> >>
> >> GCC computes it by counting the number of asm insns. You can use
> >> ADJUST_INSN_LENGTH to adjust this as needed.
> >
> > Hmm, that's interesting, but does it work for inline asm() statements?
>
> Yes, for a suitable definition of work.
>
> > The argument is essentially a free-form string (with some
> > substitution), and the compiler cannot know how many bytes they occupy.
>
> That's why ADJUST_INSN_LENGTH can adjust it.

I think Petr might be unaware of the fact that GCC counts the number of instructions in an inline asm statement by counting separators in the asm string. This may overcount when a separator appears in a string literal, for example, but triggering under-counting is trickier.

Petr, please see https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html for some more discussion.

Alexander
GSoC topic: precise lifetimes in GIMPLE
Hi,

following the conversation in PR 90348, I wonder if it would make sense to suggest the idea presented there as a potential GSoC topic? Like this:

Enhance GIMPLE IR to represent lifetimes explicitly

At the moment, GCC's internal representation GIMPLE lacks precise lifetime information for addressable variables: GIMPLE marks the end of a variable's lifetime with the so-called "GIMPLE clobber" statement, corresponding to the point where the variable goes out of scope in the original program. However, the event of the "birth" of a variable (where it comes into scope) is lost, making the IR ambiguous and opening the door to invalid optimizations, as shown in bug 90348. The project would entail inventing a way to represent "lifetime start" in GIMPLE, adjusting front ends to emit it, implementing a verifier to check that optimizations do not move references outside of a variable's lifetime, and potentially enhancing optimizations to move lifetime markers, expanding the lifetime where necessary.

I know we already have good project ideas, and I suspect this idea may be too complicated for GSoC, but on the other hand it sounds useful and gives an "experimental" topic that may be interesting for students. What do you think?

Thanks.
Alexander
Re: GSoC topic: precise lifetimes in GIMPLE
On Mon, 2 Mar 2020, Richard Biener wrote:
> PR90348 is certainly entertaining. But I guess for a GSoC project
> we need a more elaborate implementation plan. The above suggesting
> of a "lifetime start" is IMHO a no-go btw. Instead I think the
> only feasible way is to make all references indirect and thus
> make both "allocation" and "deallocation" points explicit. Then
> there's a data dependence on the "allocation" statement which
> provides upward safety and the "deallocation" statement would
> need to act as a barrier in some to be determined way. That is,
> how to make optimizers preserve the lifetime end is still unsolved.

I think a verifier that ensures that all references are dominated by "lifetime start" and post-dominated by clobbers/lifetime-end would be a substantial step towards that. Agreed that a data dependence on the allocation would automatically ensure part of that verification, but then the problem with deallocation remains, as you say.

> IMHO whatever we do should combine with the CLOBBERs we have now,
> not be yet another mechanism.

This seems contradictory with the ideas in your previous paragraph. I agree though: CLOBBER-as-lifetime-end makes sense and does not call for a replacement.

Thanks.
Alexander
Clarifying attribute-const
Hello,

I'd like to ask for community input regarding __attribute__((const)) (and "pure", where applicable). My main goal is to clarify unclear cases and improve existing documentation, if possible.

First, a belated follow-up to https://gcc.gnu.org/PR66512 . The bug is asking why attribute-const appears to have a weaker effect in C++, compared to C. The answer in that bug is that GCC assumes that an attribute-const function can terminate by throwing an exception. That doesn't actually seem reasonable. Consider that the C counterpart to throwing is longjmp; it seems to me that GCC should behave consistently: either assume that attribute-const may both longjmp and throw (I guess nobody wants that), or that it may neither longjmp nor throw. Intuitively, if "const" means "free of side effects so that calls can be moved speculatively or duplicated", then non-local control flow transfer via throwing should be disallowed as well. In any case, it would be nice if the intended compiler behavior could be explicitly stated in the manual.

Second, there is an interesting mismatch between documentation and existing usage. Among the most prominent users of the attribute there are two glibc functions: __errno_location(void) and pthread_self(void). Both return a pointer to thread-local storage, so the functions are not "const" globally in a multi-threaded process. A sufficiently advanced compiler can cause the following testcase to abort:

#include <pthread.h>
#include <errno.h>

static void *errno_pointer;

static void *thr(void *unused)
{
  errno_pointer = &errno;
  return 0;
}

int main()
{
  errno_pointer = &errno;
  pthread_t t;
  pthread_create(&t, 0, thr, 0);
  pthread_join(t, 0);
  if (errno_pointer == &errno)
    abort();
}

(errno_pointer is static, so the compiler can observe that it does not escape the translation unit, and all stores in the TU assign the same "const" value)

Does GCC need to be concerned about eventually "miscompiling" such cases? If not, can we document an explicit promise that attribute-const may include pointers-to-TLS?

Thanks.
Alexander
Re: Clarifying attribute-const
On Fri, 25 Sep 2015, Eric Botcazou wrote:
> > First, a belated follow-up to https://gcc.gnu.org/PR66512 . The bug is
> > asking why attribute-const appears to have a weaker effect in C++, compared
> > to C. The answer in that bug is that GCC assumes that attribute-const
> > function can terminate by throwing an exception.
>
> FWIW there is an equivalent semantics in Ada: the "const" functions can throw
> and the language explicitly allows them to be CSEd in this case, etc.

Can you expand on the "etc." a bit, i.e., may the compiler ...

- move a call to a "const" function above a conditional branch, causing a
  conditional throw to happen unconditionally?

- move a call to a "const" function below a conditional branch, causing an
  unconditional throw to happen only conditionally?

- reorder calls to "const" functions w.r.t. code with side effects, or other
  throwing functions?

(all of the above in the context of Ada)

Thanks.
Alexander
Re: nonnull, -Wnonnull, and do/while
On Tue, 16 Feb 2016, Marek Polacek wrote:
> Well, it's just that "s" has the nonnull attribute so the compiler thinks it
> should never be null in which case comparing it to null should be redundant.
> Doesn't seem like a false positive to me, but maybe someone else feels
> otherwise.

Please look at the posted code again:

static void f(const char *s)
{
  do {
    printf("%s\n", s);
    s = NULL;
  } while (s != NULL);
}

Since 's' is assigned to, the constraint from 'printf' is no longer useful for warning at the point of comparison. It clearly looks like a false positive to me.

Alexander
Re: who owns stack args?
On Wed, 24 Feb 2016, DJ Delorie wrote:
> The real question is: are stack arguments call-clobbered or
> call-preserved? Does the answer depend on the "pure" attribute?

The stack area holding stack arguments should belong to the callee for tail calls to work (the callee will trash that area when laying out arguments for the tail call; thanks to Rich Felker for pointing that out to me). Thus it cannot depend on attribute-pure.

Alexander
GCC Bugzilla whines broken?
Hello,

Can anyone quickly confirm whether the "whining" feature in the GCC Bugzilla is supposed to be functioning at the moment? The latest thread I could find indicates that it is actually supposed to be working: https://gcc.gnu.org/ml/gcc/2010-09/msg00569.html . However, I tried to set up a whine for myself a week ago, and it never produced the emails.

Actually, I want a different feature than whining: notifications for bugs matching a certain predicate, e.g. for a specific target; ideally being automatically Cc'ed to such bugs, with an option to un-cc myself if needed. I can somewhat emulate that with whine searches restricted to "last N days". Is anybody doing something like that?

Thanks.
Alexander
Re: out of bounds access in insn-automata.c
Hi,

On Thu, 24 Mar 2016, Bernd Schmidt wrote:
> On 03/24/2016 11:17 AM, Aldy Hernandez wrote:
> > On 03/23/2016 10:25 AM, Bernd Schmidt wrote:
> > > It looks like this block of code is written by a helper function that is
> > > really intended for other purposes than for maximal_insn_latency. Might
> > > be worth changing to
> > >   int insn_code = dfa_insn_code (as_a <rtx_insn *> (insn));
> > >   gcc_assert (insn_code <= DFA__ADVANCE_CYCLE);
> >
> > dfa_insn_code_* and friends can return > DFA__ADVANCE_CYCLE so I can't
> > put that assert on the helper function.
>
> So don't use the helper function? Just emit the block above directly.

Let me chime in :)

The function under scrutiny, maximal_insn_latency, was added as part of the selective scheduling merge; at the same time, output_default_latencies was factored out of output_internal_insn_latency_func, and the pair of new functions output_internal_maximal_insn_latency_func/output_maximal_insn_latency_func tried to mirror the existing pair output_internal_insn_latency_func/output_insn_latency_func.

In particular, output_insn_latency_func also invokes output_internal_insn_code_evaluation (twice, once for each argument). This means that the generated 'insn_latency' can also call 'internal_insn_latency' with DFA__ADVANCE_CYCLE in its arguments. However, 'internal_insn_latency' then has a specially emitted 'if' statement that checks whether either of the arguments is '>= DFA__ADVANCE_CYCLE', and returns 0 in that case. So ultimately the pre-existing code was checking '> DFA__ADVANCE_CYCLE' first and '>= DFA__ADVANCE_CYCLE' second (for no good reason as far as I can see), and when the new '_maximal_' functions were introduced, the second check was not duplicated in the new copy.

So, as long as we are not looking at hacking it up further, I'd like to clean up both functions at the same time. If calling the 'internal_' variants with DFA__ADVANCE_CYCLE is rare, extending 'default_insn_latencies' by one zero element corresponding to DFA__ADVANCE_CYCLE is a simple suitable fix. If either DFA__ADVANCE_CYCLE is not guaranteed to be rare, or extending the table in that style is undesired, I suggest creating a variant of 'output_internal_insn_code_evaluation' that performs a '>=' rather than '>' test in the first place, and using it in both output_insn_latency_func and output_maximal_insn_latency_func.

If acknowledged, I volunteer to regstrap on x86_64 and submit that in stage1. Thoughts?

Thanks.
Alexander
[PATCH] clean up insn-automata.c (was: Re: out of bounds access in insn-automata.c)
On Wed, 30 Mar 2016, Bernd Schmidt wrote:
> On 03/25/2016 04:43 AM, Aldy Hernandez wrote:
> > If Bernd is fine with this, I'm happy to retract my patch and any
> > possible followups. I'm just interested in having no path causing a
> > possible out of bounds access. If your patch will do that, I'm cool.
>
> I'll need to see that patch first to comment :-)

Here's the proposed patch. I've found that there's only one user of the current fancy logic in output_internal_insn_code_evaluation: handling of NULL_RTX and const0_rtx is only useful for 'state_transition' (generated by output_trans_func), so it's possible to inline the extended handling there, and handle only plain non-null rtx_insns in output_internal_insn_code_evaluation.

This change allows removing the extra checks and casting in output_internal_insn_latency_func, as done by the patch. As a nice bonus, it constrains the prototypes of three automata functions to accept rtx_insn * rather than just rtx.

Bootstrapped and regtested on x86_64, OK?

Thanks.
Alexander

	* genattr.c (main): Change 'rtx' to 'rtx_insn *' in prototypes of
	'insn_latency', 'maximal_insn_latency', 'min_insn_conflict_delay'.
	* genautomata.c (output_internal_insn_code_evaluation): Simplify.
	Move handling of non-insn arguments inline into the sole user:
	(output_trans_func): ...here.
	(output_min_insn_conflict_delay_func): Change 'rtx' to 'rtx_insn *'
	in emitted function prototype.
	(output_internal_insn_latency_func): Ditto.  Simplify.
	(output_internal_maximal_insn_latency_func): Ditto.  Delete
	always-unused argument.
	(output_insn_latency_func): Ditto.
	(output_maximal_insn_latency_func): Ditto.

diff --git a/gcc/genattr.c b/gcc/genattr.c
index 656a9a7..77380e7 100644
--- a/gcc/genattr.c
+++ b/gcc/genattr.c
@@ -240,11 +240,11 @@ main (int argc, const char **argv)
   printf ("/* Insn latency time on data consumed by the 2nd insn.\n");
   printf ("   Use the function if bypass_p returns nonzero for\n");
   printf ("   the 1st insn. */\n");
-  printf ("extern int insn_latency (rtx, rtx);\n\n");
+  printf ("extern int insn_latency (rtx_insn *, rtx_insn *);\n\n");
   printf ("/* Maximal insn latency time possible of all bypasses for this insn.\n");
   printf ("   Use the function if bypass_p returns nonzero for\n");
   printf ("   the 1st insn. */\n");
-  printf ("extern int maximal_insn_latency (rtx);\n\n");
+  printf ("extern int maximal_insn_latency (rtx_insn *);\n\n");
   printf ("\n#if AUTOMATON_ALTS\n");
   printf ("/* The following function returns number of alternative\n");
   printf ("   reservations of given insn. It may be used for better\n");
@@ -290,8 +290,8 @@ main (int argc, const char **argv)
   printf ("state_transition should return negative value for\n");
   printf ("the insn and the state). Data dependencies between\n");
   printf ("the insns are ignored by the function. */\n");
-  printf
-    ("extern int min_insn_conflict_delay (state_t, rtx, rtx);\n");
+  printf ("extern int "
+	  "min_insn_conflict_delay (state_t, rtx_insn *, rtx_insn *);\n");
   printf ("/* The following function outputs reservations for given\n");
   printf ("   insn as they are described in the corresponding\n");
   printf ("   define_insn_reservation. */\n");
diff --git a/gcc/genautomata.c b/gcc/genautomata.c
index dcde604..92c8b5c 100644
--- a/gcc/genautomata.c
+++ b/gcc/genautomata.c
@@ -8113,14 +8113,10 @@ output_internal_trans_func (void)
 /* Output code

-     if (insn != 0)
-       {
-         insn_code = dfa_insn_code (insn);
-         if (insn_code > DFA__ADVANCE_CYCLE)
-           return code;
-       }
-     else
-       insn_code = DFA__ADVANCE_CYCLE;
+     gcc_checking_assert (insn != 0);
+     insn_code = dfa_insn_code (insn);
+     if (insn_code >= DFA__ADVANCE_CYCLE)
+       return code;

    where insn denotes INSN_NAME, insn_code denotes INSN_CODE_NAME, and
    code denotes CODE.  */
@@ -8129,21 +8125,12 @@ output_internal_insn_code_evaluation (const char *insn_name,
 				      const char *insn_code_name,
 				      int code)
 {
-  fprintf (output_file, "\n  if (%s == 0)\n", insn_name);
-  fprintf (output_file, "    %s = %s;\n\n",
-	   insn_code_name, ADVANCE_CYCLE_VALUE_NAME);
-  if (collapse_flag)
-    {
-      fprintf (output_file, "\n  else if (%s == const0_rtx)\n", insn_name);
-      fprintf (output_file, "    %s = %s;\n\n",
-	       insn_code_name, COLLAPSE_NDFA_VALUE_NAME);
-    }
-  fprintf (output_file, "\n  else\n    {\n");
-  fprintf (output_file,
-	   "      %s = %s (as_a <rtx_insn *> (%s));\n",
-	   insn_code_name, DFA_INSN_CODE_FUNC_NAME, insn_name);
-  fprintf (output_file, "      if (%s > %s)\n        return %d;\n    }\n",
-	   insn_code_name, ADVANCE_CYCLE_VALUE_NAME, code);
+  fprintf (output_file, "  gcc_checking_assert
Re: [RFD] Extremely large alignment of read-only strings
On Wed, 27 Jul 2016, Thorsten Glaser wrote:

First of all, I think the option -malign-data=abi (new in GCC 5) addresses your need: it can be used to reduce the default (excessive) alignment to just the psABI-dictated value (you can play with this at https://gcc.godbolt.org even if you can't install GCC 5 locally). Note that, as with other ABI-affecting options, you need to consider the implications for linking with code you're not building yourself: if the other code expects bigger alignment, you'll have a bug.

One comment on your email below.

> After some (well, lots) more debugging, I eventually
> discovered -fdump-translation-unit (which, in the version
> I was using, also worked for C, not just C++), which showed
> me that the alignment was 256 even (only later reduced to
> 32 as that’s the maximum alignment for i386).

Most likely the quoted figures from GCC dumps are in bits, not bytes.

HTH
Alexander
Re: [libgomp] No references to env.c -> no libgomp construction
Hello,

On Tue, 29 Nov 2016, Sebastian Huber wrote:
> * env.c: Split out ICV definitions into...
> * icv.c: ...here (new file) and...
> * icv-device.c: ...here. New file.
>
> the env.c contains now only local symbols (at least for target *-rtems*-*):
> [...]
>
> Thus the libgomp constructor is not linked in into executables.

Thanks for the report. This issue affects only the static libgomp.a (and not on NVPTX, where env.c is deliberately empty).

I think the minimal solution here is to #include <env.c> from icv.c instead of compiling it separately (using <> inclusion rather than "" so that in the case of NVPTX we pick up the empty config/nvptx/env.c from the toplevel icv.c).

A slightly more involved but perhaps preferable approach is to remove config/nvptx/env.c, introduce a LIBGOMP_OFFLOADED_ONLY macro, and use it to guard the inclusion of env.c from icv.c (which then can use the #include "env.c" form).

Thanks.
Alexander
Re: How to avoid constant propagation into functions?
[adding gcc@ for the compiler-testsuite-related discussion, please drop either gcc@ or gcc-help@ from Cc: as appropriate in replies]

On Wed, 7 Dec 2016, Segher Boessenkool wrote:
> > For example, this might have impact on writing test for GCC:
> >
> > When I am writing a test with noinline + noclone then my
> > expectation is that no such propagation happens, because
> > otherwise a test might turn trivial...
>
> The usual ways to prevent that are to add some volatile, or an
> asm("" : "+g"(some_var)); etc.

No, that doesn't sound right. As far as I can tell from looking at the GCC testsuite, the prevailing way is actually the noinline+noclone combo, not the per-argument asms or volatiles.

This behavior is new in gcc-7 due to the new IPA-VRP functionality, so -fno-ipa-vrp restores the old behavior. I think from the testsuite perspective the situation got a bit worse due to this, as now in existing testcases stuff can get propagated where the testcase used noinline+noclone to suppress propagation. This means that some testcases may get weaker and no longer test what they were supposed to. And writing new testcases gets less convenient too.

However, this actually demonstrates how noinline+noclone was not future-proof, and in a way backfired now. Should there be, ideally, a single 'noipa' attribute encompassing noinline, noclone, -fno-ipa-vrp, -fno-ipa-ra and all future transforms using inter-procedural knowledge?

Alexander
Re: How to avoid constant propagation into functions?
On Wed, 7 Dec 2016, Richard Biener wrote:
> >Agreed, that's what I've been using in the past for glibc test cases.
> >
> >If that doesn't work, we'll need something else. Separate compilation
> >of test cases just to thwart compiler optimizations is a significant
> >burden, and will stop working once we have LTO anyway.
> >
> >What about making the function definitions weak? Would that be more
> >reliable?
>
> Adding attribute((used)) should do the trick. It introduces unknown callers
> and thus without cloning disables IPA.

Hm, depending on the case I think this may not be enough: it thwarts IPA on the callee side, but still allows the compiler to optimize the caller: for example, deduce that the callee is pure/const (in which case optimizations in the caller may cause it to be called fewer times than intended, or never at all), apply IPA-RA, or (perhaps in the future) deduce that the callee always returns non-NULL and optimize the caller accordingly.

I think attribute-weak works to suppress IPA on the caller side, but that is not a good solution because it also has an effect on linking semantics, may not be available on non-ELF platforms, etc.

Alexander
Re: How to avoid constant propagation into functions?
On Fri, 9 Dec 2016, Richard Biener wrote:
> Right, 'used' thwarts IPA on the callee side only. noclone and noinline are
> attributes affecting the caller side but we indeed miss attributes for the
> properties you mention above. I suppose adding a catch-all attribute for
> caller side effects (like we have 'used' for the callee side) would be a good
> idea.

For general uses, i.e. for testcases that ought to be portable across different compilers, I believe making a call through a volatile pointer already places a sufficient compiler barrier to prevent both caller- and callee-side analysis. That is, where you have

  int foo(int);
  foo(arg);

you could transform it to

  int foo(int);
  int (* volatile vpfoo)(int) = foo;
  vpfoo(arg);

While this also has the effect of forcing the call to be indirect, I think usually that should be acceptable. But for uses in the gcc testsuite, I believe an attribute is still needed.

Alexander
Re: [RFC] noipa attribute (was Re: How to avoid constant propagation into functions?)
On Thu, 15 Dec 2016, Jakub Jelinek wrote:
> So here is a proof of concept of an attribute that disables inlining,
> cloning, ICF, IPA VRP, IPA bit CCP, IPA RA, pure/const/throw discovery.
> Does it look reasonable? Anything still missing?

I'd like to suggest some additions to the extend.texi entry:

> --- gcc/doc/extend.texi.jj	2016-12-15 11:26:07.0 +0100
> +++ gcc/doc/extend.texi	2016-12-15 12:19:32.738996533 +0100
> @@ -2955,6 +2955,15 @@
>  asm ("");
>  (@pxref{Extended Asm}) in the called function, to serve as a special
>  side-effect.
>
> +@item noipa
> +@cindex @code{noipa} function attribute
> +Disable interprocedural optimizations between the function with this
> +attribute and its callers, as if the body of the function is not available
> +when optimizing callers and the callers are unavailable when optimizing
> +the body.  This attribute implies @code{noinline}, @code{noclone},
> +@code{no_icf} and @code{used} attributes and in the future might
> +imply further newly added attributes.

1. I believe the last sentence should call out that the effect of this attribute is not reducible to just the existing attributes, because suppression of IPA-RA and pure/const discovery is not expressible that way, and that is actually intended. Can this be added to clarify the intent:

  However, this attribute is not equivalent to a combination of other
  attributes, because its purpose is to suppress existing and future
  optimizations employing interprocedural analysis, including those that
  do not have an attribute suitable for disabling them individually.

(and perhaps remove '... and in the future might imply ...' from the quoted snippet, because the clarification makes it partially redundant)

2. Can we gently suggest to readers of the documentation that this was invented for use in the GCC testsuite, and encourage them to seek proper alternatives, e.g.:

  This attribute is exposed for the purpose of testing the compiler.
  In general it should be preferable to properly constrain code
  generation using the language facilities: for example, using separate
  compilation or calling through a volatile pointer achieves a similar
  effect in a portable way [ except in case of a sufficiently advanced
  compiler indistinguishable from an adversary ;) ]

Thanks.
Alexander
LTO remapping/deduction of machine modes of types/decls
Hello, Richard, Jakub, community,

May I join/restart the old discussion about machine mode remapping at LTO stream-in time.

To recap: when offloading to NVPTX was introduced, there was a problem due to differences in the set of supported modes (e.g. there was no 'XFmode' on NVPTX that would correspond to the 'long double' tree type node in GIMPLE LTO streams produced by an x86 host compiler). The current solution in GCC is to additionally stream a 'mode table' and use it to remap numeric mode identifiers during LTO stream-in in all trees that have modes. This is the solution initially outlined by Jakub in the message https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00226.html . In response to that, Richard said,

> I think (also communicated that on IRC) we should instead try not streaming
> machine-modes at all but generating them at stream-in time via layout_type
> or layout_decl.

and later in the thread also:

> I'm just looking for a way to make this less of a hack (and the LTO IL
> less target dependent). Not for GCC 5 for which something like your
> patch is probably ok, but for the future.

Now that we're in the future, I've asked Vlad Ivanishin (Cc'ed) to try and implement Richard's approach. The motivation is enhancing LTO for offloaded code, in particular to expose library code for inlining. In that scenario, the current scheme has a problem that WPA can arbitrarily mix LTO sections coming from libraries (where the modes don't need remapping) and LTO sections produced by the host compiler. Thus, mode_table would need to be only selectively applied during stream-in, based on the origin of the section. And we'd need to ensure that WPA duplicates mode tables across all ltrans units. In light of that, I felt that trying Richard's approach would be proper.

Actually, I don't know why the gimple/tree representation carries machine modes in the first place; it seems to be redundant information deducible from type information.
Vlad's current patch adds mode deduction for types and decls, matches the deduced mode against the streamed-in mode, and ICEs in case of a mismatch. To be clear, he's checking this for native LTO via lto-bootstrap, but nevertheless it's a nice way of gaining confidence that mode inference works as intended.

This seems to be fine for C, but in C++ we are seeing some hard-to-explain cases where the deduced BLKmode for a 7-byte-sized/4-byte-aligned base-class decl mismatches against the deduced DImode. The DImode would be obviously correct for an 8-byte-sized decl of the corresponding type, but the base-class decl does not have the 1 byte of padding in the tail. What's worse, the issue is just for the mode of the decl: the mode of the type is BLKmode, as we'd expect. Unfortunately, just adjusting the C++ frontend to place BLKmode on the decl too doesn't lead to immediate success, because decl modes have implications for debug info generation, and the compiler starts ICE'ing there instead. So we're hitting under-documented places in the compiler here, and personally I don't have the confidence to judge how they're intended to work.

Basically, for now my questions are:

1. Is there an intended invariant that decl modes should match type modes? It appears that if there was, the above situation with C++ base objects would be a violation.

2. Do you think we should continue digging in this direction?

I'm not sure how much it'd help the discussion, but for completeness Vlad's current patchset is provided as attachments. Patch 1/3 adds mode inference for types (only), patch 2 just reverts Jakub's additions of mode_table handling, and finally patch 3 adds mode inference for decls, adds checking against streamed-in modes, and shows where the attempted adjustments in the C++ frontend and debug info generation were. There are a few coding style violations; sorry; I hope they are not too distracting.

Thanks.
Alexander

From 58ad9d4d75cbc057c003c701ff3f0e6b8fa35e39 Mon Sep 17 00:00:00 2001
From: Vladislav Ivanishin
Date: Tue, 13 Dec 2016 14:58:26 +0300
Subject: [PATCH 1/3] Infer modes from types after LTO streaming-in

	* gcc/lto/lto.c: New function lto_infer_mode () which calls ...
	* gcc/stor-layout.c: ... the new function set_mode_for_type ().
	* gcc/stor-layout.h: Declare set_mode_for_type ().
---
 gcc/lto/lto.c     |  20 +
 gcc/stor-layout.c | 127 ++
 gcc/stor-layout.h |   2 +
 3 files changed, 149 insertions(+)

diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index 6718fbbe..cec54e3 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -1656,6 +1656,25 @@ unify_scc (struct data_in *data_in, unsigned from,
   return unified_p;
 }

+static void
+lto_infer_mode (tree type)
+{
+  if (!TYPE_P (type))
+    return;
+
+  if (!COMPLETE_TYPE_P (type) && TYPE_MODE (type) == VOIDmode)
+    return;
+
+  /* C++ FE has complex logic for laying out classes.  We don't have
+     the information here to reproduce the decision process (nor do we w
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> In my view mode is essential part of the type system.  It (sadly, but still)
> participates in many ABI decisions, but more importantly especially for
> floating point types it is the main source of information of what the type
> actually is, as just size and precision are nowhere near enough.
> The precision/size isn't able to carry information like whether the type is
> decimal or binary floating, what padding it has and where, what NaN etc.
> conventions it uses.  So trying to throw away modes and reconstruct them
> looks conceptually wrong to me.

I wonder if it's possible to have a small tag in tree types to distinguish between binary/decimal/fixed-point types. For prototyping, I was thinking about just looking at the type name, because non-IEEE-binary types have an easily recognizable prefix.

For padding and NaN variability, can you point me to the targets where the modes affect that? The "Machine Modes" chapter in the documentation doesn't give a hint (IFmode/KFmode are not documented there either).

Alternatively, is reconstructing all modes necessary in the first place? On the tree level GCC has explicit trees for all fundamental types like float_type_node. Is it possible to remap just those trees? Modes of composite types should be deducible, and modes of decls may be deducible from their types (not sure; why do decls have modes separately from their types, anyway?)

> One can also just use
> float __attribute__((mode (XFmode))) or float __attribute__((mode (TFmode)))
> or float __attribute__((mode (KFmode))) or IFmode etc., how do you want to
> differentiate between those?  And I don't see how this can help with the
> long double stuff for NVPTX offloading.  If user uses 80-bit long double
> (or mode(XFmode) floats/doubles) in his source, then as PTX only has SFmode
> and DFmode (perhaps also HFmode?), the only way to get it working is through
> emulation (whether soft-fp, or writing some emulation using double,
> whatever).
> Pretending long double on the host is DFmode on the PTX side
> just won't work, they have different representation.

(yes, the PTX spec has half floats, but GCC doesn't implement those on PTX today, and thus doesn't have HFmode now)

For attribute-mode, I wasn't aware of the KFmode/IFmode stuff; wherever the modes affect semantics without leaving any other trace in the type, leaving out the mode loses information. So either one keeps the modes, or adds sufficient tagging in the type tree.

For long double, I think offloading to PTX should have the following semantics: size/alignment of long double matches those on the host; otherwise, storage layout of composite types won't match, and that's really bad. But otherwise long double is the same as double on PTX (so for offloading from x86-64 it has 64 bits of padding). This means that long double data is not transferable between accelerator and host, but otherwise this gives the most sane semantics I can imagine. I think this implies that XFmode/TFmode don't need to exist on NVPTX to back long_double_type_node.

Thanks.
Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote: > If the host has long double the same as double, sure, PTX can use its native > DFmode even for long double. But otherwise, the storage must be > transferable between accelerator and host. Hm, sorry, the 'must' is not obvious to me: is it known that the OpenMP ARB would find only this implementation behavior acceptable? Apart from floating-point types, are there other situations where modes carry information not deducible from the rest of the tree node? Thanks. Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> On Mon, Jan 02, 2017 at 09:49:55PM +0300, Alexander Monakov wrote:
> > On Mon, 2 Jan 2017, Jakub Jelinek wrote:
> > > If the host has long double the same as double, sure, PTX can use its
> > > native DFmode even for long double.  But otherwise, the storage must be
> > > transferable between accelerator and host.
> >
> > Hm, sorry, the 'must' is not obvious to me: is it known that the OpenMP ARB
> > would find only this implementation behavior acceptable?
>
> long double is not non-mappable type in the spec, so it is supposed to work.
> The implementation may choose not to offload whenever it sees long
> double/__float128/_Float128/_Float128x etc.

But this is not something the implementation can properly enforce; consider

  long double v;
  char buf[sizeof v];
  #pragma omp target map(from:buf)
  sscanf ("1.0", "%Lf", buf);
  memcpy (&v, buf, sizeof v);

The offloading compiler wouldn't see a 'long double' anywhere: it gets brought in at the linking stage. So the implementation would have to tag individual translation units and see only at the end of linking whether the offloaded image touches a long double datum anywhere. And as the example shows, it would prevent using printf-like functions.

Alexander
Re: TARGET_MACRO_FUSION_PAIR for something besides compare-and-branch ?
On Wed, 28 May 2014, Kyrill Tkachov wrote:

> Hi all,
>
> The documentation for TARGET_MACRO_FUSION_PAIR says that it can be used to
> tell the scheduler that two insns should not be scheduled apart. It doesn't
> specify what kinds of insns those can be.
>
> Yet from what I can see in sched-deps.c it can only be used on compares and
> conditional branches, as implemented in i386.

Please note that it's not only restricted to compares and conditional branches; it is also limited to keeping the instructions together only if they were consecutive in the first place (i.e. it does not try to move a compare insn closer to the branch). Doing it that way allowed the issue at hand to be solved at the time without a separate scan of the whole RTL instruction stream.

> Say I want to specify two other types of instruction that I want to force
> together, would it be worth generalising the TARGET_MACRO_FUSION_PAIR usage
> to achieve that?

I'd say yes, but that would be the least of the problems; the more important question is how to trigger the hook (you probably want to integrate it into the existing scheduler dependence evaluation loop rather than adding a new loop just to discover macro-fusable pairs). You'll also have to invent something new if you want to move non-consecutive fusable insns together when they are apart.

HTH.
Alexander
Re: Branch taken rate of Linux kernel compiled with GCC 4.9
On Tue, 13 Jan 2015, Pengfei Yuan wrote: > I use perf with rbf88:k,rff88:k events (Haswell specific) to profile > the taken rate of conditional branches in the kernel. Here are the > results: [...] > > The results are very strange because all the taken rates are greater > than 50%. Why not reverse the basic block reordering heuristics to > make them under 50%? Is there anything wrong with GCC? Your measurement includes the conditional branches at the end of loop bodies. When loops iterate, those branches are taken, and it doesn't make sense to reverse them. HTH Alexander
Re: Failure to dlopen libgomp due to static TLS data
There's a pending patch for glibc that addresses this issue among others: https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html ([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS limit) Alexander
Re: Android native build of GCC
> Given that info... and in spite of my aforementioned limited knowledge I
> went back to take another look at the source and found this in
> libfakechroot.c
>
> /bld/fakechrt/fakechroot-2.16 $ grep -C 4 dlsym src/libfakechroot.c
> /* Lazily load function */
> LOCAL fakechroot_wrapperfn_t fakechroot_loadfunc (struct fakechroot_wrapper * w)
> {
>     char *msg;
>     if (!(w->nextfunc = dlsym(RTLD_NEXT, w->name))) {;
>         msg = dlerror();
>         fprintf(stderr, "%s: %s: %s\n", PACKAGE, w->name, msg != NULL ? msg : "unresolved symbol");
>         exit(EXIT_FAILURE);
>     }
>
> I'm fairly certain I remember reading something about Android and lazy
> function loading... how it doesn't handle it well or does so differently
> from standard Linux builds. At any rate, I believe the above code is
> responsible for those annoying 'fakechroot: undefined reference to dlopen'
> errors, so I'll see if I can fix that.

In Android's Bionic libc, the implementation of dlopen() resides in the dynamic loader and is not present in libdl.so. So to obtain a pointer to dlopen, code like the above can use dlsym(RTLD_DEFAULT, "dlopen"), but not RTLD_NEXT (the loader precedes the fakeroot library in the lookup chain).

The preceding discussion seems to have libc and libdl switched. Normally the implementation of dlopen is found in libdl.so, but not in libc.so.

Hope that helps,
Alexander
Re: Android native build of GCC
On Sun, 15 Feb 2015, Cyd Haselton wrote:

> On Sun, Feb 15, 2015 at 12:41 PM, Cyd Haselton wrote:
> > *snip*
> >
> >> So to obtain the pointer to
> >> dlopen the code like above can use dlsym(RTLD_DEFAULT, "dlopen"), but not
> >> RTLD_NEXT (the loader precedes the fakeroot library in the lookup chain).
> >>
> *snip*
>
> Just a quick update: RTLD_DEFAULT is definitely not the solution here
> as it results in a segfault when libfakechroot loads.  Perhaps a
> different RTLD_FLAG was meant?

I think you need to use RTLD_DEFAULT only when resolving "dlopen", and keep RTLD_NEXT for all other symbol names (unless later on you run into other symbols with similar behavior). Something like

  ... = dlsym(strcmp(name, "dlopen") ? RTLD_NEXT : RTLD_DEFAULT, name);

Alexander
Re: Android native build of GCC
On Sun, 15 Feb 2015, Cyd Haselton wrote:

> On Sun, Feb 15, 2015 at 11:53 AM, Alexander Monakov wrote:
> >> Given that info... and in spite of my aforementioned limited knowledge I
> >> went back to take another look at the source and found this in
> >> libfakechroot.c
> >>
> >> /bld/fakechrt/fakechroot-2.16 $ grep -C 4 dlsym src/libfakechroot.c
> >> /* Lazily load function */
> >> LOCAL fakechroot_wrapperfn_t fakechroot_loadfunc (struct fakechroot_wrapper * w)
> >> {
> >>     char *msg;
> >>     if (!(w->nextfunc = dlsym(RTLD_NEXT, w->name))) {;
> >>         msg = dlerror();
> >>         fprintf(stderr, "%s: %s: %s\n", PACKAGE, w->name, msg != NULL ? msg : "unresolved symbol");
> >>         exit(EXIT_FAILURE);
> >>     }
> >>
> >> I'm fairly certain I remember reading something about Android and lazy
> >> function loading... how it doesn't handle it well or does so differently
> >> from standard Linux builds. At any rate, I believe the above code is
> >> responsible for those annoying 'fakechroot: undefined reference to dlopen'
> >> errors, so I'll see if I can fix that.
> >
> > In Android's Bionic libc, the implementation of dlopen() resides in the
> > dynamic loader, and is not present in libdl.so.
>
> Yet in Android's NDK documentation, they state that in order to use
> dlopen() functionality in native code you must link against libdl and
> include dlfcn.h.  Why would this be the case if the dlopen()
> implementation is not in libdl?
> (documentation link: http://www.kandroid.org/ndk/docs/STABLE-APIS.html)

That's the standard way of using dlopen, i.e. the same as you would do it on Linux with glibc, for example. So the link merely says that you can get dlopen the same way as usual. The difference is that Android's libdl only contains stub symbols for dlopen & co., and the real symbols can be looked up in the dynamic loader.

That RTLD_NEXT does not work for obtaining a pointer to dlopen, as it does on glibc, is quite unfortunate, and probably a bug in Bionic.

Alexander
Broken test gcc.target/i386/sibcall-2.c
Hello,

Last year's x86 sibcall improvements added a currently xfailed test:

/* { dg-do compile { target ia32 } } */
/* { dg-options "-O2" } */

extern int doo1 (int);
extern int doo2 (int);
extern void bar (char *);

int foo (int a)
{
  char s[256];
  bar (s);
  return (a < 0 ? doo1 : doo2) (a);
}

/* { dg-final { scan-assembler-not "call\[ \t\]*.%eax" { xfail *-*-* } } } */

It was xfailed by https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00016.html

Can you tell me what the test is supposed to test? A tail call is impossible here, because 'bar' might save the address of 's' in a global variable, and therefore 's' must be live when 'doo1' or 'doo2' is invoked.

Should we remove or unbreak this test?

Thanks.
Alexander
Re: Broken test gcc.target/i386/sibcall-2.c
Ah, I realize it's most likely for testing the sibcall_[value]_pop_memory peepholes, right? In which case the testcase might look like this:

/* { dg-do compile } */
/* { dg-options "-O2" } */

void foo (int a, void (**doo1) (void), void (**doo2) (void))
{
  char s[16] = {0};
  do
    s[a] = 1;
  while (a &= a-1);
  (*(s[8] ? doo1 : doo2)) ();
}

/* { dg-final { scan-assembler-not "call" } } */

However, on the above testcase a memory-indirect jump is currently generated only for 64-bit x86. With -mx32 it's impossible, but with -m32 the peephole doesn't match. Is that expected?

Can you also tell me why the ..._pop call and sibcall instructions are predicated on !TARGET_64BIT?

Thanks.
Alexander
Re: May 2015 Toolchain Update
Hello,

A couple of comments below.

On Mon, 18 May 2015, Nick Clifton wrote:

>   val |= ~0 << loaded;            // Generates warning
>   val |= (unsigned) ~0 << loaded; // Does not warn

To reduce verbosity, '~0u' can be used here instead of a cast.

> * GCC supports a new option: -fno-plt
>
>   When compiling position independent code this tells the compiler
>   not to use PLT for external function calls.  Instead the address
>   is loaded from the GOT and then branched to directly.  This
>   leads to more efficient code by eliminating PLT stubs and
>   exposing GOT load to optimizations.
>
>   Not all architectures support this option, and some other
>   optimization features, such as lazy binding, may disable it.

The last paragraph looks confusing to me on both points. '-fno-plt' is implemented as a transformation during Tree-SSA-to-RTL expansion, so it works in a machine-independent manner; it's a no-op only if the target has no way to turn on '-fPIC'. Is that what you meant?

Second, lazy binding is not an optimization feature of GCC (it's implemented as part of the dynamic linker, e.g. glibc's), so it's not quite right to say that -fno-plt would be disabled by it. The text I've added to the documentation says:

  Lazy binding requires PLT: with -fno-plt all external symbols are
  resolved at load time.

Thus, for code compiled with -fno-plt the dynamic linker would not be able to perform lazy binding (even if it was otherwise possible, e.g. -z now/-z relro weren't in effect, and profitable, i.e. the library was not already prelinked).

Alexander
Re: RFC: Creating a more efficient sincos interface
On Thu, 13 Sep 2018, Wilco Dijkstra wrote: > What do people think? Ideally I'd like to support this in a generic way so > all targets can > benefit, but it's also feasible to enable it on a per-target basis. Also > since not all libraries > will support the new interface, there would have to be a flag or configure > option to switch > the new interface off if not supported (maybe automatically based on the > math.h header). GCC already has __builtin_cexpi for this, so I think you can introduce cexpi implementation in libc, and then adjust expand_builtin_cexpi appropriately. I wonder if it would be possible to add a fallback cexpi implementation to libgcc.a that would be picked by the linker if there's no such symbol in libm? Alexander
libgcov as shared library and other issues
Hello,

Here's the promised "libgcov summary"; sorry about the delay.

So libgcov has a somewhat unusual design where:

- on the one hand, the library is static-only, has PIC code and may be linked into shared libraries;
- almost all gcov symbols have "hidden" visibility, so they don't participate in dynamic linking;
- on the other hand, the __gcov_master symbol deliberately has default visibility, presumably with the intention that a running program has exactly one instance of this symbol, although the exact motivation is unclear to me.

This latter point does not reliably work as intended, though: there are scenarios where a dynamically linked program will have multiple __gcov_masters anyway:

- via repeated dlopen(RTLD_LOCAL) with the main executable not linked against libgcov or not exporting libgcov symbols (as in PR 83879);
- when shared libraries have version scripts that hide their __gcov_master;
- when -Bsymbolic is in effect.

Additionally, indirect call profiling symbols are not hidden either, and that leads to extra complications. Since there are multiple symbols, during dynamic linking they may be partially interposed. PR 84107 demonstrates how this leads to libgcov segfaulting in a fairly simple and legitimate program.

Bottom line: statically linking code with default-visibility symbols into shared libraries is problematic.

So one strategy is to ensure all gcov symbols have hidden visibility. That would isolate gcov instances in each shared library loaded in the program, and each library would have the responsibility to write out its counters when unloaded. Also, __gcov_dump would dump only the counters specific to the current library. I may be missing something here, so it might be nice to unearth why exactly __gcov_master is intended to be global.

Another strategy is to introduce libgcov.so and have it host either all libgcov symbols or just those that by design are required to exist once in the program.
When talking to Richi at the Cauldron, I got the impression he'd question whether a shared libgcov is worth the cost, e.g. whether it would make it any easier for users to mix two libraries, one linked against an older libgcov and another against a newer one (something that doesn't work at all now, but would be nice to support, if I understood Richard correctly).

Alexander
Re: Backporting gcc_qsort
On Mon, 1 Oct 2018, Jeff Law wrote: > To add a bit more context for Cory. > > Generally backports are limited to fixing regressions and serious code > generation bugs. While we do make some exceptions, those are good > general guidelines. > > I don't think the qsort changes warrant an exception. Personally I think in this case there isn't a strong reason to backport, the patch is fairly isolated, so individuals or companies that need it should have no problem backporting it on their own. Previously, Franz Sirl reported back in June they've used the patch to achieve matching output on their Linux-hosted vs Cygwin-hosted cross-compilers based on GCC 8: https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00751.html Alexander
Re: movmem pattern and missed alignment
On Mon, 8 Oct 2018, Michael Matz wrote:

> > Ok, but why is that not a bug?  The whole point of passing alignment to
> > the movmem pattern is to let it generate code that takes advantage of
> > the alignment.  So we get a missed optimization.
>
> Only if you somewhere visibly add accesses to *i and *j.  Without them you
> only have the "accesses" via memcpy, and as Richi says, those don't imply
> any alignment requirements.  The i and j pointers might validly be char*
> pointers in disguise and hence be in fact only 1-aligned.  I.e. there's
> nothing in your small example program from which GCC can infer that those
> two global pointers are in fact 2-aligned.

Well, it's not that simple. C11 6.3.2.3 p7 makes it undefined to form an 'int *' value that is not suitably aligned:

  A pointer to an object type may be converted to a pointer to a different
  object type. If the resulting pointer is not correctly aligned for the
  referenced type, the behavior is undefined.

So in addition to what you said, we should probably say that GCC decides not to exploit this UB in order to allow code to round-trip pointer values via arbitrary pointer types.

To put Michael's explanation in different words: this is not obviously a bug, because the static pointer type does not imply the dynamic pointed-to type. The caller of 'f1' could look like

void call_f1(void)
{
  short ibuf[20] = {0}, jbuf[20] = {0};
  i = (void *) ibuf;
  j = (void *) jbuf;
  f1();
}

and it's valid to memcpy from jbuf to ibuf: memcpy does not "see" the static pointer type, and works as if by dereferencing 'char *' pointers (although, as mentioned above, it's more subtly invalid when assigning to i and j).

If 'f1' dereferences 'i', GCC may deduce that the dynamic type of '*i' is 'int' and therefore 'i' must be suitably aligned. But in the absence of dereferences GCC does not make assumptions about dynamic type and alignment.

Alexander
Re: movmem pattern and missed alignment
On Tue, 9 Oct 2018, Richard Biener wrote: > >This had worked as Paul expects until GCC 4.4 IIRC and this was perfectly OK > >for every language on strict-alignment platforms. This was changed only > >because of SSE on x86. > > And because we ended up ignoring all pointer casts. It's not quite obvious what SSE has to do with this - any hint please? (according to my quick check this changed between gcc-4.5 and gcc-4.6) Alexander
Re: movmem pattern and missed alignment
On Tue, 9 Oct 2018, Richard Biener wrote:

> then we cannot set the alignment of i_1 at/after k = *i_1 because doing so
> would affect the alignment test which we'd then optimize away.  We'd need
> to introduce a SSA copy to get a new SSA name but that would be optimized
> away quickly.

We preserve __builtin_assume_aligned up to pass-fold-all-builtins, so would it work to emit it just before the memcpy,

  i_2 = __builtin_assume_aligned (i_1, 4);
  __builtin_memcpy (j, i_2, 32);

in theory?

Alexander
Re: avoidance of lea after 5 operations?
On Thu, 11 Oct 2018, Jason A. Donenfeld wrote: > > I realize this is probably a fairly trivial matter, but I am very > curious if somebody knows which heuristic gcc is applying here, and > why exactly. It's not something done by any other compiler I could > find, and it only started happening with gcc 6. It's a change in register allocation, gcc selects eax instead of esi for the shifts. Doesn't appear to be obviously intentional, could be a bug or bad luck. Alexander
Re: Is it a bug allowing to copy GIMPLE_ASM with labels?
On Sat, 29 Dec 2018, Bin.Cheng wrote: > tracer-1.c: Assembler messages: > tracer-1.c:16: Error: symbol `foo_label' is already defined > > Root cause is in tracer.c which duplicates basic block without > checking if any GIMPLE_ASM defines labels. > Is this a bug or invalid code? This is invalid code, GCC documentation is clear that the compiler may duplicate inline asm statements (passes other than tracer can do that too, loop unswitching just to give one example). We don't provide a way to write an asm that wouldn't be duplicated. Alexander
Re: Idea: extend gcc to save C from the hell of intel vector instructions
On Wed, 20 Feb 2019, Warren D Smith wrote: > but if I try to replace that with the nicer (since more portable) >c = __builtin_shuffle(a, b); > then > error: use of unknown builtin '__builtin_shuffle' > [-Wimplicit-function-declaration] Most likely you're on OS X and the 'gcc' command actually invokes Clang/LLVM. Clang does not implement this builtin (there's __builtin_shufflevector with a different interface — see Clang documentation for details). Alexander
Re: GCC turns &~ into | due to undefined bit-shift without warning
On Thu, 21 Mar 2019, Richard Biener wrote:

> > Maybe an example would help.
> >
> > Consider this code:
> >
> > for (int i = start; i < limit; i++) {
> >   foo(i * 5);
> > }
> >
> > Should GCC be entitled to turn it into
> >
> > int limit_tmp = i * 5;
> > for (int i = start * 5; i < limit_tmp; i += 5) {
> >   foo(i);
> > }
> >
> > If you answered "Yes, GCC should be allowed to do this", would you
> > want a warning? And how many such warnings might there be in a typical
> > program?
>
> I assume i is signed int.  Even then GCC may not do this unless it knows
> the loop is entered (start < limit).

Additionally, the compiler needs to prove that 'foo' always returns normally (i.e. cannot invoke exit/longjmp or such).

Alexander
Re: [RFC] Disabling ICF for interrupt functions
On Fri, 19 Jul 2019, Jozef Lawrynowicz wrote: > For MSP430, the folding of identical functions marked with the "interrupt" > attribute by -fipa-icf-functions results in wrong code being generated. > Interrupts have different calling conventions than regular functions, so > inserting a CALL from one identical interrupt to another is not correct and > will result in stack corruption. But ICF by creating an alias would be fine, correct? As I understand, the real issue here is that gcc does not know how to correctly emit a call to "interrupt" functions (because they have unusual ABI and exist basically to have their address stored somewhere). So I think the solution shouldn't be in disabling ICF altogether, but rather in adding a way to recognize that a function has quasi-unknown ABI and thus not directly callable (so any other optimization can see that it may not emit a call to this function), then teaching ICF to check that when deciding to fold by creating a wrapper. (would it be possible to tell ICF that addresses of interrupt functions are not significant so it can fold them by creating aliases?) Alexander
Re: [RFC] Disabling ICF for interrupt functions
On Mon, 22 Jul 2019, Jozef Lawrynowicz wrote: > This would have to be caught at the point that an optimization pass > first considers inserting a CALL to the interrupt, i.e., if the machine > description tries to prevent the generation of a call to an interrupt function > once the RTL has been generated (e.g. by blanking on the define_expand for > "call"), we are going to have ICEs/wrong code generated a lot of the time. > Particularly in the case originally mentioned here - there would be an empty > interrupt function. Yeah, I imagine it would need to be a new target hook direct_call_allowed_p receiving a function decl, or something like that. > > (would it be possible to tell ICF that addresses of interrupt functions are > > not significant so it can fold them by creating aliases?) > > I'll take a look. Sorry, I didn't say explicitly, but that was meant more as a remark to IPA maintainers: currently in GCC "address taken" implies "address significant", so "address not significant" would have to be a new attribute, or a new decl bit (maybe preferable for languages where function addresses are not significant by default). Alexander
Re: asking for __attribute__((aligned()) clarification
On Mon, 19 Aug 2019, Richard Earnshaw (lists) wrote:

> Correct, but note that you can only pack structs and unions, not basic types.
> There is no way of under-aligning a basic type except by wrapping it in a
> struct.

I don't think that's true. In GCC 9 the documentation for the 'aligned' attribute has been significantly revised, and now ends with:

  When used as part of a typedef, the aligned attribute can both increase and
  decrease alignment, and specifying the packed attribute generates a warning.

(but I'm sure the de facto behavior of accepting and honoring reduced alignment on a typedef'ed scalar type goes back way earlier than gcc-9)

Alexander
Re: Aw: Re: asking for __attribute__((aligned()) clarification
On Tue, 20 Aug 2019, Markus Fröschle wrote:

> Thank you (and others) for your answers. Now I'm just as smart as before,
> however.
>
> Is it a supported, documented, 'long term' feature we can rely on or not?
>
> If yes, I would expect it to be properly documented. If not, never mind.

I think it's properly documented in gcc-9:
https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Common-Type-Attributes.html

(the "old" behavior, where the compiler would neither honor reduced alignment nor issue a warning, seems questionable; the new documentation promises a more sensible approach)

In portable code one can also use memcpy to move unaligned data; the compiler should translate it to an unaligned load/store when the size is a suitable constant:

  int val;
  memcpy(&val, ptr, sizeof val);

(or __builtin_memcpy when -ffreestanding is in effect)

Alexander
Re: asking for __attribute__((aligned()) clarification
On Wed, 21 Aug 2019, Paul Koning wrote:

> I agree, but if the new approach generates a warning for code that was
> written to the old rules, that would be unfortunate.

FWIW, I don't know which GCC versions accepted 'packed' on a scalar type. Already in 2006 GCC 3.4 would issue a warning:

$ echo 'typedef int ui __attribute__((packed));' | gcc34 -xc - -S -o-
        .file   ""
:1: warning: `packed' attribute ignored
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.6 20060404 (Red Hat 3.4.6-4)"

> Yes.  But last I tried, optimizing that for > 1 alignment is problematic
> because that information often doesn't make it down to the target code even
> though it is documented to do so.

Thanks; indeed, this memcpy solution is not so well suited for that.

Alexander
Re: -fpatchable-function-entry: leverage multi-byte NOP on x86
On Mon, 6 Jan 2020, Martin Liška wrote: > You are right, we do not leverage multi-byte NOPs. Note that the support > depends > on a CPU model (-march) and the similar code is quite complex in binutils: > https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/config/tc-i386.c;h=d0b8f2624a1885d83d2595474bfd78ae844f48f2;hb=HEAD#l1441 > > I'm not sure how worthy would it be to implement that? Huh? Surely the right move would be to ask Binutils to expose that via a new pseudo-op, like .balign but requesting a specific space rather than aligning up to a boundary. Alexander
Re: [RFC] builtin functions and `-ffreestanding -nostartfies` with static binaries
On Fri, 10 Jan 2020, Siddhesh Poyarekar wrote:

> I spent some time thinking about this and while it's trivial to fix by
> disabling ifuncs for static glibc, I wanted a solution that wasn't such
> a big hammer.  The other alternative I could think of is to have an
> exported alias (called __builtin_strlen for example instead of strlen)
> of a default implementation of the builtin function in glibc that gcc
> generates a call to if freestanding && nostartfiles && static.

In the Linaro bug report you mention:

> Basically, IFUNCs and freestanding don't mix.

but really any libc (glibc included) and -nostartfiles don't mix: stdio won't be initialized, TLS won't be set up, and pretty much all other libc-internal data structures won't be properly set up. Almost no libc functions are callable, because, for example, if they try to access 'errno', they crash.

Looking at the opening comment of the failing kselftest source:

 * This program tries to be as small as possible itself, to
 * avoid perturbing the system memory utilization with its
 * own execution.  It also attempts to have as few dependencies
 * on kernel features as possible.
 *
 * It should be statically linked, with startup libs avoided.
 * It uses no library calls, and only the following 3 syscalls:
 *   sysinfo(), write(), and _exit()

so in fact allowing it to link with libc's strlen would be contrary to its intent. The fix is simple: add -nodefaultlibs next to -nostartfiles in its Makefile, and write a trivial loop in place of __builtin_strlen.

Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Wed, 4 Jan 2017, Richard Biener wrote: > My suggestion at that time isn't likely working in practice due to the > limitations Jakub outlines below. The situation is a bit unfortunate > but expect to run into more host(!) dependences in the LTO bytecode. > Yes, while it would be nice to LTO x86_64->arm and ppc64le->arm > LTO bytecode it very likely isn't going to work. Yes, I think it's not really practical to seek wide portability of LTO bytecode. After all, platform specifics get into constant expressions (e.g. 'int p = sizeof (void *);') and are also observable on the preprocessor level (e.g. via '#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__'). However the accelerator platform must be compatible with the host platform in almost all ABI (storage layout?) features such as type sizes and alignments, endianness, default signedness of char, bitfield layout, and possibly others (but yet in the other subthread I was arguing that compromising and making 'long double' only partially compatible makes sense). Thus, portability issue is much smaller in scope here. I think it's a bit unfortunate that the discussion really focused on the trouble with floating-point types. I'd really appreciate any insight on the other questions that were raised, such as whether the decl mode should match that decl's type mode. For floating types, I believe in the long run it should be possible to tag tree type nodes with the floating-point type 'kind' such as IEEE binary, IEEE decimal, accum/fract/sat, or IBM double-double. For our original goal, I think we'll switch to the other solution I've outlined in the opening mail, i.e. propagating mode tables at WPA stage and keeping enough information to know if the section comes from the host or native compiler. Thanks. Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Tue, 10 Jan 2017, Richard Biener wrote: > In general I think they should match. But without seeing concrete > examples of where they do not I can't comment on whether such exceptions > make sense. For example if you adjust a DECLs alignment and then > re-layout it I'd expect you might get a non-BLKmode mode for an > aggregate in some circumstances -- but then decl and type are not 1:1 > compatible (due to different alignment), but this case is clearly desired > as requiring type copies for the sake of alignment would be wasteful. Thanks; Vlad will follow up with (I believe) a different kind of mismatch originating in the C++ front-end. > > For our original goal, I think we'll switch to the other solution I've > > outlined in the opening mail, i.e. propagating mode tables at WPA stage > > and keeping enough information to know if the section comes from the > > host or native compiler. > > So add a hack ontop of the hack? Ugh. So why exactly doesn't it > already work? It looks like decls and types have their modes > "fixed" with the per-file mode table at WPA time. So what is missing > is to "fix" modes in the per-function sections that are not touched > by WPA? WPA re-streams packed function bodies as-is, so anything referred to from within just the body won't be subject to mode remapping; I think only modes of toplevel declarations and functions' arguments will be remapped. And I believe it wouldn't be acceptable to unpack/remap/repack function bodies at WPA stage (it's contrary to the LTO scalability goal). Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Wed, 11 Jan 2017, Richard Biener wrote: > > WPA re-streams packed function bodies as-is, so anything referred to > > from within just the body won't be subject to mode remapping; I think > > only modes of toplevel declarations and functions' arguments will be > > remapped. And I believe it wouldn't be acceptable to unpack/remap/repack > > function bodies at WPA stage (it's contrary to LTO scalability goal). > > Yes indeed. But this means the mode-maps have to be per function > section (with possibly a way to "share" them?). Or we need a way > to annotate function sections with "no need to re-map" as the > native nvptx sections don't need remapping and the others all use > the same map? Right, the latter: we know that sections coming from the native compiler already have the right modes and thus need no remapping, and the sections coming from the host compiler all need remapping (and will use the same mapping). Prefixes of per-function section names already carry the distinction (".gnu.lto_foo" vs. ".gnu.offload_lto_foo"). Alexander
Re: Improving code generation in the nvptx back end
On Fri, 17 Feb 2017, Thomas Schwinge wrote: > On Fri, 17 Feb 2017 14:00:09 +0100, I wrote: > > [...] for "normal" functions there is no reason to use the > > ".param" space for passing arguments in and out of functions. We can > > then get rid of the boilerplate code to move ".param %in_ar*" into ".reg > > %ar*", and the other way round for "%value_out"/"%value". This will then > > also simplify the call sites, where all that code "evaporates". That's > > actually something I started to look into, many months ago, and I now > > just dug out those changes, and will post them later. > > > > (Very likely, the PTX "JIT" compiler will do the very same thing without > > difficulty, but why not directly generate code that is less verbose to > > read?) In general you cannot use this shorthand notation because PTX interop guidelines explicitly prescribe using the .param space for argument passing. See https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/ , section 3. So at best GCC can use it for calls where interop concerns are guaranteed not to arise: when the callee is not externally visible, and does not have its address taken. And there's a question of how well it will keep working as time passes, since other compilers always use the verbose form (and thus the .reg calling style is not frequently exercised). Alexander
Re: GPLv3 clarification - what constitutes IR
On Mon, 6 Mar 2017, Richard Biener wrote: > >I am not a lawyer and this is not legal advice. > > > >Generating SPIR-V output would not cause that output to become GPLv3 > >licensed. However, linking the result against the GCC support > >libraries, as is normally required for any program generated by GCC, > >and then distributing the resulting executable to other people, would > >require you to use an eligible compilation process (as defined by the > >GCC Runtime Library Exception license that you cite). What this means > >in practice is that you can not take SPIR-V, do further processing it > >using a proprietary compiler, link the result with the GCC runtime > >libraries, and then distribute the resulting program to anybody else. > > > >I don't think it is necessary to determine whether SPIR-V is "target > >code" or "intermediate representation" to draw that conclusion. > > Note we already have the HSAIL and PTX backends which have the very same > (non-)problem. Both invoke a proprietary compiler for final compilation. Sorry, to me this statement sounds a bit ambiguous, so allow me to clarify: there is no mandatory invocation of proprietary code generators taking place as part of GCC invocation (I think there's none at all in case of HSAIL, and in case of PTX it's done for the purpose of syntax checking and can be omitted). Translation of HSAIL/PTX assembly to GPU binary code takes place when the host executable runs, on user's machine, by invoking corresponding libraries (in case of PTX it's NVIDIA's CUDA driver library). There is no support for translating HSAIL/PTX on the developer's machine and linking the resulting GPU binary code into GCC-produced executable. Hope that helps. Alexander
Re: [RFA] update ggc_min_heapsize_heuristic()
On Sun, 9 Apr 2017, Markus Trippelsdorf wrote: > The minimum size heuristic for the garbage collector's heap, before it > starts collecting, was last updated over ten years ago. > It currently has a hard upper limit of 128MB. > This is too low for current machines where 8GB of RAM is normal. > So, it seems to me, a new upper bound of 1GB would be appropriate. While the amount of available RAM has grown, so has the number of available CPU cores (counteracting RAM growth for parallel builds). Building under a virtualized environment with less-than-host RAM has also become more common, I think. Bumping it all the way up to 1GB seems excessive; how did you arrive at that figure? E.g. my recollection from watching a Firefox build is that most compiler instances need under 0.5GB (RSS). > Compile times of large C++ projects improve by over 10% due to this > change. Can you explain a bit more - what projects have you tested? 10+% looks surprisingly high to me. > What do you think? I wonder if it's possible to reap most of the compile time benefit with a more modest gc threshold increase? Thanks. Alexander
Re: Linux and Windows generate different binaries
On Fri, 14 Jul 2017, Yuri Gribov wrote: > I've also detect transitiveness violation compare_assert_loc > (tree-vrp.c), will send fix once tests are done. There are more issues still, see the thread starting at https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00899.html Alexander
Re: Linux and Windows generate different binaries
On Sat, 15 Jul 2017, Segher Boessenkool wrote: > Would it hurt us to use stable sorts *everywhere*? Stability (using the usual definition: keeping the original order of elements that compare equal) is not required to achieve reproducibility [*]. GCC could import or write its own non-randomized implementation of a sorting routine (potentially not stable, e.g. qsort-like), and that would be sufficient to eliminate this source of codegen differences. Alexander [*] nor would it be sufficient, given our current practice of passing invalid comparators to libc sort, at which point anything can happen due to undefined behavior
Re: Linux and Windows generate different binaries
On Sun, 16 Jul 2017, Segher Boessenkool wrote: > I am well aware, and that is not what I asked. If we would use stable > sorts everywhere How? There's no stable sort in libc and switching over to std::stable_sort would be problematic. The obvious approach is adding an implementation of a stable sort in GCC, but at that point it doesn't matter if it's stable, the fact it's deterministic is sufficient for reproducibility. > we would not have to think about whether some sorting routine has to be > stable or if we can get away with a (slightly slower) non-stable sort. I think you mean '(slightly faster)'. > That is just a plain bug, undefined behaviour even (C11 7.22.5/4). > Of course it needs to be fixed. I've posted patches towards this goal. Alexander
Re: Linux and Windows generate different binaries
On Sun, 16 Jul 2017, Segher Boessenkool wrote: > > How? There's no stable sort in libc and switching over to std::stable_sort > > would be problematic. > > Why? - you'd need to decide if the build time cost of extra 8000+ lines brought in by <algorithm> (per each TU) is acceptable; - you'd need to decide if the code size cost of multiple instantiations of template stable_sort is acceptable (or take measures to unify them); - you'd need to adapt comparators, as STL uses a different interface than C qsort; - you'd need to ensure it doesn't lead to a noticeable slowdown. (unrelated, but calls to std::stable_sort cannot be intercepted by Yuri's sortcheck, and of course my recent sortcheck-like patch entirely missed it too) > Sure. Some of our sorts in fact require stable sort though At the moment only bb-reorder appears to use std::stable_sort, is that what you meant, or are there more places? Alexander
Re: RFC: C extension to support variable-length vector types
On Thu, 3 Aug 2017, Richard Biener wrote: > Btw., I did this once to represent constrained expressions on > multi-dimensional arrays in SSA form. There control (aka loop) structure was > also implicit. Google for 'middle-end array expressions'. The C interface > was builtins and VLAs. The description at https://gcc.gnu.org/wiki/MiddleEndArrays refers to http://www.suse.de/~rguenther/guenther.pdf but this redirects to users.suse.de, for which DNS resolution fails. The Web Archive doesn't seem to have a copy. Any chance this might be available elsewhere? Thanks. Alexander
Re: RFC: Improving GCC8 default option settings
On Tue, 12 Sep 2017, Wilco Dijkstra wrote: > * Make -fno-math-errno the default - this mostly affects the code generated > for > sqrt, which should be treated just like floating point division and not set > errno by default (unless you explicitly select C89 mode). (note that this can be selectively enabled by targets where libm never sets errno in the first place, docs call out Darwin as one such target, but musl-libc targets have this property too) > * Make -fno-trapping-math the default - another obvious one. From the docs: > "Compile code assuming that floating-point operations cannot generate >user-visible traps." > There isn't a lot of code that actually uses user-visible traps (if any - > many CPUs don't even support user traps as it's an optional IEEE feature). > So assuming trapping math by default is way too conservative since there is > no obvious benefit to users. OTOH -O options are understood to _never_ sacrifice standards compliance, with the exception of -Ofast. I believe that's an important property to keep. Maybe it's possible to treat -fno-trapping-math similar to -ffp-contract=fast, i.e. implicitly enable it in the default C-with-GNU-extensions mode, keeping strict-compliance mode (-std=c11 as opposed to gnu11) untouched? In any case it shouldn't be hard to issue a warning if fenv.h functions are used when -fno-trapping-math/-fno-rounding-math is enabled. If the above doesn't fly, I believe adopting and promoting a single option for non-value-changing math optimizations (-fno-math-errno -fno-trapping-math, plus -fno-rounding-math -fno-signaling-nans when they're no longer default) would be nice. > * Make -fno-common the default - this was originally needed for pre-ANSI C, > but > is optional in C (not sure whether it is still in C99/C11). This can > significantly improve code generation on targets that use anchors for > globals > (note the linker could report a more helpful message when ancient code that > requires -fcommon fails to link). 
I think in ISO C, situations where -fcommon allows the link to succeed fall under undefined behavior, which in the GNU toolchain is defined to match the historical behavior. I assume the main issue with this is the amount of legacy code that would cause a link failure if -fno-common is made the default - thus, is there anybody in a position to trigger a full-distro rebuild with gcc patched to enable -fno-common, and compare before/after build failure stats? Thanks. Alexander
Re: atomic_thread_fence() semantics
On Thu, 19 Oct 2017, Andrew Haley wrote: > On 19/10/17 12:58, Mattias Rönnblom wrote: > > Did I misunderstand the semantics of > > atomic_thread_fence+memory_order_release? > > No, you did not. This looks like a bug. Please report it. This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640). Alexander
Re: atomic_thread_fence() semantics
On Fri, 20 Oct 2017, Torvald Riegel wrote: > On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote: > > On Thu, 19 Oct 2017, Andrew Haley wrote: > > > No, you did not. This looks like a bug. Please report it. > > > > This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640). > > The test case is invalid (I added some more detail as a comment on this > bug). Sorry, I was imprecise. To be clear, the issue I referred to above as the "bug [that was] fixed on trunk" is the issue Andrew Haley pointed out: when GCC transitioned from GIMPLE to RTL IR, empty RTL was emitted for the fence statement, losing its compile-time effect as a compiler memory barrier entirely. I agree that the testcase in the opening message of this thread is not valid in the sense that this reordering could not have changed the behavior of a conforming program, but the optimization that GCC performed here was entirely unintentional, not something the compiler is presently designed to do. Alexander
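The compile-time effect being discussed can be sketched in a small publish/consume idiom (a sketch only, not the testcase from the thread): the release fence must act as a compiler memory barrier, so the plain store may not sink below it; with the bug, empty RTL was emitted for the fence and that constraint vanished.

```c
#include <stdatomic.h>

int data;
atomic_int ready;

void publish (int v)
{
  data = v;  /* plain store */
  /* Must act as a compiler barrier: the store to 'data' may not be
     moved past this point.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&ready, 1, memory_order_relaxed);
}
```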
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Jakub Jelinek wrote: > The qsort checking failures are tracked in http://gcc.gnu.org/PR82407 > meta bug, 8 bugs in there are fixed, 2 known ones remain. Note that qsort_chk only catches really bad issues where the compiler invokes undefined behavior by passing an invalid comparator to qsort; differences between Glibc and musl-hosted compilers may remain because qsort is not guaranteed to be a stable sort: when sorting an array {A, B} where A and B are not bitwise-identical but cmp(A, B) returns 0, the implementation of qsort may yield either {B, A} or {A, B}, and that may cause codegen differences. (in other words, bootstrapping on a libc with randomized qsort has a good chance to run into bootstrap miscompares even if qsort_chk-clean)
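A small sketch of the distinction (names and struct are made up): a comparator that returns 0 for elements that are not bitwise identical leaves their relative order up to the qsort implementation, while adding a total tie-break pins down one answer on any conforming qsort, stable or not:

```c
#include <stdlib.h>

struct item { int key; int id; };

/* Returns 0 for items equal by 'key' but not bitwise identical:
   different qsort implementations may order such items differently.  */
static int cmp_by_key (const void *a, const void *b)
{
  const struct item *x = a, *y = b;
  return (x->key > y->key) - (x->key < y->key);
}

/* A total tie-break ('id' here) makes the result deterministic across
   libcs even though qsort itself gives no stability guarantee.  */
static int cmp_deterministic (const void *a, const void *b)
{
  const struct item *x = a, *y = b;
  if (x->key != y->key)
    return (x->key > y->key) - (x->key < y->key);
  return (x->id > y->id) - (x->id < y->id);
}
```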
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Jeff Law wrote: > THe key here is the results can differ if the comparison function is not > stable. That's inherent in the qsort algorithms. I'm afraid 'stable' is unclear/ambiguous in this context. I'd rather say 'if the comparator returns 0 if and only if the items being compared are bitwise identical'. Otherwise qsort, not being a guaranteed-stable sort, has a choice as to how reorder non-identical items that compare equal. > But, if the comparison functions are fixed, then the implementation > differences between the qsorts won't matter. > > Alexander Monokov has led an effort to identify cases where the > comparison functions do not provide a stable ordering and to fix them. No. The qsort_chk effort was limited to catching instances where comparators are invalid, i.e. lack anti-commutativity (may indicate A < B < A) or transitivity property (may indicate A < B < C < A). Fixing them doesn't imply making corresponding qsort invocations stable. > Some remain, but the majority have been addressed over the last year. > His work also includes a qsort checking implementation to try and spot > these problems as part of GCC's internal consistency checking mechanisms. Well, currently qsort_chk only checks for validity. It could in principle check for stability, but for each unstable sort it's rather hard to analyze whether it may ultimately cause codegen changes, and as a result patches may be impossible to justify. Alexander
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Joseph Myers wrote: > On Fri, 12 Jan 2018, Alexander Monakov wrote: > > > No. The qsort_chk effort was limited to catching instances where comparators > > are invalid, i.e. lack anti-commutativity (may indicate A < B < A) or > > transitivity property (may indicate A < B < C < A). Fixing them doesn't > > imply making corresponding qsort invocations stable. > > Incidentally, does it detect being invalid because of comparing A != A? If A appears twice at different positions in the array yes, otherwise no: qsort_chk never passes two equal pointers to the comparator. This is intentional: an earlier effort by Yuri Gribov tried to enforce reflexivity, but that caught instances where input arrays could not contain identical items. So under the assumption that qsort would never compare an element to itself, catching and fixing that wouldn't make a difference in practice - apart from complicating the comparator. Alexander
Re: GCC interpretation of C11 atomics (DR 459)
Hello, Although I wouldn't like to fight defending GCC's design change here, let me offer a couple of corrections/additions so everyone is on the same page: On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote: > > 1. Not consistent with clang/llvm which completely supports double-width > atomics for arm32, arm64, x86 and x86-64 making it possible to write portable > code (w/o specific extensions or assembly code) across all these architectures > (which is finally possible with C11!).The behavior of clang: if mxc16 is > specified, cmpxchg16b is generated for x86-64 (without any calls to > libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/staxp are > always generated. In my opinion, this is very logical and non-confusing. Note that there are more issues here than just behavior on readonly memory: you need to ensure that the whole program, including all static and shared libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level support to ensure that), or you'd need to be sure that it's safe to mix code compiled with different -mcx16 settings because it never happens to interop on wide atomic objects. (if you mix -mcx16 and -mno-cx16 code operating on the same 128-bit object, you get wrong code that will appear to work >99% of the time) > 3. The behavior is inconsistent even within GCC. Older (and more limited, less > portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic > and C11 -- do not. Moreover, __sync builtins are probably less suitable for > arm/arm64. Note that there's no "load" function in the __sync family, so the original concern about operations on readonly memory does not apply. > For these reasons, it may be a good idea if GCC folks reconsider past > decision. And just to clarify: if mcx16 (x86-64) is not specified during > compilation, it is totally OK to redirect to libatomic, and there make the > final decision if target CPU supports a given instruction or not. 
> But if it is specified, it makes sense for performance reasons and lock-freedom guarantees > to always generate it directly. You don't mention it directly, so just to make it clear for readers: on systems where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to lock-free RMW implementations if so. (I don't like this solution) Thanks. Alexander
Re: Fw: GCC interpretation of C11 atomics (DR 459)
On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote: > Well, if libatomic is already doing it when corresponding CPU feature is > available (i.e., effectively implementing operations using cmpxchg16b), I do > not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore > other code compiled without -mcx16 flag will go to libatomic. Inside > libatomic, it will detect that cmpxchg16b *is* available, thus making code > compiled with and without -mcx16 flag completely compatible on a given system. > Or do I miss something here? I'd say the main issue is that libatomic is not guaranteed to work like that. Today it relies on IFUNC for redirection, so you may (and not "will") get the desired behavior on Glibc (implying Linux), not on other OSes, and neither on Linux with non-GNU libc (nor on bare metal, for that matter). Alexander
Re: GCC interpretation of C11 atomics (DR 459)
On Mon, 26 Feb 2018, Szabolcs Nagy wrote: > > rmw load is only valid if the implementation can > guarantee that atomic objects are never read-only. OK, but that sounds like a matter of not emitting atomic objects into .rodata, which shouldn't be a big problem, if not for the backwards compatibility concern? > current implementations on linux (including clang) > don't do that, so an rmw load can observably break > conforming c code: a static global const object is > placed in .rodata section and thus rmw on it is a > crash at runtime contrary to c standard requirements. Note that in your example GCC emits 'x' as a common symbol; you need '... x = { 0 };' for it to appear in .rodata, > on an aarch64 machine clang miscompiles this code: [...] and then with new enough libatomic on Glibc this segfaults with GCC on x86_64 too due to IFUNC redirection mentioned in the other subthread. Alexander
Re: Why does IRA force all pseudos live across a setjmp call to be spilled?
On Fri, 2 Mar 2018, Peter Bergner wrote: > But currently ira-lives.c:process_bb_node_lives() has: > > /* Don't allocate allocnos that cross setjmps or any > call, if this function receives a nonlocal > goto. */ > if (cfun->has_nonlocal_label > || find_reg_note (insn, REG_SETJMP, > NULL_RTX) != NULL_RTX) > { > SET_HARD_REG_SET (OBJECT_CONFLICT_HARD_REGS (obj)); > SET_HARD_REG_SET (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj)); > } > > ...which forces us to spill everything live across the setjmp by forcing > the pseudos to interfere all hardregs. That can't be good for performance. > What am I missing? FWIW there's a similar issue with exceptions where IRA chooses memory for pseudos inside the loop even though the throwing call is outside: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242#c3 Alexander
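The correctness constraint behind that conservative spilling can be sketched at the source level: a local modified between setjmp and longjmp only reliably survives the longjmp if it is not cached in a call-clobbered register, which 'volatile' achieves in user code and spilling achieves inside the compiler (a sketch; the function name is made up):

```c
#include <setjmp.h>

static jmp_buf env;

int demo (void)
{
  /* Without 'volatile', 'acc' could legally be held in a register and
     have an indeterminate value after longjmp (C11 7.13.2.1); forcing
     it to memory is what IRA's all-hardregs conflict set does for
     every pseudo live across the setjmp.  */
  volatile int acc = 0;
  if (setjmp (env) == 0)
    {
      acc = 42;
      longjmp (env, 1);
    }
  return acc;
}
```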
Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
On Fri, 13 Apr 2018, Vivek Kinhekar wrote: > The mfence instruction with memory clobber asm instruction should create a > barrier between division and printf instructions. No, floating-point division does not touch memory, so the asm does not (and need not) restrict its motion. Alexander
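To see why: the "memory" clobber only constrains memory accesses, so a value living purely in registers may still move across the asm; storing it to memory (or passing it to the asm as an operand) is what makes the barrier apply to it. A portable sketch, using an empty asm as the compiler barrier in place of mfence:

```c
double slot;  /* memory that the clobber can actually order */

double divide_then_fence (double a, double b)
{
  double r = a / b;  /* lives in a register: not ordered by the clobber */
  slot = r;          /* a memory access: this store cannot sink below... */
  __asm__ __volatile__ ("" ::: "memory");  /* ...this compiler barrier */
  return slot;
}
```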
Re: Sched1 stability issue
On Wed, 4 Jul 2018, Kugan Vivekanandarajah wrote: > We noticed a difference in the code generated for aarch64 gcc 7.2 > hosted in Linux vs mingw. AFIK, we are supposed to produce the same > output. > > For the testacse we have (quite large and I am trying to reduce), the > difference comes from sched1 pass. If I disable sched1 the difference > is going away. > > Is this a known issue? Attached is the sched1 dump snippet where there > is the difference. The rank_for_schedule comparator used for qsort in the scheduler is known to be invalid; some issues have been fixed in gcc-8, but some remain (you can search the bugzilla for qsort_chk issues). Since the comparator is invalid, different qsort implementations reorder the ready list differently. In gcc-9 qsort calls use gcc_qsort instead and thus wouldn't diverge. Alexander
Re: ChangeLog's: do we have to?
On Thu, 5 Jul 2018, Richard Kenner wrote: > > After 20 years of hacking on GCC I feel like I have literally wasted > > days of my life typing out ChangeLog entries that could have easily been > > generated programmatically. > > > > Can someone refresh my memory here, what are the remaining arguments for > > requiring ChangeLog entries? > > I take the position that any ChangeLog entry that could have been generated > automatically is not a good one. Yes, the list of functions changed could > be generated automatically. (Isn't there actually a way to do that in > emacs? I could have sworn I saw it, though I never used it much.) But the > *purpose* of the change to that function can't be generated automatically > because the ChangeLog entry is the only place that should have that > information. The comments should say what a function currently does, but > isn't the place for a history lesson. That data belongs only in ChangeLog. GCC ChangeLogs don't record the purpose of the change. They say what changed, but not why. As far as I know, this ChangeLog style helped in pre-Subversion times when source control tools tracked changes per-file, not per-tree. Alexander
Re: ChangeLog's: do we have to?
On Mon, 23 Jul 2018, Segher Boessenkool wrote: > For example for .md files you can use > > [diff "md"] > xfuncname = "^\\(define.*$" > > in your local clone's .git/config > > and > > *.md diff=md > > in .gitattributes (somewhere in the source tree). Not necessarily in the source tree: individual users can put that into their $XDG_CONFIG_HOME/git/attributes or $HOME/.config/git/attributes. Likewise the previous snippet can go into $HOME/.gitconfig rather than each individual cloned tree. (the point is, there's no need to split this quality-of-life change between the repository and the user's setup - it can be done once by a user and will work for all future checkouts) Alexander
Re: Question about GCC benchmarks and uninitialized variables
On Tue, 24 Jul 2018, Fredrik Hederstierna wrote: > So my question is how to approach this problems when doing benchmarking, > ofcourse we want the benchmark to mirror as near as 'real life' code as > possible. But if code contains real bugs, and issues that could cause > unpredictable code generation, should such code be banned from benchmarking, > since results might be misleading? Well, all benchmarks are going to be imperfect reflections of real-life workloads in the first place, so their bugs just increase the degree to which they are misleading. When a new compiler version starts to treat some undefined piece of code differently, it can cause a range of effects from code size perturbations as in your case, to completely invalidating the benchmark as in spec2k6 x264 benchmark's case (where GCC exploited undefined behavior in a loop, turning it into an infinite loop that eventually segfaulted). Perhaps even though results on individual benchmarks can vary wildly, aggregated results across a wide range of non-toy benchmarks may be indicative of ... something, because they are unlikely to all exhibit the same "bugs". > On the other hand, the compiler should > generate best code for any input? Engineering effort is limited, so it's probably better to go for generating good code for inputs that are likely to resemble actively used code (and in actively used&maintained code, bugs can be reported and fixed) :) > What do you think, should benchmarking code not being allowed to have eg > warnings like -Wuninitialized and maybe -Wmaybe-uninitialized? Are there more > warnings that indicate unpredictable code generations due to bad code, or are > the root cause that these are 'bugs', and we should not allow real bugs at all > in benchmarking code? A blanket ban on warnings won't work: they have false positives (especially the -Wmaybe- one), and there exists code that validly uses uninitialized data. 
I don't have such a striking example for scalar variables, but for uninitialized arrays there's this sparse set algorithm (which GCC itself also uses): https://research.swtch.com/sparse I think good benchmark sets should be able to evolve to account for newly discovered bugs, rather than remain frozen (which sounds like a reason to become obsolete sooner rather than later). Alexander
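The sparse-set trick referenced above, in outline (a sketch following the Briggs/Torczon scheme described at that link; both arrays start uninitialized, and membership is decided by a two-way cross-check, so stale garbage in 'sparse' never produces a false positive - though a checker like MemorySanitizer will rightly flag the reads of indeterminate values):

```c
#include <stdlib.h>

struct sparse_set { unsigned *dense, *sparse, n; };

static struct sparse_set *ss_new (unsigned universe)
{
  struct sparse_set *s = malloc (sizeof *s);
  s->dense = malloc (universe * sizeof *s->dense);   /* uninitialized */
  s->sparse = malloc (universe * sizeof *s->sparse); /* uninitialized */
  s->n = 0;
  return s;
}

static int ss_contains (const struct sparse_set *s, unsigned v)
{
  unsigned i = s->sparse[v];            /* may read garbage...          */
  return i < s->n && s->dense[i] == v;  /* ...which this check rejects  */
}

static void ss_insert (struct sparse_set *s, unsigned v)
{
  if (!ss_contains (s, v))
    {
      s->dense[s->n] = v;
      s->sparse[v] = s->n++;
    }
}
```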
Re: Question related to GCC structure variable assignment optimization
On Fri, 27 Jul 2018, keshav tilak wrote: > This leads to GCC compiler issuing a call to `memcpy@PLT()' in function bar1. > > I want to create a position independent executable from this source > and run this on > a secure environment which implements ASLR and the loader disallows any binary > which has PLT/GOT based relocations. The linker should be able to relax those nominally-PLT calls to direct calls since it emits a PIE and a local definition is available. Therefore the loader (the dynamic linker) should not get a GOT relocation for this call. Are you sure the linker does not perform this relaxation in your case? If so, that's an issue (missed optimization) in the linker. That said, GCC itself should be able to emit direct calls, as in some cases, most notably the 32-bit x86 ABI, going through the PLT causes a size/speed penalty that the linker would not be able to clean up. What should work is (re)declaring memcpy with hidden visibility: __attribute__((visibility("hidden"))) void *memcpy(void *, const void *, size_t); or via the pragma, but today this doesn't work. I've opened a GCC bug report for this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86695 Alexander
Re: Questions related to creation of libgcov.so
On Fri, 3 Aug 2018, Martin Liška wrote: > I'm attaching current patch, so any comment is welcomed. Please consider passing -Wl,-z,now when linking the new shared library: gcov has a few thread-local variables that may be accessed in async-signal context, and Glibc has bugs related to lazy binding and/or allocation (not sure which exactly) of TLS symbols. Lazy binding doesn't buy anything for gcov, and that would be a cheap way to avoid a regression. On the other hand, it's not gcov's responsibility to work around long-standing bugs in Glibc, so I don't insist. Alexander
Questions/topics for UB BoF
Hello, For the upcoming Cauldron, I've registered a BoF on treatment of undefined behavior in GCC. To "promote" the BoF, collect some input prior to the event, and indirectly indicate my interests, I am offering the following light-hearted questionnaire. I would ask that you mail responses directly to my address, not the list. If you prefer Google Forms, please fill https://goo.gl/forms/1sLhMtLLhorvzDm42 Development policies touched upon in items 2-6 are what I'd invite for discussion at the BoF. Thank you. Alexander You are wearing: [*] Distribution maintainer's hat [*] GCC developer's hat [*] GCC user's hat # Undefined behavior and compiler diagnostics 1. Some undefined behavior is relevant only at translation time, not execution time (for example: an unmatched ' or " character is encountered on a logical source line during tokenization). GCC typically issues a diagnostic when encountering such UB. Should GCC rather make use of the undefinedness in order to make the compiler simpler and faster? [*] Yes [*] No Details: _ 2. In some instances GCC is able to produce a warning for code that is certain to invoke undefined behavior at run time. Should GCC strive to diagnose that as much as practical, even at the cost of the compiler getting more complex? [*] Yes [*] No Details: _ 3. Sometimes GCC is also able to issue a warning for code that is likely (but not certain) to invoke undefined behavior. As such warnings may have false positives, should GCC nevertheless try to issue them too, where practical? [*] Yes [*] No Details: _ 4. When GCC assumes absence of undefined behavior in optimization, it leads to "surprising" transformations. This is generally not reportable to the user; however, -Waggressive-loop-optimizations was added for one such case. If you are familiar with that flag, would you say overall it was worth it, and more warnings in that spirit would be nice to have? [*] Yes [*] No Details: _ # Undefined behavior and compiler optimizations 5. 
When GCC optimizations encounter code that is certain to invoke undefined behavior, they do not react consistently: at the moment we have a wide range of behaviors, like not transforming the code at all, or replacing it with a trap instruction. Should GCC apply one tactic consistently, and if so, which? [*] Yes [*] No Details: _ 6. Some GCC optimization/analysis functionality assumes absence of UB, and cannot detect that code invokes undefined behavior. Should GCC prefer to assume presence of UB in analysis, which would allow diagnosing it or transforming it to a runtime trap? [*] Yes [*] No Details: _
Re: ChangeLog files: 8 spaces vs. a tab
On Mon, 27 Aug 2018, Martin Liška wrote: > Hi. > > Recently I've noticed that I have wrongly set up my editor and > I installed quite a few changes where my changelog entries > have 8 spaces instead of a tab. > > I grepped all ChangeLog files for that (ignoring ChangeLog-{year} files) > and I see: Note that some files in your list appear only because they have 12 spaces indenting the second author in a multi-author change, which is intentional and probably does not need changing. Alexander
Re: Scheduling automaton question
On Fri, 11 Feb 2011, Bernd Schmidt wrote: > Suppose I have two insns, one reserving (A|B|C), and the other reserving > A. I'm observing that when the first one is scheduled in an otherwise > empty state, it reserves the A unit and blocks the second one from being > scheduled in the same cycle. This is a problem when there's an > anti-dependence of cost 0 between the two instructions. > > Vlad - two questions. Is this behaviour what you would expect to happen, > and how much work do you think would be involved to fix it (i.e. make > the first one transition to a state where we can still reserve any two > out of the three units)? Could you please clarify a bit: would the modified behavior match what your target CPU does? The current behavior matches CPUs without lookahead in instruction dispatch: the first insn goes to the first matching execution unit (A), the second has to wait. Alexander
Re: RTL Conditional and Call
On Sat, 31 Dec 2011, Matt Davis wrote: > Hi, > I am having an RTL problem trying to make a function call from a > COND_EXEC rtx. The reload pass has been called, and very simply I > want to compare on an 64bit x86 %rdx with a specific integer value, > and if that value is true, my function call executes. I can call the > function fine outside of the conditional, but when I set it in the > conditional expression, I get the following error: > > test.c:6:1: error: unrecognizable insn: Indeed, x86 does not have a "conditional call" instruction. You would have to generate the call in a separate basic block and add a conditional branch instruction around it. You can reference the following code, which attempts to convert any COND_EXECs to explicit control flow: http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02383.html (but you will probably need to additionally generate comparison instructions). Hope that helps, Alexander
Re: Problems with selective scheduling
Hi, On Tue, 27 Oct 2009, Markus L wrote: Hi, I recently read the articles about the selective scheduling implementation and found it quite interesting, I especially liked the idea of how neatly software pipelining is integrated. The target I am working on is a VLIW DSP so obviously these things are very important for good code generation. However when compiling the following C function with -fselective-scheduling2 and -fsel-sched-pipelining I face a few problems. Increase verbosity of scheduler dumps to obtain more useful information by passing the following flags: -fdump-rtl-sched1-details -fdump-rtl-sched2-details -fsched-verbose=6 It may also be useful to compare the scheduler behaviour on your target to ia64. Note that building full-fledged cross-compiler wouldn't be necessary, just 'configure --target=ia64-linux && make all-gcc' and invoke gcc/cc1 (to produce dumps, change 'sched2' to 'mach' in the line above, since sel-sched is invoked from machine-reorg pass on ia64). More comments below. long dotproduct2(int *a, int *b) { int i; long s=0; for (i = 0; i < 256; i++) s += (long)*a++**b++; return s; } The output I get from sched2 pass is: ... Scheduling region 0 Scheduling on fences: (uid:32;seqno:6;) scanning new insn with uid = 80. deleting insn with uid = 80. Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted Scheduling region 1 Scheduling on fences: (uid:72;seqno:1;) scanning new insn with uid = 81. deleting insn with uid = 81. Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted Scheduling region 2 Scheduling on fences: (uid:65;seqno:1;) scanning new insn with uid = 82. deleting insn with uid = 82. 
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted (note 26 27 65 2 NOTE_INSN_FUNCTION_BEG) (insn:TI 65 26 30 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16]) (reg/f:QI 32 sp)) 12 {pushqi1} (nil)) (insn 30 65 62 2 dotprod2.c:2 (set (reg/v:HI 16 a0l [orig:62 s ] [62]) (const_int 0 [0x0])) 6 {*zero_load_hi} (expr_list:REG_EQUAL (const_int 0 [0x0]) (nil))) (insn 62 30 66 2 dotprod2.c:2 (set (reg:QI 2 r2 [70]) (const_int 256 [0x100])) 5 {*constant_load_qi} (expr_list:REG_EQUAL (const_int 256 [0x100]) (nil))) (insn:TI 66 62 67 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16]) (reg/f:QI 33 dp)) 12 {pushqi1} (nil)) (insn:TI 67 66 69 2 dotprod2.c:2 (set (reg/f:QI 33 dp) (reg/f:QI 32 sp)) 10 {*move_regs_qi} (nil)) (note 69 67 39 2 NOTE_INSN_PROLOGUE_END) (code_label 39 69 31 3 2 "" [1 uses]) (note 31 39 34 3 [bb 3] NOTE_INSN_BASIC_BLOCK) (note 34 31 32 3 NOTE_INSN_DELETED) (insn:TI 32 34 33 3 dotprod2.c:10 (set (reg:QI 19 a1h [67]) (mem:QI (post_inc:QI (reg/v/f:QI 1 r1 [orig:65 b ] [65])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 1 r1 [orig:65 b ] [65]) (nil))) (insn 33 32 35 3 dotprod2.c:10 (set (reg:QI 18 a1l [68]) (mem:QI (post_inc:QI (reg/v/f:QI 0 r0 [orig:64 a ] [64])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 0 r0 [orig:64 a ] [64]) (nil))) (insn 35 33 61 3 dotprod2.c:10 (set (reg/v:HI 16 a0l [orig:62 s ] [62]) (plus:HI (mult:HI (sign_extend:HI (reg:QI 19 a1h [67])) (sign_extend:HI (reg:QI 18 a1l [68]))) (reg/v:HI 16 a0l [orig:62 s ] [62]))) 23 {multacc} (expr_list:REG_DEAD (reg:QI 19 a1h [67]) (expr_list:REG_DEAD (reg:QI 18 a1l [68]) (nil (jump_insn:TI 61 35 75 3 dotprod2.c:8 (parallel [ (set (pc) (if_then_else (ne (reg:QI 2 r2 [70]) (const_int 1 [0x1])) (label_ref:QI 39) (pc))) (set (reg:QI 2 r2 [70]) (plus:QI (reg:QI 2 r2 [70]) (const_int -1 [0x]))) (use (const_int 255 [0xff])) (use (const_int 255 [0xff])) (use 
(const_int 1 [0x1])) ]) 43 {doloop_end_internal} (expr_list:REG_BR_PROB (const_int 9899 [0x26ab]) (nil))) (note 75 61 70 4 [bb 4] NOTE_INSN_BASIC_BLOCK) (note 70 75 72 4 NOTE_INSN_EPILOGUE_BEG) ... The loop body is not correctly scheduled, the TImode flags indicate that the entire loop-body will be executed in a single cycle as a VLIW packet and this will not work since no loop-prologue code has been emitted. I suspect your machine description says that dependency between loads and multiply-add has zero latency, thus allowing the scheduler to place them into one instruction group. Grep for various comments about tick_check_p function. In verbose scheduler dumps, there should be something like Expr 35 is not ready yet until cycle 2 No best expr found! Finished a cycle. Current cycle = 2 My (probably quite limited) understanding of what should happen is that: 1. the fence is placed at (bef
Re: libgomp forces -Werror even when top level configure --disable-werror
In the broader scope, there are two separate problems here. One is that libgomp indeed does not honor --disable-werror. However, --disable-werror, even if it worked for libgomp, would be too big a hammer to work around the second (real) problem, and not very useful for development builds anyway. The second issue is how libgomp's configure detects availability of pthread{_attr,}_{get,set}affinity_np. It uses a link test and does not pass -Wall -Werror to the compiler; thus, the libgomp build fails if those functions are declared in /usr/include/nptl/pthread.h but not in /usr/include/pthread.h (libgomp only includes the latter). This can be fixed by making configure use -Werror (I believe adding a --with-pthread= option is out of the question). Bootstraps on affected systems should probably use make CFLAGS_FOR_TARGET='-g -O2 -I/usr/include/nptl' as a clean workaround. -- Alexander Monakov
Re: Modulo Scheduling
On Tue, 2 Feb 2010, Cameron Lowell Palmer wrote: Does Modulo Scheduling work on x86 platforms? I have tried adding in various versions of the -fmodulo-sched option and get the exact same output with or without. The application is a very simplistic matrix multiply without dependencies. No, at present SMS is not able to schedule any loops on x86 at all. This is due to an implementation detail: SMS operates on loops that end with a decrement-and-branch instruction, and GCC does not generate such instructions on x86. Sorry. Alexander Monakov
Re: Understanding Scheduling
On Fri, 19 Mar 2010, Ian Bolton wrote: > Let's start with sched1 ... > > For our architecture at least, it seems like Richard Earnshaw is > right that sched1 is generally bad when you are using -Os, because > it can increase register pressure and cause extra spill/fill code when > you move independent instructions in between dependent instructions. Please note that Vladimir Makarov implemented register pressure-aware sched1 for GCC 4.5, activated with -fsched-pressure. I thought I should mention this because your e-mail omits it completely, so it's hard to tell whether you tested it. Best regards, Alexander Monakov
Re: bug linear loop transforms
[gcc-bugs@ removed from Cc:] On Mon, 29 Mar 2010, Alex Turjan wrote: > I'm writing to you regarding a possible bug in the linear loop transform. > The bug can be reproduced by compiling the attached C file with gcc 4.5.0 > (20100204, 20100325) on an x86 machine. > > The compiler flags that reproduce the error are: > -O2 -fno-inline -fno-tree-ch -ftree-loop-linear > > If the compiler is run with: > -O2 -fno-inline -fno-tree-ch -fno-tree-loop-linear > then the produced code is correct. Instead of writing to a mailing list, please file a bug in GCC Bugzilla, as described in http://gcc.gnu.org/bugs/ . Posting bug reports to gcc-bugs@ does not register them in Bugzilla, and thus is not recommended. Thanks. Alexander Monakov
Re: (known?) Issue with bitmap iterators
On Thu, 25 Jun 2009, Jeff Law wrote: I wasn't suggesting we make them "safe" in the sense that one could modify the bitmap and everything would just work. Instead I was suggesting we make the bitmap readonly for the duration of the iterator and catch attempts to modify the bitmap -- under the control of ENABLE_CHECKING of course. If that turns out to still be too expensive, it could be conditional on ENABLE_BITMAP_ITERATOR_CHECKING or whatever, which would normally be off. My biggest concern would be catching all the exit paths from the gazillion iterators we use and making sure they all reset the readonly flag. Ick... Unless I'm missing something, one can avoid that by checking whether the bitmap has been modified in the iterator increment function. To be precise, what I have in mind is: 1. Add bool field `modified_p' in bitmap structure. 2. Make iterator setup functions (e.g. bmp_iter_set_init) reset it to false. 3. Make functions that modify the bitmap set it to true. 4. Make iterator increment function (e.g. bmp_iter_next) assert !modified_p. This approach would also allow modifying the bitmap on an iteration ending with break, which IMHO is fine. -- Alexander Monakov
Re: (known?) Issue with bitmap iterators
On Fri, 26 Jun 2009, Joe Buck wrote: On Fri, Jun 26, 2009 at 03:38:31AM -0700, Alexander Monakov wrote: 1. Add bool field `modified_p' in bitmap structure. 2. Make iterator setup functions (e.g. bmp_iter_set_init) reset it to false. 3. Make functions that modify the bitmap set it to true. 4. Make iterator increment function (e.g. bmp_iter_next) assert !modified_p. Sorry, it doesn't work. Function foo has a loop that iterates over a bitmap. During the iteration, it calls a function bar. bar modifies the bitmap, then iterates over the bitmap. It then returns to foo, which is in the middle of an iteration, which it continues. The bitmap has been modified (by bar), but modified_p was reset to false by the iteration that happened at the end of bar. Good catch, thanks! Let me patch that up a bit: 1. Add int field `generation' in bitmap structure, initialized arbitrarily at bitmap creation time. 2. Make iterator setup functions save initial generation count in the iterator (or generation counts for two bitmaps, if needed). 3. Modify generation count when bitmap is modified (e.g. increment it). 4. Verify that saved and current generations match when incrementing the iterator.
Re: A question regarding bundling and NOPs insertion for VLIW architecture
On Tue, 11 May 2010, Revital1 Eres wrote: > > Hello, > > I have a question regarding the process of bundling and NOPs insertion for > VLIW architecture > and I appreciate your answer: > > I am calling the second scheduler from the machine reorg pass; similar to > what is done for IA64. > I now want to handle the bundling and NOPs insertion for VLIW architecture > with issue rate of 4 > and I want to make sure I understand the process: > > IIUC I can use the insns with TImode that the scheduler marked to indicate > a new cycle, so > the question is how many nops to insert after that cycle, if any. > I noticed the following approach that was used in c6x which is mentioned > in: > http://archiv.tu-chemnitz.de/pub/2004/0176/data/index.html > > "NOP Insertion and Parallel Scheduling > If the scheduler is run, it checks dependencies and tries to schedule the > instructions so as to > minimize the processing cycles. The hooks TARGET_SCHED_REORDER(2) are > considered > to reorder the instructions in the ready queue in case the back end wants to > override the > default rules. I used the hooks to memorize the program cycle the > instruction is scheduled. > This value is stored in a hash table I created for that purpose. From the > cycle information > I can later determine how many NOPs have to be inserted between two > instructions. This > value then overrides the attribute value." > > IA64 seems to have a much more complicated approach for the bundling and NOPs > insertion and I wonder > whether the reason is due to IA64-specific issues, or there is something I'm > missing in the approach > mentioned above? From skimming the paper I understand that the target processor is a 4-wide VLIW with little or no instruction issue constraints (which insn type may go in which bundle slot) and uses a non-interlocked pipeline, thus requiring NOP insertion to avoid dependencies. IA64 is different in both regards. 
Bundling on ia64 is complicated because not all combinations of insn types are possible in a bundle (a bundle contains three insns), and instruction issue boundaries can appear mid-bundle (the ia64 architecture uses stop bits to indicate parallel issue boundaries, and some bundle kinds have a stop bit in the middle). Incidentally, ia64 does not need NOP insertion to avoid data dependency violations, because it uses scoreboarding to track register dependencies. Thus, NOP insertion is only needed to satisfy bundling constraints. I think the ia64 port in GCC uses dynamic programming to perform bundling because it would be much harder to extract the instruction placement from the automaton (which I think tracks all of the mentioned constraints internally). Alexander
Re: Massive performance regression from switching to gcc 4.5
Hi, On Fri, 25 Jun 2010, Jan Hubicka wrote: > I would also be very interested to know how profile feedback works in this > case > (and why it does not work in previous releases). Profiling multi-threaded programs needs -fprofile-correction, which appeared only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the initial message only speaks about 4.3 and 4.5). Mozilla code also triggered a bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified their code to never leave non-default alignment at the end of the TU (I have posted a patch for the libgcov bug [1], but it was not reviewed and does not apply anymore due to build_constructor changes). [1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html Cheers, Alexander
Re: CFG traversal
On Tue, 6 Jul 2010, Alex Turjan wrote: > Hi, > Is there functionality in gcc based on which the CFG can be traversed in > such a way that a node gets visited once all of its predecessors have been > visited? (Assuming you mean when there are no loops) Yes, see post_order_compute in cfganal.c. It computes topological sort order (which is what you need) in reverse: nodes that must be visited last come first in the array. Hope that helps. Alexander
Re: A question about doloop
On Mon, 26 Jul 2010, Revital1 Eres wrote: > > Hello, > > Doloop optimization fails to be applied on the following kernel from > testcase sms-4.c with mainline (-r 162294) due to a 'Possible infinite > iteration > case' message; taken from the loop2_doloop dump. (please see below). > With an older version of gcc (-r 146278) doloop was applied successfully, > and I would appreciate an explanation of the change in behavior. This may be due to changes in induction variable selection and the ability of loop2_doloop to discover the number of iterations for different variants of IV selection (which is not trivial when the 'i' variable is eliminated and the loop boundary is expressed with a pointer comparison). See the PR32283 [1] audit trail for an example of a related problem that was discussed before. It is possible that something is missing in simplify-rtx.c so that the 'infinite if' condition cannot be simplified and proven to be always false. Zdenek once had to improve simplify-rtx.c for this reason, as the audit trail shows. [1] http://gcc.gnu.org/PR32283 Hope that helps, Alexander
Re: Restrict qualifier still not working?
On Mon, 2 Aug 2010, Bingfeng Mei wrote: > Hi, > I ran a small test to see how the trunk/4.5 works > with the rewritten restrict qualified pointer code. But it doesn't > seem to work on either x86-64 or our port. > > tst.c: > void foo (int * restrict a, int * restrict b, > int * restrict c, int * restrict d) > { > *c = *a + 1; > *d = *b + 1; > } [snip] > foo: > .LFB0: > .cfi_startproc > movl (%rdi), %eax > addl $1, %eax > movl %eax, (%rdx) > movl (%rsi), %eax > addl $1, %eax > movl %eax, (%rcx) > ret > > In the final generated code, the second load should have > been moved before the first store if restrict qualifiers > are handled correctly. > > Am I missing something here? Thanks. The second load is moved for me with -fschedule-insns, -frename-registers or -fselective-scheduling2 (all of which are disabled by default on x86-64 -O2). Without those flags, the second scheduler alone cannot lift the load due to the dependency on %eax. Hope that helps. Alexander
RE: Restrict qualifier still not working?
On Tue, 3 Aug 2010, Bingfeng Mei wrote: > Thanks, I can reproduce it with the trunk compiler but not 4.5.0. > Do you know how alias sets are represented and used now? I'm not aware of any changes regarding alias sets. > It used to > be that each alias set was assigned a unique number and there wouldn't > be a dependence edge drawn between different alias sets. Are you implying that restricted pointers would get different alias set numbers before but don't anymore? I don't think this has changed, but I may be mistaken (hopefully Richard can clarify this). > It seems not > to be the case anymore. [2 *a_1(D)+0 S4 A32] The second field > must play a role in disambiguating the memory access. Yes, this is the MEM_EXPR, which is used to invoke the tree alias oracle from RTL. See mem_refs_may_alias_p and its invocations in {true,anti,write}_dependence. It should be much more precise than alias set numbers (but they are still used nevertheless). > BTW, why are these two intermediate variables both assigned to eax > without these non-default options? This example has no register pressure. > It looks like an issue with IRA. Well, RA is quite complicated even without considering issues like this. Thanks to Vladimir's pressure-sensitive scheduling patches, pre-RA scheduling should solve this. Alexander
Re: pipeline description
On Thu, 11 Nov 2010, Ian Lance Taylor wrote: > roy rosen writes: > > > If I have two insns: > > r2 = r3 > > r3 = r4 > > It seems to me that the dependency analysis creates a dependency > > between the two and prevents parallelization. Although there is a > > dependency (because of r3), I want GCC to parallelize them together, > > since, if the insns are processed together, the old value of r3 is used > > for the first insn before it is written by the second insn. > > How do I let GCC know about these things (when exactly each operand is > > used and when it is written)? > > Is it in these hooks? > > In which port can I see a good example of that? > > I was under the impression that an anti-dependence in the same cycle was > permitted to execute. But perhaps I am mistaken. No, you are right. That is the norm when compiling for ia64, for example. I don't think the backend should specifically care about it: the anti-dependency gets zero latency, and the scheduler is able to issue the second instruction on the same cycle after issuing the first. Alexander
Re: Concerns regarding the -ffp-contract=fast default
On Thu, 14 Sep 2023, Florian Weimer via Gcc wrote: > While rebuilding CentOS Stream with -march=x86-64-v3, I rediscovered > several packages had test suite failures because x86-64 suddenly gained > FMA support. I say “rediscovered” because these issues were already > visible on other architectures with FMA. > > So far, our package/architecture maintainers had just disabled test > suites or had built the package with -ffp-contract=off because the > failures did not reproduce on x86-64. I'm not sure if this is the right > course of action. > > GCC contraction behavior is rather inconsistent. It does not contract x > + x - x without -ffast-math, for example, although I believe it would be > permissible under the rules that enable FMA contraction. This whole > thing looks suspiciously like a quick hack to get a performance > improvement from FMA instructions (sorry). > > I know that GCC 14 has -ffp-contract=standard. Would it make sense to > switch the default to that? If it fixes those package test suites, it > probably has an observable performance impact. 8-/ Note that with =standard FMA contraction is still allowed within an expression: the compiler will transform 'x * y + z' to 'fma(x, y, z)'. The difference between =fast and =standard is contraction across statement boundaries. So I'd expect some of the test suite failures you speak of to remain with =standard as opposed to =off. I think it's better to switch both C and C++ defaults to =standard, matching Clang, but it needs a bit of legwork to avoid regressing our own testsuite for targets that have FMA in the base ISA. (personally I'd be on board with switching to =off even) See https://gcc.gnu.org/PR106902 for a worked example where -ffp-contract=fast caused a correctness issue in a widely used FOSS image processing application that was quite hard to debug. Obviously -Ofast and -ffast-math will still imply -ffp-contract=fast if we make the change, so SPEC scores won't be affected. Thanks. Alexander
Re: Concerns regarding the -ffp-contract=fast default
Hi Florian, On Thu, 14 Sep 2023, Alexander Monakov wrote: > > On Thu, 14 Sep 2023, Florian Weimer via Gcc wrote: > > > While rebuilding CentOS Stream with -march=x86-64-v3, I rediscovered > > several packages had test suite failures because x86-64 suddenly gained > > FMA support. I say “rediscovered” because these issues were already > > visible on other architectures with FMA. > > > > So far, our package/architecture maintainers had just disabled test > > suites or had built the package with -fp-contract=off because the > > failures did not reproduce on x86-64. I'm not sure if this is the right > > course of action. > > > > GCC contraction behavior is rather inconsistent. It does not contract x > > + x - x without -ffast-math, for example, although I believe it would be > > permissible under the rules that enable FMA contraction. This whole > > thing looks suspiciously like a quick hack to get a performance > > improvement from FMA instructions (sorry). > > > > I know that GCC 14 has -fp-contract=standard. Would it make sense to > > switch the default to that? If it fixes those package test suites, it > > probably has an observable performance impact. 8-/ > > Note that with =standard FMA contraction is still allowed within an > expression: the compiler will transform 'x * y + z' to 'fma(x, y, z)'. > The difference between =fast and =standard is contraction across > statement boundaries. So I'd expect some test suite failures you speak of > to remain with =standard as opposed to =off. > > I think it's better to switch both C and C++ defaults to =standard, > matching Clang, but it needs a bit of leg work to avoid regressing > our own testsuite for targets that have FMA in the base ISA. > > (personally I'd be on board with switching to =off even) > > See https://gcc.gnu.org/PR106902 for a worked example where -ffp-contract=fast > caused a correctness issue in a widely used FOSS image processing application > that was quite hard to debug. 
> > Obviously -Ofast and -ffast-math will still imply -ffp-contract=fast if we > make the change, so SPEC scores won't be affected. Is this the sort of information you were looking for? If you're joining the Cauldron and could poll people about changing the default, I feel that could be helpful. One of the tricky aspects is what to do under -std=cNN, which implies -ffp-contract=off; "upgrading" it to =standard would introduce FMAs. Also, I'm a bit unsure what you were implying here: > I know that GCC 14 has -ffp-contract=standard. Would it make sense to > switch the default to that? If it fixes those package test suites, it > probably has an observable performance impact. 8-/ The "correctness trumps performance" principle still applies, and -ffp-contract=fast (the current default outside of -std=cNN) is known to cause correctness issues and violates the C language standard. And -ffast[-and-loose]-math is not going away. Thanks. Alexander
Re: Concerns regarding the -ffp-contract=fast default
On Mon, 18 Sep 2023, Florian Weimer via Gcc wrote: > x - x is different because replacing it with 0 doesn't seem to be a > valid contraction because it's incorrect for NaNs. x + x - x seems to > be different in this regard, but in our implementation, there might be a > quirk about sNaNs and qNaNs. Sorry, do you mean contracting 'x + x - x' to 'x'? That is not allowed due to a different result and the lack of an FP exception for infinities. Contracting 'x + x - x' to fma(x, 2, -x) would be fine. Alexander