[Bug tree-optimization/94527] RFE: Add an __attribute__ that marks a function as freeing an object
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

Linus Torvalds changed: CC added torvalds@linux-foundation.org

--- Comment #2 from Linus Torvalds ---
One silly example of potential for dead store elimination would be
kernel/async.c: async_run_entry_fn(), where we have

    /* 2) remove self from the pending queues */
    spin_lock_irqsave(&async_lock, flags);
    list_del_init(&entry->domain_list);
    list_del_init(&entry->global_list);

    /* 3) free the entry */
    kfree(entry);
    atomic_dec(&entry_count);

and while it's good form to do "list_del_init()" on those fields in entry,
the fact that we then do "kfree(entry)" right afterwards means that the
stores that re-initialize the entry list are dead.

If gcc knew that "kfree(entry)" de-allocates the entry pointer, it could
remove them, the same way gcc already removes dead stores to automatic
variables.

But again: warnings about mis-use are likely more important than DSE. We have
had the silly kinds of bugs where we move code around, and some access
remains after a free. We have good tools (both static and dynamic) to find
those after-the-fact, of course, but the compiler warning about stupidity is
even better.
[Bug tree-optimization/94527] RFE: Add an __attribute__ that marks a function as freeing an object
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

--- Comment #4 from Linus Torvalds ---
(In reply to Jeffrey A. Law from comment #3)
> GCC already knows that free() "kills" the pointed-to memory and should be
> doing DSE with that in mind. It doesn't however know that other functions
> have free-like semantics, so it wouldn't do so for kfree.

Oh, ok, so the logic already exists, just not the interface to tell anybody
else.

I suspect even non-kernel users might have wrappers around free() that would
be able to use a "this acts like free()" marker.

> With regard to the warnings. When we were investigating use-after-free and
> double-free diagnostics it was our conclusion that to do any kind of
> reasonable job you really have to do a whole program analysis. Otherwise
> it's just a toy. As a result the focal point for those diagnostics is the
> static analyzer David Malcolm is working on.

Obviously a static analyzer is better. That said, we've had some stupid bugs
wrt kfree(). Things like releasing things twice in error paths etc.

So yeah, doing it in the compiler isn't going to catch the subtle cases, but
catching the stupid cases early would still be a good thing.

But I also realize that it might not be worth it to you guys. Since you
already effectively have the DSE code, that looks like a much cheaper thing
to do.

(And maybe one day somebody will go "I can trivially see use-after-free
things too, and warn about it", so just having the marker might result in the
warnings at some point too).
[Bug middle-end/94527] RFE: Add an __attribute__ that marks a function as freeing an object
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

--- Comment #8 from Linus Torvalds ---
(In reply to Jonathan Wakely from comment #6)
> I can see uses that aren't just "frees the memory", e.g. after fclose and
> close any further uses of their argument are probably errors. The close case
> is interesting because it's not a pointer argument.

I suspect you'll find that if it's not a pointer argument, it ends up being
not very useful.

To take your "close()" example: sure, after calling close, the fd isn't
active any more, and you can't use it for IO. So it's similar to free() in
many respects: the lifetime is over and done with as a file descriptor.

But it's very different in other ways. For example, it is _entirely_ valid to
do something like this:

    close(fd);
    mark_fd_freed(fd);

where the application keeps track of free fd ranges on its own (or it keeps
track of a set of file descriptors it is polling, or whatever).

So you can still validly use the 'fd' value even after you closed it, in ways
that are not really the case with pointers from memory allocators and free().
Dereferencing a pointer after freeing it is basically _always_ a bug, but
using a file descriptor after closing it might not be.
[Bug inline-asm/94522] Enhancement request: asm goto with outputs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94522

Linus Torvalds changed: CC added torvalds@linux-foundation.org

--- Comment #4 from Linus Torvalds ---
(In reply to Andrew Pinski from comment #1)
> Take:
>     asm goto ("" : "=r"(*m) :);
>
> Where does the store to *m happen? Do you replicate it on to the label side
> too?
>
> What are the semantics for the above case on clang? My bet is it is not
> well defined and really broken.

The outputs are defined to have values only in the fallthrough case. On the
label side, the outputs (whether memory or register) are explicitly
undefined. The outputs may obviously not even be in scope except in the
fallthrough.

Now, with memory ops, it's obviously quite possible that the programmer then
knows that they wrote an inline asm that did the write to memory before it
did the goto. But that's no different from a _non_output_ "asm goto" that you
pass a memory pointer to, so I don't think that's something that is new to
this situation.

As to whether this is simple or not to do in gcc - the clang implementation
has been buggy so far, although it's good enough for testing ;)
[Bug inline-asm/94522] Enhancement request: asm goto with outputs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94522

--- Comment #5 from Linus Torvalds ---
Btw, Nick (who is doing this on the clang side) tells me that the tcmalloc
people are looking at using asm goto with outputs too, so it's not just the
kernel.

If somebody wants to play with it, I do have a patch to use it in the kernel
as a test-case.

HOWEVER - I'm working on cleaning up some of the infrastructure around it,
but at least for now, that patch is an "all or nothing" thing: it
unconditionally requires asm goto with outputs support, so there's no
fallback to an "older compiler without asm goto with outputs" codepath.

With the cleanups, I hope to have a patch that can swing both ways, but right
now it's very much for testing only, and I don't want to make that patch too
public because it will just break for anybody with a non-experimental
compiler.
[Bug c/89501] New: Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

            Bug ID: 89501
           Summary: Odd lack of warning about missing initialization
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: torva...@linux-foundation.org
  Target Milestone: ---

Created attachment 45820
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45820&action=edit
You can compile this to see a lack of warning.

We had a simple kernel patch that introduced a stupid bug due to an
uninitialized variable, and while we got it all sorted out and the fix was
trivial, people were surprised by the lack of warning for the uninitialized
case.

I'm adding a test-case as an attachment that largely matches the kernel code
that didn't warn. But it boils down to a pattern of

    int ret;    /* UNINITIALIZED */

    if (somecondition) {
        ret = functioncall(x);
        if (ret)
            return ret;
    }

    .. some more work ..

    return ret;    /* Possibly uninitialized return value! */

What I *suspect* happens is

 (a) gcc sees that there is only one assignment to "ret"

 (b) in the same basic block as the assignment, there is a test against "ret"
     being nonzero that goes out.

and what I think happens is that (a) causes gcc to consider that assignment
to be the defining assignment (which makes all kinds of sense in an SSA
world), and then (b) means that gcc decides that clearly "ret" has to be zero
in any case that doesn't go out due to the if-test.

So it turns out that gcc almost by mistake generates code that works (and
doesn't warn about it, exactly because it works), even though the source code
was clearly buggy.

The attached test-case is stupid but intentionally made to be as close to the
kernel source case as possible.
With it, I can do:

Look, ma: no warning:

    gcc -O2 -S -Wall test.c

but this gives the expected warning:

    gcc -O2 -S -Wall -DHIDE_PROBLEM test.c

Regardless, this is not a huge problem for us, but I decided to make a report
since we have a test case, and maybe somebody gets excited about it.

Thanks,
Linus
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #2 from Linus Torvalds ---
(In reply to Andrew Pinski from comment #1)
> I think it comes down to the same issue as PR 18501.

Very possibly the same issue in just a different guise.

NOTE! I have in the meantime verified that yes, it does seem to be about the
pattern

    int x;

    if (somecondition) {
        x = something();
        if (x != XYZ)
            return x;
    }

    return x;

where gcc seems to take "if (x != XYZ) return x" to mean that "x" clearly
_has_ to be XYZ elsewhere.

If I change my kernel-based test-case to do

    if (ret != 1)
        return ret;

instead of the original

    if (ret)
        return ret;

then gcc will actually generate code that ends with

    movl    $1, %eax
    popq    %rbp
    popq    %r12
    ret

ie it will basically consider "ret" to be initialized to that value "1", even
if the basic block that assigned it was never actually executed.

Knowing how SSA works, I'm not entirely surprised, but obviously if you'd
like to see the warning about buggy source code, it's less than optimal.

Anyway, this shouldn't be a high priority, but it does strike me as a
potentially fairly common pattern that people might be missing warnings for.

Linus
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #6 from Linus Torvalds ---
(In reply to Richard Biener from comment #5)
> And this meeting helps us avoid bogus warnings for cases where GCC has
> difficulties proving dead code paths are actually dead ...

Ack. I do see the difficulty. We already disable some of the
"may-be-uninitialized" warnings in the kernel because they end up being too
noisy for some configurations.

NP. If you guys figure something out, it will be good. If not, we'll survive.
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #8 from Linus Torvalds ---
(In reply to Jeffrey A. Law from comment #7)
> It's reliably the case that a false positive uninit warning also represents
> a failure to optimize something. So we've got significant incentives to
> deeply analyze and look for fixes. So feel free to pass examples of those
> along.

Well, most of it is due to interactions with *other* issues entirely.

For example, when we enable GCOV for coverage checking, we have to disable
tree-loop-im, because of excessive stack usage due to gcc bugzilla 69702. And
disabling that optimization then leads to bogus "might be used uninitialized"
warnings.

We have a few other cases like that. E.g. we can't use -Os without also
disabling -Wmaybe-uninitialized etc.

Some of the cases may be historical (ie maybe you've fixed the issues that
caused us to do that in newer versions), but for various distro reasons we
end up supporting old compilers for a _looong_ time. Not that long ago we
ended up increasing the minimum required gcc version for the kernel to
gcc-4.6.

So we have this odd "we love using new features" fight with "oops, but we end
up having people using relatively ancient compiler versions".
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #10 from Linus Torvalds ---
(In reply to ncm from comment #9)
> What I don't understand is why it doesn't optimize away the check on
> (somecondition), since it is assuming the code in the dependent block always
> runs.

No, it very much doesn't assume that. The 'somecondition' really is dynamic.

What happens is simply that because gcc sees only a single assignment to the
variable, and that assignment is then limited by the subsequent value test to
a single value, gcc will just say "ok, any other place where that variable is
used, just use the known single value". And it does that whether the 'if
(somecondition)' path was taken or not.

It's a perfectly valid optimization. In fact, it's a lovely optimization, and
I'm not at all complaining about the code generation.

It's just that as part of that (quite reasonable) optimization it also
optimized away the whole "oops, there wasn't really a valid initialization in
case the if-statement wasn't done".

Obviously that's undefined behavior, and the optimization is valid
regardless, but the lack of warning means that we didn't see that we had
technically undefined behavior that the compiler has silently just fixed up
for us.

I think the cause of this all is quite reasonable and understandable, and I
also see why gcc really does want to throw away the undefined case entirely
(because otherwise you can get into the reverse situation where you warn
unnecessarily, because gcc isn't smart enough to see that some undefined case
will never ever actually happen).

Plus I assume it simplifies things a lot to just not even have to track the
undefined case at all. You can just track "ok, down this path we have a known
value for this SSA, and we don't need to keep track of any inconvenient phi
nodes etc".
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #12 from Linus Torvalds ---
(In reply to Jeffrey A. Law from comment #11)
> More generally we have considered whether or not we could eliminate the
> control dependent path which leads to undefined behavior. But you have to
> be careful because statements on the path prior to the actual undefined
> behavior may have observable side effects.

Note that for the kernel, we consider those kinds of "optimizations"
completely and utterly wrong-headed, and will absolutely refuse to use a
compiler that does things like that.

It's basically the compiler saying "I don't care what you meant, I can do
anything I want, and that means I will screw the code up on purpose".

I will personally switch the kernel immediately to clang the moment we cannot
turn off idiotic broken behavior like that. We currently already disable
'strict-aliasing', 'strict-overflow' and 'delete-null-pointer-checks'.
Because compilers that intentionally create garbage code are evil and wrong.

Compiler standards bodies that add even more of them (the C aliasing rules
being the prime example) should be ashamed of themselves. And compiler people
that think it's ok to say "it's undefined, so we do anything we damn well
like" shouldn't be working with compilers IMNSHO.

If you know something is undefined, you warn about it. You don't silently
generate random code that doesn't match the source code just because some
paper standard says you "can".
[Bug c/61904] New: Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

            Bug ID: 61904
           Summary: Incorrect stack red-zoning on x86-64 code generation
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: critical
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: torva...@linux-foundation.org

gcc-4.9.0 in Debian seems to miscompile the linux kernel for x86-64 in
certain configurations, creating accesses below the stack pointer even though
the kernel uses -mno-red-zone.

The kernel cannot use the x86-64 stack red-zoning, because the hardware only
switches stacks on privilege transfers, so interrupts that happen in kernel
mode will not honor the normal 128-byte stack red-zone.

Attached is the pre-processed C code of the current kernel file
kernel/sched/fair.c, which apparently on gcc-4.9.0 will miscompile the
function "load_balance()", creating code like this:

    load_balance:
    .LFB2408:
        .loc 2 6487 0
        .cfi_startproc
    .LVL1355:
        pushq   %rbp    #
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp      #,
        .cfi_def_cfa_register 6
        pushq   %r15    #
        pushq   %r14    #
        pushq   %r13    #
        pushq   %r12    #
        .cfi_offset 15, -24
        .cfi_offset 14, -32
        .cfi_offset 13, -40
        .cfi_offset 12, -48
        movq    %rdx, %r12      # sd, sd
        pushq   %rbx    #
    .LBB2877:
        .loc 2 6493 0
        movq    $load_balance_mask, -136(%rbp)  #, %sfp
    .LBE2877:
        .loc 2 6487 0
        subq    $184, %rsp      #,
        .cfi_offset 3, -56
        .loc 2 6489 0

Note the "subq $184, %rsp" *after* the compiler has already spilled to the
stack (the spill is insane, btw, since it's spilling a constant value!)

The second attachment is the reported mis-compiled result. I don't personally
have the affected gcc version, but you can see the options passed into the
compiler in the resulting "fair.s" file. The "-Os" in particular seems to be
important, with the bug not happening with "-O2".
[Bug c/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #1 from Linus Torvalds ---
Created attachment 33183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33183&action=edit
Buggy result of compilation

This was sent by the reporter of the kernel problem, Michel Dänzer, at my
request. It's gzipped to fit the file size limit.
[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #3 from Linus Torvalds ---
Created attachment 33184
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33184&action=edit
gzipped pre-processed fair.i file

Apparently attaching a file during the initial bug entry creation failed,
probably because it failed the size check. So here's the fair.i file.
[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #4 from Linus Torvalds ---
As I already mentioned to Jakub Jelinek separately in the original email
thread about the kernel problem:

  "Note that I don't personally have a reproducer (my machine has gcc-4.8.3,
   and I don't see the same behavior), but I included the incorrect fair.s
   file that Michel sent me (which has all the command line options in it),
   and a pre-processed "fair.i" source file that I generated and that
   *should* match the configuration that was the source for that result.

   So there might be some version/configuration skew there between the two
   files, but I think they match. Holler if you cannot reproduce the problem
   based on that."

so if the attached *.i and resulting buggy *.s files don't match up
perfectly, that's the explanation.
[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #7 from Linus Torvalds ---
Just for completeness for the original bug report, here's the actual compiler
command line that the kernel uses to generate the *.s file:

    gcc -Wp,-MD,kernel/sched/.fair.s.d -nostdinc \
        -isystem /usr/lib/gcc/x86_64-redhat-linux/4.8.3/include \
        -I./arch/x86/include -Iarch/x86/include/generated -Iinclude \
        -I./arch/x86/include/uapi -Iarch/x86/include/generated/uapi \
        -I./include/uapi -Iinclude/generated/uapi \
        -include ./include/linux/kconfig.h -D__KERNEL__ -Wall -Wundef \
        -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common \
        -Werror-implicit-function-declaration -Wno-format-security -m64 \
        -mno-mmx -mno-sse -mno-80387 -mno-fp-ret-in-387 \
        -mpreferred-stack-boundary=3 -mtune=generic -mno-red-zone \
        -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args \
        -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 \
        -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 \
        -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -pipe \
        -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx \
        -mno-sse2 -mno-3dnow -mno-avx -fno-delete-null-pointer-checks -Os \
        -Wno-maybe-uninitialized -Wframe-larger-than=2048 \
        -fno-stack-protector -Wno-unused-but-set-variable \
        -fno-omit-frame-pointer -fno-optimize-sibling-calls -g \
        -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow \
        -fconserve-stack -Werror=implicit-int -Werror=strict-prototypes \
        -DCC_HAVE_ASM_GOTO -D"KBUILD_STR(s)=#s" \
        -D"KBUILD_BASENAME=KBUILD_STR(fair)" \
        -D"KBUILD_MODNAME=KBUILD_STR(fair)" \
        -fverbose-asm -S -o kernel/sched/fair.s kernel/sched/fair.c

although looking at the generated .s file in the attachment, I really think
it's all there too in the comments at the top of the file.
[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #8 from Linus Torvalds ---
Oh, and this is marked as a duplicate of 61801, but that one is marked as
being in gcc-4.8.3.

The particular problem we see in kernel code generation seems to *not* happen
in 4.8.3 (current fedora 20), and happens with Debian 4.9.0.

So now I worry about

 (a) whether the duplicate bug is really true

 (b) or perhaps 4.8.3 really does have the same problem and we might get
     bitten there too, and it just happens to trigger on 4.9.0 for some
     otherwise unrelated reason.

I'd like to have some way to tell known-bad compilers, so that we know to
avoid them..
[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #10 from Linus Torvalds ---
(In reply to Andrew Pinski from comment #9)
> The bad compiler versions are 4.5.0 (when debug_insn came in) to 4.8.3 and
> 4.9.0 and 4.9.1.

Ok, so we have no reasonable way of avoiding the problem compiler versions. I
had hoped that we could just warn if people use 4.9.0/1, since they aren't
very common yet. Ugh.

We had one suggestion of having some post-compile checking pass, but that one
was (so far) just handwaving ("objdump + perl-script"). It doesn't sound very
pleasant.

The problem is that these things are a bitch to debug - they turn into these
completely impossible kernel oopses or corruption, and we were just very
lucky that this one case happened to be repeatable and easy to pinpoint for
two people. Are there others? We have no way of knowing..

Anyway, thanks for the quick resolution, even if I'm now rather nervous about
existing compilers..
[Bug other/48696] New: Horrible bitfield code generation on x86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

           Summary: Horrible bitfield code generation on x86
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: torva...@linux-foundation.org

gcc (tried 4.5.1 and 4.6.0) generates absolutely horrid code for some common
bitfield accesses due to minimizing the access size.

Trivial test case:

    struct bad_gcc_code_generation {
        unsigned type:6,
            pos:16,
            stream:10;
    };

    int show_bug(struct bad_gcc_code_generation *a)
    {
        a->type = 0;
        return a->pos;
    }

will generate code like this on x86-64 with -O2:

        andb    $-64, (%rdi)
        movl    (%rdi), %eax
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret

where the problem is the byte-sized write followed by a word-sized read. Most
(all?) modern x86 CPU's will come to a screeching halt when they see a read
that hits a store buffer entry, but cannot be fully forwarded from it. The
penalty can be quite severe, and this is _very_ noticeable in profiles.

This code would be _much_ faster either using an "andl" (making the store
size match the next load, and thus forwarding through the store buffer), or
by having the load be done first.

(The above code snippet is not the real code I noticed it on, obviously, but
real code definitely sees this, and profiling shows very clearly how the
32-bit load from memory basically stops cold due to the partial store buffer
hit)

Using non-native accesses to memory is fine for loads (so narrowing the
access for a pure load is fine), but for store or read-modify-write
instructions it's almost always a serious performance problem to try to
"optimize" the memory operand size to something smaller.

Yes, the constants often shrink, but the code becomes *much* slower unless
you can guarantee that there are no loads of the original access size that
follow the write.
[Bug other/48696] Horrible bitfield code generation on x86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #1 from Linus Torvalds 2011-04-20 04:28:18 UTC ---
Side note, this is talked about in the Intel optimization manual, section
3.6.4 ("Store Forwarding"), under "General Optimization Guidelines" ->
"Optimizing Memory Accesses" -> "Store Forwarding".

I'm sure people are aware of it, but I'm not certain people realize how
horrible the performance hit can be. I have to admit to having been a bit
surprised myself just _how_ clearly it shows up in profiles.
[Bug rtl-optimization/48696] Horrible bitfield code generation on x86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #7 from Linus Torvalds 2011-04-20 15:30:17 UTC ---
(In reply to comment #2)
> I'm not sure where to best address this, rather than throwing in again
> the idea of lowering bitfield accesses early on trees.

So my gut feel is that getting rid of the bitfield as early as possible, and
turning all bitfield accesses into regular load/shift/mask/store operations,
is always the right thing to do.

I also think that doing it with the size that the user specified is generally
a good idea, ie I sincerely hope that gcc hasn't thrown away the "unsigned
int" part of the type when it does the lowering of the bitfield op.

If gcc has forgotten the underlying type, and only looks at the bitfield size
and offset, gcc will likely never do a good job at it unless gcc gets
_really_ smart and looks at all the accesses around it and decides "I need to
do these as 'int'" just because (ie in the example, the "unsigned" base type
is as important as the "bits 0..5" range information).

So I suspect it's better to just do a totally mindless expansion of bitfield
accesses early, and then use all the regular optimizations on them. Rather
than keep them around as bitfields and try to optimize at some higher level.

In an ironic twist, the real program that shows this optimization problem is
"sparse" (the kernel source code checker), which can actually "linearize and
optimize" the test-case itself, and in this case does this all better than
gcc (using its "dump the linearized IR" test-program):

    [torvalds@i5 ~]$ ./src/sparse/test-linearize test.c
    test.c:7:5: warning: symbol 'show_bug' was not declared. Should it be static?
    show_bug:
    .L0x7f4cf7b93010:
        load.32     %r2 <- 0[%arg1]
        and.32      %r3 <- %r2, $-64
        store.32    %r3 -> 0[%arg1]
        lsr.32      %r7 <- %r3, $6
        cast.32     %r8 <- (16) %r7
        ret.32      %r8

Heh. Sparse may get a lot of other things wrong, but it got this particular
case right.
[Bug rtl-optimization/48696] Horrible bitfield code generation on x86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #11 from Linus Torvalds 2011-04-20 16:16:52 UTC ---
(In reply to comment #8)
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...). But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).

That sounds fine. The only reason to bother with the "underlying type" is
that I suspect it could be possible for educated programmers to use it as a
code generation hint.

IOW, if all the individual fields end up fitting nicely in "char", using that
as a base type (even if the _total_ fields don't fit in a single byte) might
be a good hint for the compiler that it can/should use byte accesses and
small constants.

But using the biggest aligned word-size is probably equally good in practice.

And if you end up narrowing the types on _reads_, I think that's fine on x86.
I forget the exact store buffer forwarding rules (and they probably vary a
bit between different microarchitectures anyway), but I think almost all of
them support forwarding a larger store into a smaller (aligned) load.

It's just the writes that should normally not be narrowed.

(Of course, sometimes you may really want to narrow it. Replacing a

    andl $0xff00,(%rax)

with a simple

    movb $0,(%rax)

is certainly a very tempting optimization, but it really only works if there
are no subsequent word-sized loads that would get fouled by the write buffer
entry.)
[Bug other/49095] New: Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

           Summary: Horrible code generation for trivial decrement with
                    test
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: torva...@linux-foundation.org

This trivial code:

    extern void fncall(void *);

    int main(int argc, char **argv)
    {
        if (!--*argv)
            fncall(argv);
        return 0;
    }

compiles into this ridiculous x86-64 assembly language:

        movq    (%rsi), %rax
        subq    $1, %rax
        testq   %rax, %rax
        movq    %rax, (%rsi)
        je      .L4

for the "decrement and test result" at -O2. I'd have expected that any
reasonable compiler would generate something like

        decq    (%rsi)
        je      .L4

instead, which would be smaller and faster (even a "subq $1" would be fine,
but the decq is one byte shorter).

The problem is more noticeable when the memory location is a structure
offset, when the "load+decrement+store" model really results in relatively
much bigger code due to the silly repetition of the memory address, for
absolutely no advantage.

Is there some way that I haven't found to make gcc use the rmw instructions?
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #2 from Linus Torvalds 2011-05-21 18:41:15 UTC ---
(In reply to comment #1)
> On the RTL side combine tries to do
>
> Trying 7, 8 -> 9:
> Failed to match this instruction:
> (parallel [
>     (set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
>         (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
>             (const_int -1 [0x])))
>     (set (reg/f:DI 60 [ D.2723 ])
>         (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
>             (const_int -1 [0x])))
> ])
>
> because we have a use of the decrement result in the comparison. It doesn't
> try to combine this with the comparison though.

Why isn't there a trivial pattern for the combination of "add+cmp0"? It
sounds like a peephole optimization to me.

> So this case is really special ;) Without the use of the decremented
> value we get the desired subq $1, (%rsi).

The whole notion of "decrement and check if zero" is just about as special as
mud. And I realize that without the "check if zero" part I get the single rmw
instruction, but I was really hoping that gcc would get this kind of really
obvious code right.

There is absolutely no question about what the correct result is, and gcc
simply doesn't generate it. I'm used to gcc sometimes being confused by more
complicated things (inline asms, bitfields etc), but this is really basic
code.

The load-store model is fine for a Pentium 4 - those things were not very
good at complex instructions. But it generates horribly big code, and modern
x86 chips all want the "operate on memory" version.

> Manually sinking the store to *argv into the if and the else yields

Yeah. And that's pretty horrible.

> As usual combine doesn't like stores.

Is there some reason this can't just be a peephole pattern? I really thought
that gcc has done this before.

Linus
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #3 from Linus Torvalds 2011-05-21 20:42:26 UTC ---
Hmm. Looking at that code generation, it strikes me that even with the odd
load store situation, why do we have that "test" instruction?

       c:   8b 10         mov    (%eax),%edx
       e:   83 ea 01      sub    $0x1,%edx
      11:   85 d2         test   %edx,%edx
      13:   89 10         mov    %edx,(%eax)
      15:   74 09         je     20

iow, regardless of any complexities of the store, that "sub + test" is just
odd. Gcc knows to simplify that particular sequence in other situations, why
doesn't it simplify it here?

IOW, I can make gcc generate code like

       c:   83 e8 01      sub    $0x1,%eax
       f:   75 07         jne    18

with no real problem when it's in registers. No "test" instruction after the
sub. Why does that store matter so much?

It looks like the combine is being driven by the conditional branch, and then
when the previous instruction from the conditional branch is that store,
everything kind of goes to hell.

Would it be possible to have a peephole for the "arithmetic/logical +
compare-with-zero" case (causing us to just drop the compare), and then have
a separate peephole optimization that triggers on the "load + op + store with
dead reg" sequence and turns that into an "op to mem" case?

The MD files do make me confused, so maybe there is some fundamental
limitation to the peephole patterns that makes this impossible?
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #6 from Linus Torvalds 2011-05-27 14:15:25 UTC ---

Jakub - the patch looks sane, although I don't currently have a gcc build environment to actually test it with, and there is no way I'm going to claim that I follow all the matches and understand why you have that third pattern with SWI12 (but I'm guessing it's because the op and the test are being done in the "explicit cast to integer" mode).

One thing I wanted to check, though: I hope everybody is aware that

	op	$x,mem

is not identically the same as

	mov	mem,reg
	op	$x,reg
	mov	reg,mem
	test	reg

WRT the carry flag. The "test" version will always clear the carry flag, while the "op" version will obviously set it according to the "op". For the logical ops this isn't a problem (they also clear carry), but for "add" and "sub" this transformation *will* result in a difference in the C flag.

Now, "carry" is meaningless when talking about compare with zero, so this should all be trivially ok. But I bring it up because it is possible that gcc perhaps still uses the "unsigned compare" versions with zero. In particular, the thing to look out for would be code like

	unsigned int *a;
	if (--*a > 0)
		do_something();

and if the comparison uses "ja" ("unsigned greater than") for the comparison, it will get the wrong end result.

Now, unsigned compares against zero are always expressible as "ne" or "eq", so this is not a fundamental problem. In fact, my trivial testing with a few cases seems to imply that gcc already does that conversion to inequality, and you'd obviously have exactly the same issue with eliding the "test" instruction for the cases you already handle (when things are in registers).

So I think everything is fine - but I wanted to mention it explicitly in case it makes you go "ooh, yes, there are cases when this is an issue".
[Bug other/49194] New: Trivially stupid inlining decisions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

           Summary: Trivially stupid inlining decisions
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: torva...@linux-foundation.org

So gcc inlining heuristics seem to include the rule that if a function is called from a single call-site, and is static, then it should be inlined.

That is actually really horrible for some cases. In particular, if you have a small function that has an uncommon case handled by a much bigger function, you absolutely do *not* want to inline the big function, because it ends up creating a total monster of a stack frame - something that the small function on its own doesn't need.

So for example, in the kernel we have various of these kinds of situations where we decrement a refcount or a lock, and then in the unlikely situation that something is true, we need to do much more processing as a result. An example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable

 - if that variable is now zero, *and* we have a pending RCU event, we need to do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc that are irrelevant to the inlining issue) as

	--t->rcu_read_lock_nesting;
	if (t->rcu_read_lock_nesting == 0 &&
	    __builtin_expect(!!t->rcu_read_unlock_special, 0))
		rcu_read_unlock_special(t);

(where "t" is the thread pointer). And that's all.

However, because gcc inlines the special case, the function prologue ends up looking like this (that "current_task" is from our inline asm, it's just us getting the thread variable):

__rcu_read_unlock:
	pushq	%rbp	#
	movq	%rsp, %rbp	#,
	subq	$48, %rsp	#,
	movq	%r13, -24(%rbp)	#,
	movq	%rbx, -40(%rbp)	#,
	movq	%r12, -32(%rbp)	#,
	movq	%r14, -16(%rbp)	#,
	movq	%r15, -8(%rbp)	#,
	movq	%gs:current_task, %r13	#, t
	decl	256(%r13)	# t_15->rcu_read_lock_nesting
	...

which is pretty horrid.
It uses a lot of stack, and has stupid useless instructions.

Now, I can just mark that rcu_read_unlock_special() function as "noinline", and as a result I get this:

__rcu_read_unlock:
	pushq	%rbp	#
	movq	%rsp, %rbp	#,
	movq	%gs:current_task, %rdi	#, t
	decl	256(%rdi)	# t_15->rcu_read_lock_nesting
	...

which is obviously much superior code for the case that actually matters.

So rather than have to find all of these things manually (yes, those things do actually show up on profiles - the stack expansion in particular ends up showing up as extra costs), maybe gcc could just have a simple check:

 - if the call is in an unlikely section

 - .. and the callee is much bigger (in stack frame or whatever) than the caller

 - .. simply don't inline

Hmm? I realize that this may sound specialized, but I've been looking at kernel profiles for the last few weeks now, and this particular issue has come up in two independent hotspots.

It turns out that excessive stack use is *expensive*. It's not just the normal "the kernel stack is limited" issue; it's actually a matter of "the L1 D$ is pretty small, and a big stack frame actually causes real costs". I really didn't expect register saves on the stack to show up in my profiles, but they actually do.

I expect the stack to basically always be in the L1 D$ and dirty (so that an access to it should be almost free), but that is only true for a _dense_ stack. Not for leaf functions that then have extra stack-frame-wasting code in them.

Comments?
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #8 from Linus Torvalds 2011-05-27 15:32:21 UTC --- (In reply to comment #7)
>
> BTW, the patch bootstrapped/regtested on both x86_64-linux and i686-linux, I'm
> just running second set of bootstrap without the patch to be able to compare
> how much the patch changed code on gcc itself.

I'd love to hear how well it works - I'm in the middle of a busy merge window and about to leave for a trip to Japan, otherwise I'd just try to set up a gcc build myself right now.

(Actually, I have a copy of a git tree, but that one fails with

  /usr/bin/ld: ../libiberty/libiberty.a(argv.o): relocation R_X86_64_32S
  against `_sch_istable' can not be used when making a shared object;
  recompile with -fPIC

and due to the merge window I don't really have time to try to figure it out.)

Thanks,
Linus
[Bug other/49194] Trivially stupid inlining decisions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194 --- Comment #6 from Linus Torvalds 2011-05-27 16:38:22 UTC --- (In reply to comment #3)
>
> -finline-functions-called-once is throttled down by the large-function-growth
> and large-stack-frame-growth limits.  The kernel case could probably be handled
> by the second.  Does the kernel bump down those limits?

We used to play with inlining limits (gcc had some really bad decisions), but the meaning of the numbers kept changing from one gcc version to another, and the heuristics gcc used kept changing too. That made it practically impossible to use sanely - you could tweak it for one particular architecture and one particular version of gcc, but it would then be worse for others.

Quite frankly, with that kind of history, I'm not very eager to start playing around with random gcc internal variables again. So I'd much rather have gcc use good heuristics by default, possibly helped by the kinds of obvious hints we can give ("unlikely()" in particular is something we can add for things like this).

Obviously, we can (and do) force the decision with either "noinline" or "__always_inline" (which are just the kernel macros to make the gcc attribute syntax slightly more readable), but since I've been filing those other bug reports about bad gcc code generation, I thought I'd point out this one too.

> It still won't help in case function doesn't have any on-stack aggregates,
> since we optimistically assume that all gimple registers will disappear.
> Probably even that could be changed, though estimating reload's stack frame
> usage so early would be iffy.

Yes, early stack estimation might not work all that well. In the kernel, we do end up having a few complex functions that we basically expect to inline to almost nothing - simply because we end up depending on compile-time constant issues (sometimes very explicitly, with __builtin_constant_p() followed by a largish "switch ()" statement).
That said, this is something where the call-site really can make a big difference. Not just the fact that the call site might be marked "unlikely()" (again, that's just the kernel making __builtin_expect() readable), but things like "none of the arguments are constants" could easily be a good heuristic to use as a basis for whether to inline or not.

IOW, start out with whatever 'large-stack-frame-growth' and 'large-function-growth' values, but if the call-site is in an unlikely region, cut those values in half (or whatever). And if none of the arguments are constants, cut them in half again.

This is an example of why giving these limits as compiler options really doesn't work: the choice should probably be much more dynamic than just a single number.

I dunno. As mentioned, we can fix this problem by just marking things noinline by hand. But I do think that there are fairly obvious cases where inlining really isn't worth it, and gcc might as well just get those cases right.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #10 from Linus Torvalds 2011-05-27 16:48:52 UTC --- (In reply to comment #9)
>
>                 32-bit before   32-bit after   64-bit before   64-bit after
> libstdc++.so.6  0x71708         0x716e8        0x67ee6         0x67ec6
> libgcj.so.12    0xa3ec58        0xa3eb98       0x90cd68        0x90cce8
> cc1plus         0xe8a29c        0xe8a25c       0xdccf98        0xdccf08

Ok, that's much less noticeable than I was hoping for.

That said, even for the kernel, the only reason I noticed this problem was not because I've seen a lot of instances per se, but because the pattern showed up in a very hot function. In fact, it's the same __rcu_read_unlock() function that I made the otherwise unrelated bugzilla entry about inlining for:

  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

so it's quite possible that we don't have that many of them in the kernel either.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #13 from Linus Torvalds 2011-05-29 18:56:44 UTC --- Thanks guys.