Re: Compilers and RCU readers: Once more unto the breach!
* Linus Torvalds wrote:

> (a) the "official" rules are completely pointless, and make sense
>     only because the standard is written for some random "abstract
>     machine" that doesn't actually exist.

Presuming the intent of the abstract machine specification is to avoid
being seen as biased towards any specific machine (politics), maybe
write this as:

  (a) the "official" rules are written for a somewhat weird and complex
      "union of all known and theoretically possible CPU architectures
      that exist or which might exist in the future", a machine which
      does not actually exist in practice, but which allows a single
      abstract set of rules to apply to all machines. These rules are
      complex, but if applied to a specific machine they become
      considerably simpler. Here are a few examples: ...

?

(Assuming it's a goal of this standard to be human-parseable to more
than a few dozen people on the planet.)

Thanks,

	Ingo
Re: Memory corruption due to word sharing
* Linus Torvalds wrote:

> [...]
>
> And I realize that compiler people tend to think that loop
> hoisting etc is absolutely critical for performance, and some
> big hammer like "barrier()" makes a compiler person wince. You
> think it results in horrible code generation problems.
>
> It really doesn't. Loops are fairly unusual in the kernel to
> begin with, and the compiler barriers are a total non-issue.
> We have much more problems with the actual CPU barriers that
> can be *very* expensive on some architectures, and we work a
> lot at avoiding those and avoiding cacheline ping-pong issues
> etc.

Just to underline this point: if barriers caused optimization problems
when GCC builds the kernel then we'd expect to see various code
generation problems - for example the compiler would not be able to
cache things well enough and reorder code to make it faster and (often)
more compact.

So to put Linus's claim to the test, I picked up a fairly bleeding-edge
version of GCC:

  gcc version 4.7.0 20120112 (Red Hat 4.7.0-0.6) (GCC)

and performed a test build of the kernel with the majority of
optimization barriers removed (using the v3.2 kernel, x86 defconfig,
64-bit, -O2 optimization level): 1600 barriers were removed (!) and
GCC's hands were thus freed to create more optimal code [and a very
broken kernel], if it could.

I compared the resulting kernel image to an unmodified kernel image:

     text	   data	    bss	     dec	    hex	filename
  9781555	 982328	1118208	11882091	 b54e6b	vmlinux.vanilla
  9780972	 982328	1118208	11881508	 b54c24	vmlinux.no-barriers

So the barriers are costing us only a 0.006% size increase - 583 bytes
on an almost 10 MB kernel image. To put that into perspective: a
*single* inlining decision by the compiler has a larger effect than
that. Just a couple of days ago we uninlined a function, which had an
order of magnitude larger effect than this.

The other possible dimension would be the ordering of instructions. To
test for that effect I disassembled the two kernel images and performed
a function by function, instruction by instruction comparison of
instruction ordering.

The summary is that GCC was able to remove only 86 instructions
(0.005%) and reorder around 2400 instructions (0.15%) - out of about
1,570,000 instructions. Or, put differently, for the 1600 barriers in
this particular kernel build, that's about 1.5 instructions reordered
and 0.05 instructions removed per barrier.

I also inspected the type of reordering: the overwhelming majority of
it happened within a jump-free basic block of instructions and did not
affect any loops.

Thus most of the effect the barriers have on the kernel is the crucial
effect that we want them to have: to enforce a specific program order
sequence - and in the process the barrier()s cause very, very small
optimization-quality side effects.

So the numbers support Linus's claim: the kernel incurs very little
optimization cost from barriers.

Thanks,

	Ingo
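For reference, barrier() is the kernel's compiler-only barrier: an
empty asm statement with a "memory" clobber, which emits no
instructions at all. A minimal sketch of why it constrains the
optimizer - the macro mirrors the kernel's definition, while the
flag-polling function is a made-up illustration, not kernel code:

  /* Compiler-only barrier: emits no instructions, but GCC must assume
   * all memory may have changed, so it can neither cache memory values
   * in registers across it nor move memory accesses around it. */
  #define barrier() __asm__ __volatile__("" ::: "memory")

  static int flag;	/* made-up example: set by an interrupt handler */

  void wait_for_flag(void)
  {
  	while (!flag)		/* without the barrier GCC could hoist  */
  		barrier();	/* this load and spin on a stale value  */
  }

This also shows why the codegen cost measured above is so small: the
barrier only forbids caching and movement of memory accesses across one
program point, it takes nothing else away from the optimizer.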
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Linus Torvalds wrote:

> Admittedly, anybody who compiles with -pg probably doesn't care deeply
> about smaller and more efficient code, since the mcount call overhead
> tends to make the thing moot anyway, but it really looks like a
> win-win situation to just fix the mcount call sequence regardless.

Just a sidenote: due to dyn-ftrace, which patches out all mcount calls
during bootup to be NOPs (and opt-in patches them in again if someone
runs the function tracer), the cost is not as large as one would have
with, say, -pg based user-space profiling.

It's not completely zero-cost, as the pure NOPs balloon the I$
footprint a bit and GCC generates different code too in some cases. But
it's certainly good enough that it's generally pretty hard to prove via
micro or macro benchmarks that the patched-out mcount call sites are
there at all.

	Ingo
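To illustrate the mechanism: dyn-ftrace rewrites each 5-byte
"call mcount" that -pg emits into a 5-byte NOP at bootup. A simplified
sketch of the idea - illustrative only, the real code in
arch/x86/kernel/ftrace.c also has to synchronize the live patching
against concurrently executing CPUs:

  #include <string.h>

  #define MCOUNT_INSN_SIZE 5	/* "call mcount" is 5 bytes on x86 */

  /* a 5-byte x86 NOP (one of the variants CPUs execute cheaply) */
  static const unsigned char nop5[MCOUNT_INSN_SIZE] = {
  	0x0f, 0x1f, 0x44, 0x00, 0x00
  };

  /* patch out one mcount call site: overwrite the call with a NOP */
  static void patch_out_mcount(unsigned char *call_site)
  {
  	memcpy(call_site, nop5, MCOUNT_INSN_SIZE);
  }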
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Steven Rostedt wrote:

> On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote:
> > * Linus Torvalds wrote:
> >
> > > Admittedly, anybody who compiles with -pg probably doesn't care deeply
> > > about smaller and more efficient code, since the mcount call overhead
> > > tends to make the thing moot anyway, but it really looks like a
> > > win-win situation to just fix the mcount call sequence regardless.
> >
> > Just a sidenote: due to dyn-ftrace, which patches out all mcount calls
> > during bootup to be NOPs (and opt-in patches them in again if someone
> > runs the function tracer), the cost is not as large as one would have
> > with, say, -pg based user-space profiling.
> >
> > It's not completely zero-cost, as the pure NOPs balloon the I$
> > footprint a bit and GCC generates different code too in some cases.
> > But it's certainly good enough that it's generally pretty hard to
> > prove via micro or macro benchmarks that the patched-out mcount call
> > sites are there at all.
>
> And frame pointers do add a little overhead as well. Too bad the mcount
> ABI wasn't something like this:
>
> <function>:
>	call mcount
>	[...]
>
> This way, the function address for mcount would have been (%esp) and
> the parent address would be 4(%esp). Mcount would work without frame
> pointers and this whole mess would also become moot.

In that case we could also fix up static call sites to jump +5 bytes
into the function, and avoid the NOP as well in most cases. (That would
in essence merge any slow-path function epilogue with the mcount call
instruction in terms of I$ footprint - i.e. it would be an even lower
overhead feature.)

If only the kernel had its own compiler.

	Ingo
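Under that hypothetical ABI the tracer could recover both addresses
from the stack alone. A sketch with made-up helpers (32-bit, assuming
the call really is the function's very first instruction):

  #define CALL_INSN_SIZE 5	/* a near call is 5 bytes on x86 */

  /*
   * At mcount entry, 'sp' would point at the stack, where:
   *   sp[0] = return address into the traced function (its entry + 5)
   *   sp[1] = the caller's return address, i.e. the parent IP
   */
  static void *traced_func(void **sp)
  {
  	/* back up over the call to get the function's entry address */
  	return (unsigned char *)sp[0] - CALL_INSN_SIZE;
  }

  static void *parent_ip(void **sp)
  {
  	return sp[1];		/* no frame-pointer walk needed */
  }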
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Linus Torvalds wrote:

> [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT
>   totally unrelated numbers? Very confusing. ]

incidentally (or maybe not so incidentally) that got fixed yesterday in
-tip - at around the time i triggered that crash:

  350f8f5: x86: Eliminate redundant/contradicting cache line size config options

See the full commit below. The config that triggered the crash for me has:

  CONFIG_X86_L1_CACHE_SHIFT=4

so it's 16 bytes - and it's consistent now, which is a new angle. So i
think this explains why it stayed dormant for such a long time - it was
hidden by the cacheline-size config value inconsistencies.

	Ingo

------------------>
>From 350f8f5631922c7848ec4b530c111cb8c2ff7caa Mon Sep 17 00:00:00 2001
From: Jan Beulich
Date: Fri, 13 Nov 2009 11:54:40 +
Subject: [PATCH] x86: Eliminate redundant/contradicting cache line size
 config options

Rather than having X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT (with
inconsistent defaults), just having the latter suffices as the former
can be easily calculated from it.

To be consistent, also change X86_INTERNODE_CACHE_BYTES to
X86_INTERNODE_CACHE_SHIFT, and set it to 7 (128 bytes) for NUMA to
account for last level cache line size (which here matters more than
L1 cache line size).

Finally, make sure the default value for X86_L1_CACHE_SHIFT, when
X86_GENERIC is selected, is being seen before that for the individual
CPU model options (other than on x86-64, where GENERIC_CPU is part of
the choice construct, X86_GENERIC is a separate option on ix86).

Signed-off-by: Jan Beulich
Acked-by: Ravikiran Thirumalai
Acked-by: Nick Piggin
LKML-Reference: <4afd571002780001f...@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar
---
 arch/x86/Kconfig.cpu                   |   14 +++++---------
 arch/x86/boot/compressed/vmlinux.lds.S |    3 ++-
 arch/x86/include/asm/cache.h           |    7 ++++---
 arch/x86/kernel/vmlinux.lds.S          |   10 +++++-----
 arch/x86/mm/tlb.c                      |    3 ++-
 5 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f2824fb..621f2bd 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -301,15 +301,11 @@ config X86_CPU
 
 #
 # Define implied options from the CPU selection here
-config X86_L1_CACHE_BYTES
+config X86_INTERNODE_CACHE_SHIFT
 	int
-	default "128" if MPSC
-	default "64" if GENERIC_CPU || MK8 || MCORE2 || MATOM || X86_32
-
-config X86_INTERNODE_CACHE_BYTES
-	int
-	default "4096" if X86_VSMP
-	default X86_L1_CACHE_BYTES if !X86_VSMP
+	default "12" if X86_VSMP
+	default "7" if NUMA
+	default X86_L1_CACHE_SHIFT
 
 config X86_CMPXCHG
 	def_bool X86_64 || (X86_32 && !M386)
@@ -317,9 +313,9 @@ config X86_CMPXCHG
 config X86_L1_CACHE_SHIFT
 	int
 	default "7" if MPENTIUM4 || MPSC
+	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
 	default "4" if X86_ELAN || M486 || M386 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
-	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
 
 config X86_XADD
 	def_bool y
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index f4193bb..a6f1a59 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -4,6 +4,7 @@ OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT)
 
 #undef i386
 
+#include <asm/cache.h>
 #include <asm/page_types.h>
 
 #ifdef CONFIG_X86_64
@@ -46,7 +47,7 @@ SECTIONS
 		*(.data.*)
 		_edata = . ;
 	}
-	. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
+	. = ALIGN(L1_CACHE_BYTES);
 	.bss : {
 		_bss = . ;
 		*(.bss)
diff --git a/arch/x86/include/asm/cache.h b/arch/x86/include/asm/cache.h
index 549860d..2f9047c 100644
--- a/arch/x86/include/asm/cache.h
+++ b/arch/x86/include/asm/cache.h
@@ -9,12 +9,13 @@
 
 #define __read_mostly __attribute__((__section__(".data.read_mostly")))
 
+#define INTERNODE_CACHE_SHIFT	CONFIG_X86_INTERNODE_CACHE_SHIFT
+#define INTERNODE_CACHE_BYTES	(1 << INTERNODE_CACHE_SHIFT)
+
 #ifdef CONFIG_X86_VSMP
-/* vSMP Internode cacheline shift */
-#define INTERNODE_CACHE_SHIFT	(12)
 #ifdef CONFIG_SMP
 #define __cacheline_aligned_in_smp			\
-	__attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))	\
+	__attribute__((__aligned__(INTERNODE_CACHE_BYTES)))	\
 	__page_aligned_data
 #endif
 #endif
diff --git a/arch/x86/kernel/
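As a footnote, the SHIFT/BYTES relationship the commit unifies is a
simple power of two. With the crash config's value plugged in
(standalone defines for illustration, mirroring the asm/cache.h
pattern above):

  #define CONFIG_X86_L1_CACHE_SHIFT 4			/* the config above */
  #define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)
  #define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)	/* 1 << 4 == 16 bytes */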
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Steven Rostedt wrote:

> Ingo, Thomas and Linus,
>
> I know Thomas did a patch to force the -mtune=generic, but just in
> case gcc decides to do something crazy again, this patch will catch
> it.
>
> Should we try to get this in now?

Very nice example of defensive coding - i like this. I've queued it up
for .33 (unless anyone objects), as i think it's too late for .32.

	Ingo
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Thomas Gleixner wrote:

> On Tue, 24 Nov 2009, Jakub Jelinek wrote:
>
> > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > you should compile your code with -maccumulate-outgoing-args, and
> > > > there's no need to use -mtune=generic. Is that right?
> > >
> > > Seems to work. What other side effects has that ?
> >
> > Faster code, significant increase in code size though. Note that on
> > many architectures it is the only supported model.
>
> Just checked on the affected -marchs. The increase in code size is
> about 3% which is not that bad and definitely acceptable for the
> tracing case. Will zap the -mtune=generic patch and use
> -maccumulate-outgoing-args instead.

hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs,
so 3% is a big deal IMO.

	Ingo
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Thomas Gleixner wrote:

> On Wed, 25 Nov 2009, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >
> > > On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> > >
> > > > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > > > you should compile your code with -maccumulate-outgoing-args, and
> > > > > > there's no need to use -mtune=generic. Is that right?
> > > > >
> > > > > Seems to work. What other side effects has that ?
> > > >
> > > > Faster code, significant increase in code size though. Note that on
> > > > many architectures it is the only supported model.
> > >
> > > Just checked on the affected -marchs. The increase in code size is
> > > about 3% which is not that bad and definitely acceptable for the
> > > tracing case. Will zap the -mtune=generic patch and use
> > > -maccumulate-outgoing-args instead.
> >
> > hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs,
> > so 3% is a big deal IMO.
>
> Distro-configs have -mtune=generic anyway. So it's not changing
> anything for them.
>
> I'm talking about the -march flags which result in that weird code
> (pentium-mmx ...).

ok!

	Ingo
Re: [RFC patch] spindep: add cross cache lines checking
* Alex Shi wrote:

> > I think the check should be (__alignof__(lock) <
> > __alignof__(rwlock_t)), otherwise it will still pass when
> > you have a structure with attribute((packed,aligned(2)))
>
> reasonable!
>
> > > 1, is it an alignof bug for default gcc on my fc15 and Ubuntu 11.10 etc?
> > >
> > > struct sub {
> > > 	int raw_lock;
> > > 	char a;
> > > };
> > > struct foo {
> > > 	struct sub z;
> > > 	int slk;
> > > 	char y;
> > > } __attribute__((packed));
> > >
> > > struct foo f1;
> > >
> > > __alignof__(f1.z.raw_lock) is 4, but its address actually can align
> > > on one byte.
> >
> > That looks like correct behavior, because the alignment of
> > raw_lock inside of struct sub is still 4. But it does mean
> > that there can be cases where the compile-time check is not
> > sufficient, so we might want the run-time check as well, at
> > least under some config option.
>
> what's your opinion of this, Ingo?

Dunno. How many real bugs have you found via this patch?

Thanks,

	Ingo
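For the record, a standalone demo of the quirk being discussed - a
made-up test program, not part of the patch; the 'pad' member is added
here (it is not in the original example) to force the misalignment to
actually show up:

  #include <stdio.h>

  struct sub {
  	int  raw_lock;
  	char a;
  };

  struct foo {
  	char pad;	/* added to push z onto an odd offset */
  	struct sub z;
  	int  slk;
  	char y;
  } __attribute__((packed));

  static struct foo f1;

  int main(void)
  {
  	/* __alignof__ reports raw_lock's alignment within struct sub
  	 * (4), yet packing places it at a byte-aligned address, so a
  	 * compile-time alignment check can be fooled: */
  	printf("__alignof__ = %zu, address mod 4 = %zu\n",
  	       (size_t)__alignof__(f1.z.raw_lock),
  	       (size_t)((unsigned long)&f1.z.raw_lock % 4));
  	return 0;
  }

This is why the thread concludes that a compile-time check alone is
not sufficient and a run-time check would be needed as well.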
Re: [RFC patch] spindep: add cross cache lines checking
* Alex Shi wrote:

> On Wed, 2012-03-07 at 14:39 +0100, Ingo Molnar wrote:
> > * Alex Shi wrote:
> >
> > > > I think the check should be (__alignof__(lock) <
> > > > __alignof__(rwlock_t)), otherwise it will still pass when
> > > > you have a structure with attribute((packed,aligned(2)))
> > >
> > > reasonable!
> > >
> > > > > 1, is it an alignof bug for default gcc on my fc15 and Ubuntu
> > > > > 11.10 etc?
> > > > >
> > > > > struct sub {
> > > > > 	int raw_lock;
> > > > > 	char a;
> > > > > };
> > > > > struct foo {
> > > > > 	struct sub z;
> > > > > 	int slk;
> > > > > 	char y;
> > > > > } __attribute__((packed));
> > > > >
> > > > > struct foo f1;
> > > > >
> > > > > __alignof__(f1.z.raw_lock) is 4, but its address actually can
> > > > > align on one byte.
> > > >
> > > > That looks like correct behavior, because the alignment of
> > > > raw_lock inside of struct sub is still 4. But it does mean
> > > > that there can be cases where the compile-time check is not
> > > > sufficient, so we might want the run-time check as well, at
> > > > least under some config option.
> > >
> > > what's your opinion of this, Ingo?
> >
> > Dunno. How many real bugs have you found via this patch?
>
> None. Guess the stupid code was shot down in lkml review. But if the
> patch goes in, it will help block such code during development.

The question is, if in the last 10 years not a single such case made it
through to today's 15 million lines of kernel code, why should we add
the check now?

If it was a simple build time check then maybe - but judging by the
discussion it does not seem so simple, does it?

Thanks,

	Ingo