Re: Compilers and RCU readers: Once more unto the breach!
* Linus Torvalds wrote:

> (a) the "official" rules are completely pointless, and make sense
>     only because the standard is written for some random "abstract
>     machine" that doesn't actually exist.

Presuming the intent of the abstract machine specification is to avoid
being seen as biased towards any specific machine (politics), maybe
write this as:

  (a) the "official" rules are written for a somewhat weird and complex
      "union of all known and theoretically possible CPU architectures
      that exist or which might exist in the future", a machine which
      does not actually exist in practice, but which allows a single
      abstract set of rules to apply to all machines. These rules are
      complex, but if applied to a specific machine they become
      considerably simpler. Here are a few examples: ...

?

(Assuming it's a goal of this standard to be human-parseable to more
than a few dozen people on the planet.)

Thanks,

	Ingo
Re: Memory corruption due to word sharing
* Linus Torvalds wrote:

> [...]
>
> And I realize that compiler people tend to think that loop
> hoisting etc is absolutely critical for performance, and some
> big hammer like "barrier()" makes a compiler person wince. You
> think it results in horrible code generation problems.
>
> It really doesn't. Loops are fairly unusual in the kernel to
> begin with, and the compiler barriers are a total non-issue.
> We have much more problems with the actual CPU barriers that
> can be *very* expensive on some architectures, and we work a
> lot at avoiding those and avoiding cacheline ping-pong issues
> etc.

Just to underline this point: if barriers caused optimization problems
when GCC builds the kernel then we'd expect to see various code
generation problems - for example the compiler would not be able to
cache things well enough and reorder code to make it faster and (often)
more compact.

So to put Linus's claim to the test, I picked up a fairly bleeding-edge
version of GCC:

  gcc version 4.7.0 20120112 (Red Hat 4.7.0-0.6) (GCC)

and performed a test build of the kernel with the majority of
optimization barriers removed (using the v3.2 kernel, x86 defconfig,
64-bit, -O2 optimization level): 1600 barriers were removed (!) and
GCC's hands were thus freed to create more optimal code [and a very
broken kernel], if it could.

I compared the resulting kernel image to an unmodified kernel image:

     text	   data	    bss	     dec	    hex	filename
  9781555	 982328	1118208	11882091	 b54e6b	vmlinux.vanilla
  9780972	 982328	1118208	11881508	 b54c24	vmlinux.no-barriers

So the barriers are costing us only a 0.006% size increase - 583 bytes
on an almost 10 MB kernel image. To put that into perspective: a
*single* inlining decision by the compiler has a larger effect than
that. Just a couple of days ago we uninlined a function, which had an
order of magnitude larger effect than this.

The other possible dimension would be the ordering of instructions. To
test for that effect I disassembled the two kernel images and performed
a function by function, instruction by instruction comparison of
instruction ordering.

The summary is that GCC was able to remove only 86 instructions
(0.005%) and reorder around 2400 instructions (0.15%) - out of about
1,570,000 instructions. Or, put differently, for the 1600 barriers in
this particular kernel build, that's about 1.5 instructions reordered
and 0.05 instructions removed per barrier.

I also inspected the type of reordering: the overwhelming majority of
it happened within a jump-free basic block of instructions and did not
affect any loops.

Thus most of the effect the barriers have on the kernel is the crucial
effect that we want them to have: to enforce a specific program order
sequence - and in the process the barrier()s cause very, very small
optimization-quality side effects.

So the numbers support Linus's claim: the kernel incurs very little
optimization cost from barriers.

Thanks,

	Ingo
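For reference, barrier() is the kernel's compiler-only barrier: an
empty asm statement with a "memory" clobber, which emits no
instructions at all. A minimal sketch of why it constrains the
optimizer - the macro mirrors the kernel's definition, while the
flag-polling function is a made-up illustration, not kernel code:

  /* Compiler-only barrier: emits no instructions, but GCC must assume
   * all memory may have changed, so it can neither cache memory values
   * in registers across it nor move memory accesses around it. */
  #define barrier() __asm__ __volatile__("" ::: "memory")

  static int flag;	/* made-up example: set by an interrupt handler */

  void wait_for_flag(void)
  {
  	while (!flag)		/* without the barrier GCC could hoist  */
  		barrier();	/* this load and spin on a stale value  */
  }

This also shows why the codegen cost measured above is so small: the
barrier only forbids caching and movement of memory accesses across one
program point, it takes nothing else away from the optimizer.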
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Linus Torvalds wrote:

> Admittedly, anybody who compiles with -pg probably doesn't care deeply
> about smaller and more efficient code, since the mcount call overhead
> tends to make the thing moot anyway, but it really looks like a
> win-win situation to just fix the mcount call sequence regardless.

Just a sidenote: due to dyn-ftrace, which patches out all mcount calls
during bootup to be NOPs (and opt-in patches them in again if someone
runs the function tracer), the cost is not as large as one would have
with, say, -pg based user-space profiling.

It's not completely zero-cost, as the pure NOPs balloon the I$
footprint a bit and GCC generates different code too in some cases. But
it's certainly good enough that it's generally pretty hard to prove via
micro or macro benchmarks that the patched-out mcount call sites are
there at all.

	Ingo
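To illustrate the mechanism: dyn-ftrace rewrites each 5-byte
"call mcount" that -pg emits into a 5-byte NOP at bootup. A simplified
sketch of the idea - illustrative only, the real code in
arch/x86/kernel/ftrace.c also has to synchronize the live patching
against concurrently executing CPUs:

  #include <string.h>

  #define MCOUNT_INSN_SIZE 5	/* "call mcount" is 5 bytes on x86 */

  /* a 5-byte x86 NOP (one of the variants CPUs execute cheaply) */
  static const unsigned char nop5[MCOUNT_INSN_SIZE] = {
  	0x0f, 0x1f, 0x44, 0x00, 0x00
  };

  /* patch out one mcount call site: overwrite the call with a NOP */
  static void patch_out_mcount(unsigned char *call_site)
  {
  	memcpy(call_site, nop5, MCOUNT_INSN_SIZE);
  }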
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Steven Rostedt wrote:

> On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote:
> > * Linus Torvalds wrote:
> >
> > > Admittedly, anybody who compiles with -pg probably doesn't care deeply
> > > about smaller and more efficient code, since the mcount call overhead
> > > tends to make the thing moot anyway, but it really looks like a
> > > win-win situation to just fix the mcount call sequence regardless.
> >
> > Just a sidenote: due to dyn-ftrace, which patches out all mcount calls
> > during bootup to be NOPs (and opt-in patches them in again if someone
> > runs the function tracer), the cost is not as large as one would have
> > with, say, -pg based user-space profiling.
> >
> > It's not completely zero-cost, as the pure NOPs balloon the I$
> > footprint a bit and GCC generates different code too in some cases.
> > But it's certainly good enough that it's generally pretty hard to
> > prove via micro or macro benchmarks that the patched-out mcount call
> > sites are there at all.
>
> And frame pointers do add a little overhead as well. Too bad the mcount
> ABI wasn't something like this:
>
> <function>:
>	call mcount
>	[...]
>
> This way, the function address for mcount would have been (%esp) and
> the parent address would be 4(%esp). Mcount would work without frame
> pointers and this whole mess would also become moot.

In that case we could also fix up static call sites to jump +5 bytes
into the function, and avoid the NOP as well in most cases. (That would
in essence merge any slow-path function epilogue with the mcount call
instruction in terms of I$ footprint - i.e. it would be an even lower
overhead feature.)

If only the kernel had its own compiler.

	Ingo
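Under that hypothetical ABI the tracer could recover both addresses
from the stack alone. A sketch with made-up helpers (32-bit, assuming
the call really is the function's very first instruction):

  #define CALL_INSN_SIZE 5	/* a near call is 5 bytes on x86 */

  /*
   * At mcount entry, 'sp' would point at the stack, where:
   *   sp[0] = return address into the traced function (its entry + 5)
   *   sp[1] = the caller's return address, i.e. the parent IP
   */
  static void *traced_func(void **sp)
  {
  	/* back up over the call to get the function's entry address */
  	return (unsigned char *)sp[0] - CALL_INSN_SIZE;
  }

  static void *parent_ip(void **sp)
  {
  	return sp[1];		/* no frame-pointer walk needed */
  }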
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Linus Torvalds wrote:

> [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT
>   totally unrelated numbers? Very confusing. ]

incidentally (or maybe not so incidentally) that got fixed yesterday in
-tip - at around the time i triggered that crash:

  350f8f5: x86: Eliminate redundant/contradicting cache line size config options

See the full commit below. The config that triggered the crash for me has:

  CONFIG_X86_L1_CACHE_SHIFT=4

so it's 16 bytes - and it's consistent now, which is a new angle. So i
think this explains why it stayed dormant for such a long time - it was
hidden by the cacheline-size config value inconsistencies.

	Ingo

------------------>
>From 350f8f5631922c7848ec4b530c111cb8c2ff7caa Mon Sep 17 00:00:00 2001
From: Jan Beulich
Date: Fri, 13 Nov 2009 11:54:40 +
Subject: [PATCH] x86: Eliminate redundant/contradicting cache line size
 config options

Rather than having X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT (with
inconsistent defaults), just having the latter suffices as the former
can be easily calculated from it.

To be consistent, also change X86_INTERNODE_CACHE_BYTES to
X86_INTERNODE_CACHE_SHIFT, and set it to 7 (128 bytes) for NUMA to
account for last level cache line size (which here matters more than
L1 cache line size).

Finally, make sure the default value for X86_L1_CACHE_SHIFT, when
X86_GENERIC is selected, is being seen before that for the individual
CPU model options (other than on x86-64, where GENERIC_CPU is part of
the choice construct, X86_GENERIC is a separate option on ix86).

Signed-off-by: Jan Beulich
Acked-by: Ravikiran Thirumalai
Acked-by: Nick Piggin
LKML-Reference: <4afd571002780001f...@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar
---
 arch/x86/Kconfig.cpu                   |   14 +++++---------
 arch/x86/boot/compressed/vmlinux.lds.S |    3 ++-
 arch/x86/include/asm/cache.h           |    7 ++++---
 arch/x86/kernel/vmlinux.lds.S          |   10 +++++-----
 arch/x86/mm/tlb.c                      |    3 ++-
 5 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f2824fb..621f2bd 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -301,15 +301,11 @@ config X86_CPU
 
 #
 # Define implied options from the CPU selection here
-config X86_L1_CACHE_BYTES
+config X86_INTERNODE_CACHE_SHIFT
 	int
-	default "128" if MPSC
-	default "64" if GENERIC_CPU || MK8 || MCORE2 || MATOM || X86_32
-
-config X86_INTERNODE_CACHE_BYTES
-	int
-	default "4096" if X86_VSMP
-	default X86_L1_CACHE_BYTES if !X86_VSMP
+	default "12" if X86_VSMP
+	default "7" if NUMA
+	default X86_L1_CACHE_SHIFT
 
 config X86_CMPXCHG
 	def_bool X86_64 || (X86_32 && !M386)
@@ -317,9 +313,9 @@ config X86_CMPXCHG
 config X86_L1_CACHE_SHIFT
 	int
 	default "7" if MPENTIUM4 || MPSC
+	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
 	default "4" if X86_ELAN || M486 || M386 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
-	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
 
 config X86_XADD
 	def_bool y
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index f4193bb..a6f1a59 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -4,6 +4,7 @@ OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT)
 
 #undef i386
 
+#include <asm/cache.h>
 #include <asm/page_types.h>
 
 #ifdef CONFIG_X86_64
@@ -46,7 +47,7 @@ SECTIONS
 		*(.data.*)
 		_edata = . ;
 	}
-	. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
+	. = ALIGN(L1_CACHE_BYTES);
 	.bss : {
 		_bss = . ;
 		*(.bss)
diff --git a/arch/x86/include/asm/cache.h b/arch/x86/include/asm/cache.h
index 549860d..2f9047c 100644
--- a/arch/x86/include/asm/cache.h
+++ b/arch/x86/include/asm/cache.h
@@ -9,12 +9,13 @@
 
 #define __read_mostly __attribute__((__section__(".data.read_mostly")))
 
+#define INTERNODE_CACHE_SHIFT	CONFIG_X86_INTERNODE_CACHE_SHIFT
+#define INTERNODE_CACHE_BYTES	(1 << INTERNODE_CACHE_SHIFT)
+
 #ifdef CONFIG_X86_VSMP
-/* vSMP Internode cacheline shift */
-#define INTERNODE_CACHE_SHIFT	(12)
 #ifdef CONFIG_SMP
 #define __cacheline_aligned_in_smp			\
-	__attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))	\
+	__attribute__((__aligned__(INTERNODE_CACHE_BYTES)))	\
 	__page_aligned_data
 #endif
 #endif
diff --git a/arch/x86/kernel/
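As a footnote, the SHIFT/BYTES relationship the commit unifies is a
simple power of two. With the crash config's value plugged in
(standalone defines for illustration, mirroring the asm/cache.h
pattern above):

  #define CONFIG_X86_L1_CACHE_SHIFT 4			/* the config above */
  #define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)
  #define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)	/* 1 << 4 == 16 bytes */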
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Steven Rostedt wrote:

> Ingo, Thomas and Linus,
>
> I know Thomas did a patch to force the -mtune=generic, but just in
> case gcc decides to do something crazy again, this patch will catch
> it.
>
> Should we try to get this in now?

Very nice example of defensive coding - i like this. I've queued it up
for .33 (unless anyone objects), as i think it's too late for .32.

	Ingo
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Thomas Gleixner wrote:

> On Tue, 24 Nov 2009, Jakub Jelinek wrote:
>
> > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > you should compile your code with -maccumulate-outgoing-args, and
> > > > there's no need to use -mtune=generic. Is that right?
> > >
> > > Seems to work. What other side effects has that ?
> >
> > Faster code, significant increase in code size though. Note that on
> > many architectures it is the only supported model.
>
> Just checked on the affected -marchs. The increase in code size is
> about 3% which is not that bad and definitely acceptable for the
> tracing case. Will zap the -mtune=generic patch and use
> -maccumulate-outgoing-args instead.

hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs,
so 3% is a big deal IMO.

	Ingo
Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue
* Thomas Gleixner wrote:

> On Wed, 25 Nov 2009, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >
> > > On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> > >
> > > > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > > > you should compile your code with -maccumulate-outgoing-args, and
> > > > > > there's no need to use -mtune=generic. Is that right?
> > > > >
> > > > > Seems to work. What other side effects has that ?
> > > >
> > > > Faster code, significant increase in code size though. Note that on
> > > > many architectures it is the only supported model.
> > >
> > > Just checked on the affected -marchs. The increase in code size is
> > > about 3% which is not that bad and definitely acceptable for the
> > > tracing case. Will zap the -mtune=generic patch and use
> > > -maccumulate-outgoing-args instead.
> >
> > hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs,
> > so 3% is a big deal IMO.
>
> Distro-configs have -mtune=generic anyway. So it's not changing
> anything for them.
>
> I'm talking about the -march flags which result in that weird code
> (pentium-mmx ...).

ok!

	Ingo
Re: [RFC patch] spindep: add cross cache lines checking
* Alex Shi wrote:

> > I think the check should be (__alignof__(lock) <
> > __alignof__(rwlock_t)), otherwise it will still pass when
> > you have a structure with attribute((packed,aligned(2)))
>
> reasonable!
>
> > > 1, is it an alignof bug for default gcc on my fc15 and Ubuntu 11.10 etc?
> > >
> > > struct sub {
> > > 	int raw_lock;
> > > 	char a;
> > > };
> > > struct foo {
> > > 	struct sub z;
> > > 	int slk;
> > > 	char y;
> > > } __attribute__((packed));
> > >
> > > struct foo f1;
> > >
> > > __alignof__(f1.z.raw_lock) is 4, but its address actually can align
> > > on one byte.
> >
> > That looks like correct behavior, because the alignment of
> > raw_lock inside of struct sub is still 4. But it does mean
> > that there can be cases where the compile-time check is not
> > sufficient, so we might want the run-time check as well, at
> > least under some config option.
>
> what's your opinion of this, Ingo?

Dunno. How many real bugs have you found via this patch?

Thanks,

	Ingo
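For the record, a standalone demo of the quirk being discussed - a
made-up test program, not part of the patch; the 'pad' member is added
here (it is not in the original example) to force the misalignment to
actually show up:

  #include <stdio.h>

  struct sub {
  	int  raw_lock;
  	char a;
  };

  struct foo {
  	char pad;	/* added to push z onto an odd offset */
  	struct sub z;
  	int  slk;
  	char y;
  } __attribute__((packed));

  static struct foo f1;

  int main(void)
  {
  	/* __alignof__ reports raw_lock's alignment within struct sub
  	 * (4), yet packing places it at a byte-aligned address, so a
  	 * compile-time alignment check can be fooled: */
  	printf("__alignof__ = %zu, address mod 4 = %zu\n",
  	       (size_t)__alignof__(f1.z.raw_lock),
  	       (size_t)((unsigned long)&f1.z.raw_lock % 4));
  	return 0;
  }

This is why the thread concludes that a compile-time check alone is
not sufficient and a run-time check would be needed as well.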
Re: [RFC patch] spindep: add cross cache lines checking
* Alex Shi wrote:

> On Wed, 2012-03-07 at 14:39 +0100, Ingo Molnar wrote:
> > * Alex Shi wrote:
> >
> > > > I think the check should be (__alignof__(lock) <
> > > > __alignof__(rwlock_t)), otherwise it will still pass when
> > > > you have a structure with attribute((packed,aligned(2)))
> > >
> > > reasonable!
> > >
> > > > > 1, is it an alignof bug for default gcc on my fc15 and Ubuntu
> > > > > 11.10 etc?
> > > > >
> > > > > struct sub {
> > > > > 	int raw_lock;
> > > > > 	char a;
> > > > > };
> > > > > struct foo {
> > > > > 	struct sub z;
> > > > > 	int slk;
> > > > > 	char y;
> > > > > } __attribute__((packed));
> > > > >
> > > > > struct foo f1;
> > > > >
> > > > > __alignof__(f1.z.raw_lock) is 4, but its address actually can
> > > > > align on one byte.
> > > >
> > > > That looks like correct behavior, because the alignment of
> > > > raw_lock inside of struct sub is still 4. But it does mean
> > > > that there can be cases where the compile-time check is not
> > > > sufficient, so we might want the run-time check as well, at
> > > > least under some config option.
> > >
> > > what's your opinion of this, Ingo?
> >
> > Dunno. How many real bugs have you found via this patch?
>
> None. Guess the stupid code was shot down in lkml review. But if the
> patch goes in, it will help block such code during development.

The question is, if in the last 10 years not a single such case made it
through to today's 15 million lines of kernel code, why should we add
the check now?

If it was a simple build time check then maybe - but judging by the
discussion it does not seem so simple, does it?

Thanks,

	Ingo