[Bug tree-optimization/94527] RFE: Add an __attribute__ that marks a function as freeing an object

2020-04-07 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

Linus Torvalds  changed:

           What    |Removed |Added
                 CC|        |torvalds@linux-foundation.org

--- Comment #2 from Linus Torvalds  ---
One silly example of potential for dead store elimination would be
kernel/async.c: async_run_entry_fn(), where we have

        /* 2) remove self from the pending queues */
        spin_lock_irqsave(&async_lock, flags);
        list_del_init(&entry->domain_list);
        list_del_init(&entry->global_list);

        /* 3) free the entry */
        kfree(entry);  
        atomic_dec(&entry_count);

and while it's good form to do "list_del_init()" on those fields in entry, the
fact that we then do "kfree(entry)" right afterwards means that the stores that
re-initialize the entry list are dead.

If gcc knew that "kfree(entry)" de-allocates the entry pointer, it could remove
them, the same way gcc already removes dead stores to automatic variables.
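For illustration, here is a minimal user-space sketch of the same pattern using
plain free(), where (as a later comment notes) the compiler already knows the
object's lifetime ends; the attribute requested here would let kfree() get the
same treatment:

  #include <stdlib.h>

  struct entry {
          struct entry *next, *prev;
  };

  void drop(struct entry *e)
  {
          /* Both stores are dead: the object is freed immediately below, so
           * a compiler that knows free() ends the lifetime of *e may delete
           * them. */
          e->next = e;
          e->prev = e;
          free(e);
  }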

But again: warnings about mis-use are likely more important than DSE. We have
had the silly kinds of bugs where we move code around, and some access remains
after a free. We have good tools (both static and dynamic) to find those
after-the-fact, of course, but the compiler warning about stupidity is even
better.

[Bug tree-optimization/94527] RFE: Add an __attribute__ that marks a function as freeing an object

2020-04-07 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

--- Comment #4 from Linus Torvalds  ---
(In reply to Jeffrey A. Law from comment #3)
> GCC already knows that free() "kills" the pointed-to memory and should be
> doing DSE with that in mind.  It doesn't however know that other functions
> have free-like semantics, so it wouldn't do so for kfree.

Oh, ok, so the logic already exists, just not the interface to tell anybody
else.

I suspect even non-kernel users might have wrappers around free that might be
able to use a "this acts like free()" marker.
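Purely as a sketch of what such a marker might look like for a user-space
wrapper (the two-argument malloc attribute shown here is the allocator/
deallocator pairing interface GCC later grew in GCC 11; it did not exist when
this comment was written, so treat the exact spelling as an assumption):

  #include <stdlib.h>

  void my_free(void *p)           /* wrapper with free()-like semantics */
  {
          free(p);
  }

  /* Pair my_alloc() with my_free(), so the compiler and analyzer know that
   * my_free() ends the lifetime of objects returned by my_alloc(). */
  void *my_alloc(size_t n) __attribute__((malloc, malloc(my_free, 1)));

  void *my_alloc(size_t n)
  {
          return malloc(n);
  }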

> With regard to the warnings.  When we were investigating use-after-free and
> double-free diagnostics it was our conclusion that to do any kind of
> reasonable job you really have to do a whole program analysis.  Otherwise
> it's just a toy.  As a result the focal point for those diagnostics is the
> static analyzer David Malcolm is working on.

Obviously a static analyzer is better.

That said, we've had some stupid bugs wrt kfree(). Things like releasing things
twice in error paths etc.

So yeah, doing it in the compiler isn't going to catch the subtle cases, but
catching the stupid cases early would still be a good thing.

But I also realize that it might not be worth it to you guys. Since you already
effectively have the DSE code, that looks like a much cheaper thing to do.

(And maybe one day somebody will go "I can trivially see use-after-free things
too, and warn about it", so just having the marker might result in the warnings
at some point too).

[Bug middle-end/94527] RFE: Add an __attribute__ that marks a function as freeing an object

2020-04-08 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94527

--- Comment #8 from Linus Torvalds  ---
(In reply to Jonathan Wakely from comment #6)
> I can see uses that aren't just "frees the memory", e.g. after fclose and
> close any further uses of their argument are probably errors. The close case
> is interesting because it's not a pointer argument.

I suspect you'll find that if it's not a pointer argument, it ends up being not
very useful.

To take your "close()" example: sure, after calling close, the fd isn't active
any more, and you can't use it for IO. So it's similar to free() in many
respects: the lifetime is over and done with as a file descriptor.

But it's very different in other ways. For example, it is _entirely_ valid to
do something like this:

 close(fd);
 mark_fd_freed(fd);

where the application keeps track of free fd ranges on its own (or it keeps
track of a set of file descriptors it is polling, or whatever).

So you can still validly use the 'fd' value even after you closed it, in ways
that are not really the case with pointers from memory allocators and free().

Dereferencing a pointer after freeing it is basically _always_ a bug, but using
a file descriptor after closing it might not be.

[Bug inline-asm/94522] Enhancement request: asm goto with outputs

2020-04-08 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94522

Linus Torvalds  changed:

           What    |Removed |Added
                 CC|        |torvalds@linux-foundation.org

--- Comment #4 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #1)
> Take:
> asm goto ("" : "=r"(*m) :);
> 
> Where does the store to *m happen?  Do you replicate it on to the label side
> too?
> 
> What are the semantics for the above case on clang?  My bet is it is not
> well defined and really broken.

The outputs are defined to have values only in the fallthrough case.

On the label side, the outputs (whether memory or register) are explicitly
undefined. 

The outputs may obviously not even be in scope except in the fallthrough.

Now, with memory ops, it's obviously quite possible that the programmer then
knows that they wrote an inline asm that did the write to memory before it did
the goto. 

But that's no different from a _non_output_ "asm goto" that you pass a memory
pointer to, so I don't think that's something that is new to this situation.
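For readers who have not seen the construct, a minimal sketch of the syntax
under discussion (made-up asm body and names; clang-only at the time, with GCC
gaining support in later releases). It also shows the semantics described
above: the output is only meaningful on the fallthrough path:

  static inline int read_nonnegative(int *p)
  {
          int val;

          asm goto("movl  %1, %0\n\t"
                   "testl %0, %0\n\t"
                   "js    %l[negative]"     /* may branch to the C label */
                   : "=r"(val)              /* defined only on fallthrough */
                   : "m"(*p)
                   : "cc"
                   : negative);
          return val;                       /* fallthrough: output is valid */

  negative:
          return -1;                        /* label path: 'val' is undefined */
  }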

As to whether this is simple or not to do in gcc - the clang implementation has
been buggy so far, although it's good enough for testing ;)

[Bug inline-asm/94522] Enhancement request: asm goto with outputs

2020-04-08 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94522

--- Comment #5 from Linus Torvalds  ---
Btw, Nick (who is doing this on the clang side) tells me that the tcmalloc
people are looking at using asm goto with outputs too, so it's not just the
kernel.

If somebody wants to play with it, I do have a patch to use it in the kernel as
a test-case.

HOWEVER - I'm working on cleaning up some of the infrastructure around it, but
at least for now, that patch is an "all or nothing" thing: it unconditionally
requires asm goto w/ outputs support, so there's no fallback to an "older
compiler without asm goto with outputs" codepath.

With the cleanups, I hope to have a patch that can swing both ways, but right
now it's very much for testing only, and I don't want to make that patch too
public because it will just break for anybody with a non-experimental compiler.

[Bug c/89501] New: Odd lack of warning about missing initialization

2019-02-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

Bug ID: 89501
   Summary: Odd lack of warning about missing initialization
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torva...@linux-foundation.org
  Target Milestone: ---

Created attachment 45820
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45820&action=edit
You can compile this to see a lack of warning.

We had a simple kernel patch that introduced a stupid bug due to an
uninitialized variable, and while we got it all sorted out and the fix was
trivial, people were surprised by the lack of warning for the uninitialized
case.

I'm adding a test-case as an attachment that largely matches the kernel code
that didn't warn. But it boils down to a pattern of

 int ret;  /* UNINITIALIZED */

 if (somecondition) {
     ret = functioncall(x);
     if (ret)
         return ret;
 }
 .. some more work ..
 return ret;   /* Possibly uninitialized return value! */
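For reference, a self-contained version of that pattern (placeholder names,
not the actual attachment; whether the warning actually goes missing can
depend on the gcc version and the surrounding code):

  extern int functioncall(int x);
  extern void some_more_work(void);

  int report(int somecondition, int x)
  {
          int ret;                  /* UNINITIALIZED */

          if (somecondition) {
                  ret = functioncall(x);
                  if (ret)
                          return ret;
          }
          some_more_work();         /* .. some more work .. */
          return ret;               /* possibly uninitialized return value */
  }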

What I *suspect* happens is

 (a) gcc sees that there is only one assignment to "ret"

 (b) in the same basic block as the assignment, there is a test against "ret"
being nonzero that goes out.

and what I think happens is that (a) causes gcc to consider that assignment to
be the defining assignment (which makes all kinds of sense in an SSA world),
and then (b) means that gcc decides that clearly "ret" has to be zero in any
case that doesn't go out due to the if-test.

So it turns out that gcc almost by mistake generates code that works (and
doesn't warn about it, exactly because it works), even though the source code
was clearly buggy. 

The attached test-case is stupid but intentionally made to be as close to the
kernel source case as possible. With it, I can do:

Look, ma: no warning:

   gcc -O2 -S -Wall test.c

but this gives the expected warning:

   gcc -O2 -S -Wall -DHIDE_PROBLEM test.c

Regardless, this is not a huge problem for us, but I decided to make a report
since we have a test case, and maybe somebody gets excited about it.

Thanks,

Linus

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-02-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #2 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #1)
> I think it comes down to the same issue as PR 18501.

Very possibly the same issue in just a different guise.

NOTE! I have in the meantime verified that yes, it does seem to be about the
pattern

   int x;

   if (somecondition) {
       x = something();
       if (x != XYZ)
           return x;
   }

   return x;

where gcc seems to turn the "if (x != XYZ) return x" to mean that "x" clearly
_has_ to be XYZ elsewhere.

If I change my kernel-based test-case to do

    if (ret != 1)
        return ret;

instead of the original

    if (ret)
        return ret;

then gcc will actually generate code that ends with

    movl    $1, %eax
    popq    %rbp
    popq    %r12
    ret

ie it will basically consider "ret" to be initialized to that value "1", even
if the basic block that assigned it was never actually executed.

Knowing how SSA works, I'm not entirely surprised, but obviously if you'd like
to see the warning about buggy source code, it's less than optimal.

Anyway, this shouldn't be a high priority, but it does strike me as a
potentially fairly common pattern that people might be missing warnings for.

  Linus

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-02-26 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #6 from Linus Torvalds  ---
(In reply to Richard Biener from comment #5)
> 
> And this meeting helps us avoid bogus warnings for cases where GCC has
> difficulties proving dead code paths are actually dead ... 

Ack. I do see the difficulty. We already disable some of the
"may-be-uninitialized" warnings in the kernel because they end up being too
noisy for some configurations.

NP. If you guys figure something out, it will be good. If not, we'll survive.

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-02-26 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #8 from Linus Torvalds  ---
(In reply to Jeffrey A. Law from comment #7)
> It's reliably the case that a false positive uninit warning also represents
> a failure to optimize something.  So we've got significant incentives to
> deeply analyze and look for fixes.  So feel free to pass examples of those
> along.

Well, most of it is due to interactions with *other* issues entirely.

For example, when we enable GCOV for coverage checking, we have to disable
tree-loop-im, because of excessive stack usage due to gcc bugzilla 69702.

And disabling that optimization then leads to bogus "might be used
uninitialized" warnings.

We have a few other cases like that. Eg we can't use -Os without also disabling
-Wmaybe-uninitialized etc.

Some of the cases may be historical (ie maybe you've fixed the issues that
cause us to do that in newer versions), but for various distro reasons we end
up supporting old compilers for a _looong_ time.

Not that long ago we ended up increasing the minimum required gcc version for
the kernel to gcc-4.6.

So we have this odd "we love using new features" fight with "oops, but we end
up having people using relatively ancient compiler versions".

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-03-04 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #10 from Linus Torvalds  ---
(In reply to ncm from comment #9)
> What I don't understand is why it doesn't optimize away the check on
> (somecondition), since it is assuming the code in the dependent block always
> runs.

No, it very much doesn't assume that. The 'somecondition' really is dynamic.

What happens is simply that because gcc sees only a single assignment to the
variable, and that assignment is then limited by the subsequent value test to a
single value, gcc will just say "ok, any other place where that variable is
used, just use the known single value".

And it does that whether the 'if (somecondition)' path was taken or not.

It's a perfectly valid optimization. In fact, it's a lovely optimization, I'm
not at all complaining about the code generation.

It's just that as part of that (quite reasonable) optimization it also
optimized away the whole "oops, there wasn't really a valid initialization in
case the if-statement wasn't done".

Obviously that's undefined behavior, and the optimization is valid regardless,
but the lack of warning means that we didn't see that we had technically
undefined behavior that the compiler has silently just fixed up for us.

I think the cause of this all is quite reasonable and understandable, and I
also see why gcc really does want to throw away the undefined case entirely
(because otherwise you can get into the reverse situation where you warn
unnecessarily, because gcc isn't smart enough to see that some undefined case
will never ever actually happen). 

Plus I assume it simplifies things a lot to just not even have to track the
undefined case at all. You can just track "ok, down this path we have a known
value for this SSA, and we don't need to keep track of any inconvenient phi
nodes etc".

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-03-04 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #12 from Linus Torvalds  ---
(In reply to Jeffrey A. Law from comment #11)
> 
> More generally we  have considered whether or not we could eliminate the
> control dependent path which leads to undefined behavior.  But you have to
> be careful  because statements on the path prior to the actual undefined
> behavior may have observable side effects.

Note that for the kernel, we consider those kinds of "optimizations" completely
and utterly wrong-headed, and will absolutely refuse to use a compiler that
does things like that.

It's basically the compiler saying "I don't care what you meant, I can do
anything I want, and that means I will screw the code up on purpose".

I will personally switch the kernel immediately to clang the moment we cannot
turn off idiotic broken behavior like that.

We currently already disable 'strict-aliasing', 'strict-overflow' and
'delete-null-pointer-checks'. Because compilers that intentionally create
garbage code are evil and wrong.

Compiler standards bodies that add even more of them (the C aliasing rules
being the prime example) should be ashamed of themselves.

And compiler people that think it's ok to say "it's undefined, so we do
anything we damn well like" shouldn't be working with compilers IMNSHO.

If you know something is undefined,  you warn about it. You don't silently
generate random code that doesn't match the source code just because some paper
standard says you "can".

[Bug c/61904] New: Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

Bug ID: 61904
   Summary: Incorrect stack red-zoning on x86-64 code generation
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: critical
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torva...@linux-foundation.org

gcc-4.9.0 in Debian seems to miscompile the linux kernel for x86-64 in certain
configurations, creating accesses to below the stack pointer even though the
kernel uses -mno-red-zone.

The kernel cannot use the x86-64 stack red-zoning, because the hardware only
switches stacks on privilege transfers, so interrupts that happen in kernel
mode will not honor the normal 128-byte stack red-zone.

Attached is the pre-processed C code of the current kernel file

   kernel/sched/fair.c

which apparently on gcc-4.9.0 will miscompile the function "load_balance()",
creating code like this:

load_balance:
.LFB2408:
        .loc 2 6487 0
        .cfi_startproc
.LVL1355:
        pushq   %rbp    #
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp      #,
        .cfi_def_cfa_register 6
        pushq   %r15    #
        pushq   %r14    #
        pushq   %r13    #
        pushq   %r12    #
        .cfi_offset 15, -24
        .cfi_offset 14, -32
        .cfi_offset 13, -40
        .cfi_offset 12, -48
        movq    %rdx, %r12      # sd, sd
        pushq   %rbx    #
.LBB2877:
        .loc 2 6493 0
        movq    $load_balance_mask, -136(%rbp)  #, %sfp
.LBE2877:
        .loc 2 6487 0
        subq    $184, %rsp      #,
        .cfi_offset 3, -56
        .loc 2 6489 0


Note the "subq $184, %rsp" *after* the compiler has already spilled to the
stack (the spill is insane, btw, since it's spilling a constant value!)

The second attachment is the reported mis-compiled result. I don't personally
have the affected gcc version, but you can see the options passed into the
compiler in the resulting "fair.s" file. The "-Os" in particular seems to be
important, with the bug not happening with "-O2".


[Bug c/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #1 from Linus Torvalds  ---
Created attachment 33183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33183&action=edit
Buggy result of compilation

This was sent by the reporter of the kernel problem, Michel Dänzer, at my
request.

It's gzipped to fit the file size limit.

[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #3 from Linus Torvalds  ---
Created attachment 33184
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33184&action=edit
gzipped pre-processed fair.i file

Apparently attaching a file during the initial bug entry creation failed,
probably because it failed the size check. So here's the fair.i file.


[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #4 from Linus Torvalds  ---
As I already mentioned to Jakub Jelinek separately in the original email thread
about the kernel problem:

 "Note that I don't personally have a reproducer (my machine has
  gcc-4.8.3, and I don't see the same behavior), but I included the
  incorrect fair.s file that Michel sent me (which has all the command
  line options in it), and a pre-processed "fair.i" source file that I
  generated and that *should* match the configuration that was the
  source for that result. So there might be some version/configuration
  skew there between the two files, but I think they match.

  Holler if you cannot reproduce the problem based on that."

so if the attached *.i and resulting buggy *.s files don't match up perfectly,
that's the explanation.


[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #7 from Linus Torvalds  ---
Just for completeness for the original bug report, here's the actual compiler
command line that the kernel uses to generate the *.s file.

  gcc -Wp,-MD,kernel/sched/.fair.s.d  -nostdinc -isystem
/usr/lib/gcc/x86_64-redhat-linux/4.8.3/include -I./arch/x86/include
-Iarch/x86/include/generated  -Iinclude -I./arch/x86/include/uapi
-Iarch/x86/include/generated/uapi -I./include/uapi -Iinclude/generated/uapi
-include ./include/linux/kconfig.h -D__KERNEL__ -Wall -Wundef
-Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common
-Werror-implicit-function-declaration -Wno-format-security -m64 -mno-mmx
-mno-sse -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3
-mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time
-maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1
-DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_CRC32=1
-DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -pipe -Wno-sign-compare
-fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
-fno-delete-null-pointer-checks -Os -Wno-maybe-uninitialized
-Wframe-larger-than=2048 -fno-stack-protector -Wno-unused-but-set-variable
-fno-omit-frame-pointer -fno-optimize-sibling-calls -g
-Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow
-fconserve-stack -Werror=implicit-int -Werror=strict-prototypes
-DCC_HAVE_ASM_GOTO -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(fair)"
 -D"KBUILD_MODNAME=KBUILD_STR(fair)"  -fverbose-asm -S -o kernel/sched/fair.s
kernel/sched/fair.c

although looking at the generated .s file in the attachment, I really think
it's all there too in the comments at the top of the file.


[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #8 from Linus Torvalds  ---
Oh, and this is marked as a duplicate of 61801, but that one is marked to be in
gcc-4.8.3.

The particular problem we see in kernel code generation seems to *not* happen
in 4.8.3 (current fedora 20), and happens with Debian 4.9.0.

So now I worry about

 (a) whether the duplicate bug is really true

 (b) or perhaps 4.8.3 really does have the same problem and we might get bitten
there too, and it just happens to trigger on 4.9.0 for some otherwise unrelated
reason.

I'd like to have some way to detect known-bad compilers, so that we know to
avoid them...


[Bug target/61904] Incorrect stack red-zoning on x86-64 code generation

2014-07-25 Thread torva...@linux-foundation.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61904

--- Comment #10 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #9)
> 
> The bad compiler versions are 4.5.0 (when debug_insn came in) to 4.8.3 and
> 4.9.0 and 4.9.1.

Ok, so we have no reasonable way of avoiding the problem compiler version. I
had hoped that we could just do a warning if people use 4.9.0/1, since they
aren't very common yet.

Ugh. We had one suggestion of having some post-compile checking pass, but that
one was (so far) just handwaving ("objdump + perl-script"). It doesn't sound
very pleasant.

The problem is that these things are a bitch to debug - they turn into these
completely impossible kernel oopses or corruption, and we were just very lucky
that this one case happened to be repeatable and pinpoint for two people. Are
there others? We have no way of knowing..

Anyway, thanks for the quick resolution, even if I'm now rather nervous about
existing compilers..


[Bug other/48696] New: Horrible bitfield code generation on x86

2011-04-19 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

   Summary: Horrible bitfield code generation on x86
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: torva...@linux-foundation.org


gcc (tried 4.5.1 and 4.6.0) generates absolutely horrid code for some common
bitfield accesses due to minimizing the access size.

Trivial test case:

  struct bad_gcc_code_generation {
        unsigned type:6,
                 pos:16,
                 stream:10;
  };

  int show_bug(struct bad_gcc_code_generation *a)
  {
        a->type = 0;
        return a->pos;
  }

will generate code like this on x86-64 with -O2:

        andb    $-64, (%rdi)
        movl    (%rdi), %eax
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret

where the problem is the byte access write followed by the word access read.

Most (all?) modern x86 CPU's will come to a screeching halt when they see a
read that hits a store buffer entry, but cannot be fully forwarded from it. The
penalty can be quite severe, and this is _very_ noticeable in profiles.

This code would be _much_ faster either using an "andl" (making the store size
match the next load, and thus forwarding through the store buffer), or by
having the load be done first. 

(The above code snippet is not the real code I noticed it on, obviously, but
real code definitely sees this, and profiling shows very clearly how the 32-bit
load from memory basically stops cold due to the partial store buffer hit)

Using non-native accesses to memory is fine for loads (so narrowing the access
for a pure load is fine), but for store or read-modify-write instructions it's
almost always a serious performance problem to try to "optimize" the memory
operand size to something smaller.

Yes, the constants often shrink, but the code becomes *much* slower unless you
can guarantee that there are no loads of the original access size that follow
the write.
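One source-level way around this (a sketch, not taken from the kernel code in
question) is to drop the bitfield and do the shift/mask by hand on a plain
unsigned int, so that every access - including the read-modify-write - uses
the same 32-bit width and the store can be forwarded to the following load:

  struct manual_fields {
          unsigned int raw;       /* type:6, pos:16, stream:10 packed by hand */
  };

  #define TYPE_MASK  0x3fu        /* bits 0..5 */
  #define POS_SHIFT  6
  #define POS_MASK   0xffffu      /* 16 bits starting at bit 6 */

  static inline void clear_type(struct manual_fields *a)
  {
          a->raw &= ~TYPE_MASK;   /* 32-bit RMW, same width as later loads */
  }

  static inline unsigned int get_pos(const struct manual_fields *a)
  {
          return (a->raw >> POS_SHIFT) & POS_MASK;
  }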


[Bug other/48696] Horrible bitfield code generation on x86

2011-04-19 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #1 from Linus Torvalds  2011-04-20 
04:28:18 UTC ---
Side note, this is talked about in the Intel optimization manual

  3.6.4 Store Forwarding

Under "General Optimization Guidelines" -> "Optimizing Memory Accesses" ->
"Store Forwarding".

I'm sure people are aware of it, but I'm not certain people realize how
horrible the performance hit can be. I have to admit to having been a bit
surprised myself just _how_ clearly it shows up in profiles.


[Bug rtl-optimization/48696] Horrible bitfield code generation on x86

2011-04-20 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #7 from Linus Torvalds  2011-04-20 
15:30:17 UTC ---
(In reply to comment #2)
> 
> I'm not sure where to best address this, rather than throwing in again
> the idea of lowering bitfield accesses early on trees.

So my gut feel is that getting rid of the bitfield as early as possible, and
turning all bitfield accesses into regular load/shift/mask/store operations is
always the right thing to do.

I also think that doing it with the size that the user specified is generally a
good idea, ie I sincerely hope that gcc hasn't thrown away the "unsigned int"
part of the type when it does the lowering of the bitfield op.

If gcc has forgotten the underlying type, and only looks at the bitfield size
and offset, gcc will likely never do a good job at it unless gcc gets _really_
smart and looks at all the accesses around it and decides "I need to do these
as 'int'" just because of the context (ie in the example, the "unsigned" base
type is as important as the "bits 0..5" range information).

So I suspect it's better to just do a totally mindless expansion of bitfield
accesses early, and then use all the regular optimizations on them. Rather than
keep them around as bitfields and try to optimize at some higher level.

In an ironic twist, the real program that shows this optimization problem is
"sparse" (the kernel source code checker), which can actually do a "linearize
and optimize" the test-case itself, and in this case does this all better than
gcc (using its "dump the linearized IR" test-program):

  [torvalds@i5 ~]$ ./src/sparse/test-linearize test.c 
  test.c:7:5: warning: symbol 'show_bug' was not declared. Should it be static?
  show_bug:
  .L0x7f4cf7b93010:

        load.32     %r2 <- 0[%arg1]
        and.32      %r3 <- %r2, $-64
        store.32    %r3 -> 0[%arg1]
        lsr.32      %r7 <- %r3, $6
        cast.32     %r8 <- (16) %r7
        ret.32      %r8

Heh. Sparse may get a lot of other things wrong, but it got this particular
case right.


[Bug rtl-optimization/48696] Horrible bitfield code generation on x86

2011-04-20 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

--- Comment #11 from Linus Torvalds  2011-04-20 
16:16:52 UTC ---
(In reply to comment #8)
> 
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...).  But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).  

That sounds fine.

The only reason to bother with the "underlying type" is that I suspect it could
be possible for educated programmers to use it as a code generation hint. IOW,
if all the individual fields end up fitting nicely in "char", using that as a
base type (even if the _total_ fields don't fit in a single byte) might be a
good hint for the compiler that it can/should use byte accesses and small
constants.
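As an illustration of that hint (hypothetical struct; char-typed bit-fields
are an extension that gcc accepts): with a char base type every individual
field fits in a byte, which could be taken as permission to use byte-sized
accesses and small immediates:

  struct byte_hinted {
          unsigned char type : 6;   /* shares one byte with 'flag' */
          unsigned char flag : 2;
          unsigned char small;      /* its own byte */
  };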

But using the biggest aligned word-size is probably equally good in practice.

And if you end up narrowing the types on _reads_, I think that's fine on x86. I
forget the exact store buffer forwarding rules (and they probably vary a bit
between different microarchitectures anyway), but I think almost all of them
support forwarding a larger store into a smaller (aligned) load.

It's just the writes that should normally not be narrowed.

(Of course, sometimes you may really want to narrow it. Replacing a

   andl $0xffffff00,(%rax)

with a simple

   movb $0,(%rax)

is certainly a very tempting optimization, but it really only works if there
are no subsequent word-sized loads that would get fouled by the write buffer
entry.)


[Bug other/49095] New: Horrible code generation for trivial decrement with test

2011-05-20 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

   Summary: Horrible code generation for trivial decrement with
test
   Product: gcc
   Version: 4.5.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: torva...@linux-foundation.org


This trivial code:

  extern void fncall(void *);

  int main(int argc, char **argv)
  {
        if (!--*argv)
                fncall(argv);
        return 0;
  }

compiles into this ridiculous x86-64 assembly language:

        movq    (%rsi), %rax
        subq    $1, %rax
        testq   %rax, %rax
        movq    %rax, (%rsi)
        je      .L4

for the "decrement and test result" at -O2. 

I'd have expected that any reasonable compiler would generate something like

        decq    (%rsi)
        je      .L4

instead, which would be smaller and faster (even a "subq $1" would be fine, but
the decq is one byte shorter).

The problem is more noticeable when the memory location is a structure offset,
where the "load+decrement+store" model really results in relatively much bigger
code due to the silly repetition of the memory address, for absolutely no
advantage.

Is there some way that I haven't found to make gcc use the rmw instructions?


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-21 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #2 from Linus Torvalds  2011-05-21 
18:41:15 UTC ---
(In reply to comment #1)
>
> On the RTL side combine tries to do
> 
> Trying 7, 8 -> 9:
> Failed to match this instruction:
> (parallel [
> (set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
> (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
> A64])
> (const_int -1 [0x])))
> (set (reg/f:DI 60 [ D.2723 ])
> (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
> A64])
> (const_int -1 [0x])))
> ])
> 
> because we have a use of the decrement result in the comparison.  It doesn't
> try to combine this with the comparison though.

Why isn't there a trivial pattern for the combination of "add+cmp0"? It sounds
like a peephole optimization to me.

> So this case is really special ;)  Without the use of the decremented
> value we get the desired subq $1, (%rsi).

The whole notion of "decrement and check if zero" is just about as special as
mud. 

And I realize that without the "check if zero" part I get the single rmw
instruction, but I was really hoping that gcc would get this kind of really
obvious code right. There is absolutely no question about what the correct
result is, and gcc simply doesn't generate it.

I'm used to gcc sometimes being confused by more complicated things (inline
asms, bitfields etc), but this is really basic code.

The load-store model is fine for a Pentium 4 - those things were not very good
at complex instructions. But it generates horribly big code, and modern x86
chips all want the "operate on memory" version.

> Manually sinking the store to *argv into the if and the else yields

Yeah. And that's pretty horrible. 

> As usual combine doesn't like stores.

Is there some reason this can't just be a peephole pattern?

I really thought that gcc has done this before. 

   Linus


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-21 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #3 from Linus Torvalds  2011-05-21 
20:42:26 UTC ---
Hmm. Looking at that code generation, it strikes me that even with the odd
load/store situation, why do we have that "test" instruction?

   c:   8b 10                   mov    (%eax),%edx
   e:   83 ea 01                sub    $0x1,%edx
  11:   85 d2                   test   %edx,%edx
  13:   89 10                   mov    %edx,(%eax)
  15:   74 09                   je     20

iow, regardless of any complexities of the store, that "sub + test" is just
odd. Gcc knows to simplify that particular sequence in other situations, why
doesn't it simplify it here?

IOW, I can make gcc generate code like

   c:   83 e8 01                sub    $0x1,%eax
   f:   75 07                   jne    18

with no real problem when it's in registers. No "test" instruction after the
sub. Why does that store matter so much?

It looks like the combine is being driven by the conditional branch, and then
when the previous instruction from the conditional branch is that store,
everything kind of goes to hell.

Would it be possible to have a peephole for the "arithmetic/logical +
compare-with-zero" case (causing us to just drop the compare), and then have a
separate peephole optimization that triggers the "load + op + store with dead
reg" and turns that into a "op to mem" case?

The MD files do make me confused, so maybe there is some fundamental limitation
to the peephole patterns that makes this impossible?


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #6 from Linus Torvalds  2011-05-27 
14:15:25 UTC ---
Jakub - the patch looks sane, although I don't currently have a gcc build
environment to actually test it with, and there is no way I'm going to claim
that I follow all the matches and understand why you have that third pattern
with SWI12 (but I'm guessing it's because the op and the test are being done in
the "explicit cast to integer" mode).

One thing I wanted to check, though: I hope everybody is aware that

   op $x,mem

is not identically the same as

   mov mem,reg
   op $x, reg
   mov reg,mem
   test reg

WRT the carry flag. The "test" version will always clear the carry flag, while
the "op" version will obviously set it according to the "op". For the logical
ops, this isn't a problem (they also clear carry), but for "add" and "sub" this
transformation *will* result in a difference in the C flag.

Now, "carry" is meaningless when talking about compare with zero, so this
should all be trivially ok. But I bring it up because it is possible that gcc
perhaps still uses the "unsigned compare" versions with zero.

In particular, the thing to look out for would be code like

  unsigned int *a;

  if (--*a > 0)
      do_something();

and if the comparison uses "ja" ("unsigned above") for it, it will get the
wrong end result.

Now, unsigned compares against zero are always expressible as "ne" or "eq", so
this is not a fundamental problem. In fact, my trivial testing with a few cases
seems to imply that gcc already does that conversion to inequality, and you'd
obviously have exactly the same issue with eliding the "test" instruction for
the cases you already handle (when things are in registers).

So I think everything is fine - but I wanted to mention it explicitly in case
it makes you go "ooh, yes, there are cases when this is an issue".


[Bug other/49194] New: Trivially stupid inlining decisions

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

   Summary: Trivially stupid inlining decisions
   Product: gcc
   Version: 4.5.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: torva...@linux-foundation.org


So gcc's inlining heuristics seem to include the rule that if a function is
called from a single call-site, and is static, then it should be inlined.

That is actually really horrible for some cases. 

In particular, if you have a small function that has an uncommon case handled
by a much bigger function, you absolutely do *not* want to inline the big
function because it ends up creating a total monster of a stack frame -
something that the small function on its own doesn't need.

So for example, in the kernel we have various of these kinds of situations
where we decrement a refcount or a lock, and then in the unlikely situation
that something is true, we need to do much more processing as a result. An
example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable
 - if that variable is now zero, *and* we have a pending RCU event, we need to
do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc
that are irrelevant to the inlining issue) as

        --t->rcu_read_lock_nesting;
        if (t->rcu_read_lock_nesting == 0 &&
            __builtin_expect(!!t->rcu_read_unlock_special, 0))
                rcu_read_unlock_special(t);

(where "t" is the thread pointer).

And that's all. However, because gcc inlines the special case, the function
prologue ends up looking like this (that "current_task" is from our inline asm,
it's just us getting the thread variable):

  __rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $48, %rsp       #,
        movq    %r13, -24(%rbp) #,
        movq    %rbx, -40(%rbp) #,
        movq    %r12, -32(%rbp) #,
        movq    %r14, -16(%rbp) #,
        movq    %r15, -8(%rbp)  #,
        movq    %gs:current_task,%r13   #, t
        decl    256(%r13)       # t_15->rcu_read_lock_nesting
        ...

which is pretty horrid. It uses a lot of stack, and has stupid useless
instructions.

Now, I can just mark that rcu_read_unlock_special() function as "noinline", and
as a result I get this:

__rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        movq    %gs:current_task,%rdi   #, t
        decl    256(%rdi)       # t_15->rcu_read_lock_nesting
        ...

which is obviously much superior code for the case that actually matters.

So rather than have to find all of these things manually (yes, those things do
actually show up on profiles - the stack expansion in particular ends up
showing up as extra costs), maybe gcc could just have a simple check:

 - if the call is in an unlikely section
 - .. and the callee is much bigger (in stack frame or whatever) than the
caller
 - .. simply don't inline

Hmm? I realize that this may sound specialized, but I've been looking at kernel
profiles for the last few weeks now, and this particular issue has come up in
two independent hotspots. It turns out that excessive stack use is *expensive*.
It's not just the normal "the kernel stack is limited", it's actually a matter
of "the L1 D$ is pretty small, and a big stack frame actually causes real
costs".

I really didn't expect register saves on the stack to show up in my profiles,
but they actually do. I expect the stack to basically always be in the L1 D$
and dirty (so that an access to it should be almost free), but that is only
true for a _dense_ stack. Not for leaf functions that then have extra stack
frame wasting code in them.

Comments?


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #8 from Linus Torvalds  2011-05-27 
15:32:21 UTC ---
(In reply to comment #7)
>
> BTW, the patch bootstrapped/regtested on both x86_64-linux and i686-linux, I'm
> just running second set of bootstrap without the patch to be able to compare
> how much the patch changed code on gcc itself.

I'd love to hear how well it works - I'm in the middle of a busy merge window
and about to leave for a trip to Japan, otherwise I'd just try to set up a gcc
build myself right now.

(Actually, I have a copy of a git tree, but that one fails with

  /usr/bin/ld: ../libiberty/libiberty.a(argv.o): relocation R_X86_64_32S
against `_sch_istable' can not be used when making a shared object; recompile
with -fPIC

and due to the merge window I don't really have time to try to figure it out)

Thanks,

 Linus


[Bug other/49194] Trivially stupid inlining decisions

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #6 from Linus Torvalds  2011-05-27 
16:38:22 UTC ---
(In reply to comment #3)
> 
> -finline-functions-called-once is throttled down by the large-function-growth
> and large-stack-frame-growth limits. The kernel case could probably be
> handled by the second. Does the kernel bump down those limits?

We used to play with inlining limits (gcc had some really bad decisions), but
the meaning of the numbers kept changing from one gcc version to another, and
the heuristics gcc used kept changing too. Which made it practically impossible
to use sanely - you could tweak it for one particular architecture, and one
particular version of gcc, but it would then be worse for others.

Quite frankly, with that kind of history, I'm not very eager to start playing
around with random gcc internal variables again.

So I'd much rather have gcc have good heuristics by default, possibly helped by
the kinds of obvious hints we can give ("unlikely()" in particular is something
we can add for things like this).

Obviously, we can (and do) use the "force the decision" with either "noinline"
or "__always_inline" (which are just the kernel macros to make the gcc
attribute syntax slightly more readable), but since I've been doing those other
bug reports about bad gcc code generation, I thought I'd point out this one
too.
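For reference, the kernel wrappers mentioned in this comment boil down to the
usual attribute/builtin spellings (simplified; the real kernel definitions
carry a few extra qualifiers):

  /* Simplified forms of the kernel macros referred to above. */
  #define unlikely(x)       __builtin_expect(!!(x), 0)
  #define noinline          __attribute__((noinline))
  #define __always_inline   inline __attribute__((always_inline))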

> It still won't help in case the function doesn't have any on-stack aggregates,
> since we optimistically assume that all gimple registers will disappear.
> Probably even that could be changed, though estimating reload's stack frame
> usage so early would be iffy.

Yes, early stack estimation might not work all that well.

In the kernel, we do end up having a few complex functions that we basically
expect to inline to almost nothing - simply because we end up depending on
compile-time constant issues (sometimes very explicitly, with
__builtin_constant_p() followed by a largish "switch ()" statement).

That said, this is something where the call-site really can make a big
difference. Not just the fact that the call site might be marked "unlikely()"
(again, that's just the kernel making __builtin_expect() readable), but things
like "none of the arguments are constants" could easily be a good heuristic to
use as a basis for whether to inline or not.

IOW, start out with whatever 'large-stack-frame-growth' and
'large-function-growth' values, but if the call-site is in an unlikely region,
cut those values in half (or whatever). And if none of the arguments are
constants, cut it in half again.

This is an example of why giving these limits as compiler options really
doesn't work: the choice should probably be much more dynamic than just a
single number.

I dunno. As mentioned, we can fix this problem by just marking things noinline
by hand. But I do think that there are fairly obvious cases where inlining
really isn't worth it, and gcc might as well just get those cases right.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #10 from Linus Torvalds  2011-05-27 
16:48:52 UTC ---
(In reply to comment #9)
> 
>                 32-bit before   32-bit after    64-bit before   64-bit after
> libstdc++.so.6  0x71708         0x716e8         0x67ee6         0x67ec6
> libgcj.so.12    0xa3ec58        0xa3eb98        0x90cd68        0x90cce8
> cc1plus         0xe8a29c        0xe8a25c        0xdccf98        0xdccf08

Ok, that's much less noticeable than I was hoping for.

That said, even for the kernel, the only reason I noticed this problem was not
because I've seen a lot of them per se, but because the pattern showed up in a
very hot function. In fact, it's the same __rcu_read_unlock() function that I
made the otherwise unrelated bugzilla entry for inlining: 

  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

so it's quite possible that we don't have that many of them in the kernel
either.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-29 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #13 from Linus Torvalds  2011-05-29 
18:56:44 UTC ---
Thanks guys.