Re: GCC 4.2.0 Status Report (2007-02-19)

2007-02-20 Thread Andi Kleen
"Kaveh R. GHAZI" <[EMAIL PROTECTED]> writes:
> 
> And we don't want to arm our detractors with bad SPEC numbers.  I can just
> imagine the FUD spreading... we've got to fix it or backout.

For me as a gcc user, miscompilations are far worse than bad SPEC numbers.
I never run SPEC, but if my programs are miscompiled I am in deep trouble.
I expect many other people feel similarly. Broken programs are infinitely
worse than slower programs.

If you really have any detractors(?), they will get much more meat out
of miscompilations than out of any SPEC numbers.

I find it amazing that this needs stating.

-Andi


Re: subreg pass

2007-03-06 Thread Andi Kleen
Roman Zippel <[EMAIL PROTECTED]> writes:


> The new subreg lowering pass seems to generate a bit worse code on m68k 
> than before, let's take simple example:

FWIW, I see the opposite on i386: I have a function that strangely ran
slower with -fomit-frame-pointer than without on 4.1. With a 4.3
snapshot, -fomit-frame-pointer is faster, as expected.

-Andi


Re: Google SoC Project Proposal: Better Uninitialized Warnings

2007-03-18 Thread Andi Kleen
"Manuel López-Ibáñez" <[EMAIL PROTECTED]> writes:

> This is the project proposal that I am planning to submit to Google
> Summer of Code 2007. It is based on previous work of Jeffrey Laws,
> Diego Novillo and others. I hope someone will find it interesting and
> perhaps would like to act as mentor. Feedback is very welcome (eve
> negative feedback!!). I have other ideas as well, so I would like to
> know if this is a bad project or I am stepping on someone's toes as
> soon as possible.

That would be very very useful. Spurious uninitialized warnings in newer 
gccs are plaguing the Linux kernel at least.

-Andi


Re: RFC: GIMPLE tuples. Design and implementation proposal

2007-04-10 Thread Andi Kleen
Andrew MacLeod <[EMAIL PROTECTED]> writes:
> 
> Do you mean implement this as a single linked list and then to find the
> previous instruction, start at the beginning of the block and traverse
> forward? Back in the early days of tree-ssa we did such a thing with the
> tree iterators. It was too slow. there are many LARGE basic blocks which
> make the computation too expensive for any pass that goes backwards
> through a block.

In theory you could use xor lists (xor next and prev together; if you know
one element you can recover the other). The problem is that garbage collectors
require special casing for them, and debuggers need special macros
to walk them.

-Andi


Re: RFC: GIMPLE tuples. Design and implementation proposal

2007-04-11 Thread Andi Kleen
Richard Henderson <[EMAIL PROTECTED]> writes:

> On Tue, Apr 10, 2007 at 11:13:44AM -0700, Ian Lance Taylor wrote:
> > The obvious way to make the proposed tuples position independent would
> > be to use array offsets rather than pointers.
> 
> I suggest instead, if we want something like this, that we make
> the references be pc-relative.  So something like

If you go this way (and require special GC/debugger support) you
could as well xor next/prev too and save another field.

Adding a xor is basically free and much cheaper than any cache miss
from larger data structures.

The only thing that wouldn't work: when you have a pointer
to an arbitrary element (without starting from the beginning/end first),
you couldn't get the previous or next element.

-Andi


Re: RFC: GIMPLE tuples. Design and implementation proposal

2007-04-11 Thread Andi Kleen
On Wed, Apr 11, 2007 at 04:55:02PM +0100, Dave Korn wrote:
> On 11 April 2007 12:53, Andi Kleen wrote:
> 
> > Richard Henderson <[EMAIL PROTECTED]> writes:
> > 
> >> On Tue, Apr 10, 2007 at 11:13:44AM -0700, Ian Lance Taylor wrote:
> >>> The obvious way to make the proposed tuples position independent would
> >>> be to use array offsets rather than pointers.
> >> 
> >> I suggest instead, if we want something like this, that we make
> >> the references be pc-relative.  So something like
> > 
> > If you go this way (and require special GC/debugger support) you
> > could as well xor next/prev too and save another field.
> > 
> > Adding a xor is basically free and much cheaper than any cache miss
> > from larger data structures.
> 
>   Using a delta is even better than an XOR, because it remains constant when
> you relocate the data.

You can xor deltas as well as pointers. The concepts are independent.

Delta alone doesn't save anything on 32bit -- and on 64bit only when you
accept < 2GB data (which might be reasonable for a basic block, but
probably not for a full function). 

xor always halves the space needed for links.

>  
> > The only thing that wouldn't work is that when you have a pointer
> > to an arbitary element (without starting from beginning/end first)
> > you couldn't get previous or next.
> 
>   You need a pointer to two consecutive nodes to act as an iterator.
> 
>   However we already discussed the whole idea upthread and the general feeling
> was that it's a whole bunch of tricky code for a small saving, so not worth

Is it tricky? You just need some abstracted iterators and a special case
in the GC. And some gdb macros.

I wrote a couple of xor list implementations in the past
and didn't find them particularly tricky, though that was without
any GC.

> doing in the first iteration.  Maybe as a future refinement.  See the earlier
> posts in this thread for the discussion.

When you need to go over all the code anyway, it would make
sense to do such changes in one go. Or at least move to abstracted
iterators early: if you do that, then changing the actual data structure
later is relatively easy. Both xor lists and deltas can be hidden with some
care.

-Andi



Re: RFC: #pragma optimization_level

2005-04-01 Thread Andi Kleen
Dale Johannesen <[EMAIL PROTECTED]> writes:

> I've currently got the job of implementing pragma(s) to change
> optimization level in the middle of a file.  This has come up a few
> times before,

Would it be possible to make it a function attribute? I would
really like to have

void __attribute__((optimization("Os"))) slow_init_function(void)
{
}

to let slow_init_function be compiled as space-efficiently as possible.

I think that would be very useful for the Linux kernel at least.
There is already an __init modifier for functions that ought to be
freed after initialization, and it would be nice to just add a one-liner
to make them smaller too.

In addition letting the compiler do that automatically based
on profile feedback for "cold" functions would be very cool.

-Andi


Re: Getting rid of -fno-unit-at-a-time [Was Re: RFC: Preserving order of functions and top-level asms via cgraph]

2005-04-12 Thread Andi Kleen
Daniel Jacobowitz <[EMAIL PROTECTED]> writes:

> On Mon, Apr 11, 2005 at 10:02:06AM -0700, Daniel Kegel wrote:
>> BTW, I hope -fno-unit-at-a-time doesn't go away until at least gcc-4.1.1
>> or so... I still lean on that crutch.
>
> A user!  Can you explain why?

The x86-64 2.4 linux kernel uses it too, because some code relies on 
the ordering between asm and several functions.
Other Linux ports might be similarly affected. 2.6 is fixed, but the
fix is a bit too big to merge into a stable-nearly-dead tree like 2.4.


-Andi


Re: Getting rid of -fno-unit-at-a-time [Was Re: RFC: Preserving order of functions and top-level asms via cgraph]

2005-04-13 Thread Andi Kleen
> Thanks.  I was under the impression that 2.4 doesn't build with GCC
> HEAD, anyway - I saw some patches pile up and not get applied.

AFAIK some gcc 4.0 patches went in. I assume it will build eventually
at least once 4.0 is released.

> Does 2.6 still use the option?

No.

However the change needed for that is really too big to backport, so 
I would prefer not to.

-Andi


Re: GCC 4.1: Buildable on GHz machines only?

2005-05-07 Thread Andi Kleen
Tom Tromey <[EMAIL PROTECTED]> writes:
>
> Splitting up libgcj.so probably makes sense even for the Linux distro
> case (the one I am most concerned with at the moment), just so that
> apps that don't use AWT or Swing don't really pay for it.  The

Hmm? Unless you initialize AWT/Swing in all programs, that code
should never be paged in for non-GUI programs. OK, in theory,
if you use a random build order then a lot of pages could
contain GUI and non-GUI code together, but that is probably
unlikely.

The only reason to split it out would be to allow a libgcj
installation that is not dependent on the X11 libraries on the
RPM/dpkg/etc. level for small setups, but I am not sure how useful
that is anymore for most distributions.

And perhaps to make the linking steps in the gcc build a bit
faster, but just for that it seems like a lot of work.

> overhead of, say, BC-compiling AWT and putting it in a separate .so
> doesn't seem unreasonable, based on our experiences with BC compiling
> large apps.  (Even this situation has its losers, for instance CNI
> users accessing AWT, like the xlib peers.  It looks impossible to
> please everybody 100%.)
>
> This split would help Windows users, too, I think, as at least AWT and
> Swing are fairly useless on that platform (no native peers; you can
> run the Gtk peers but that is weird).
>
> For embedded platforms, I would not mind seeing patches to allow
> completely disabling parts of the build based on configure switches.
> E.g., a switch to drop AWT entirely would be fine by me.  Someone

Shouldn't the linker do that already at runtime? OK, you pay the cost
of building it when gcc is built and of storing the libraries on disk,
but in the final executable everything should only be linked in when actually
referenced.

I know it can take quite some effort to make libraries really friendly
to static linking, so that they don't pull in other code unintentionally
(take a look at nm of a static glibc program to see how not to do it),
but I would hope at least that the GUI and non-GUI parts
of libgcj are clearly separated.

-Andi


Re: GNU C++ 4.0.1/4.1.0 cache misses on MICO sources.

2005-05-17 Thread Andi Kleen
Karel Gardas <[EMAIL PROTECTED]> writes:

> I've thought that L1 and L2 DTLB misses are the most important for the
> overall performance or performance degradation, if not please correct
> me since this is my first attempt to measure and interpret such data.

TLB is just for caching the translations from virtual to physical
addresses. Normally the data/instruction cache misses are more
important. There are a few TLB intensive workloads too, but they tend
to use much more memory than gcc normally does.

So I think you should rather use ICACHE_MISSES and 
DATA_CACHE_REFILLS_FROM_SYSTEM,
which measure the "real" L2 caches.

And perhaps run a normal instruction profile (CPU_CLK_UNHALTED) in parallel, and
double-check that the hot spots displayed by the others match the real
time hogs. Note you can use up to three performance counters at the same time.

-Andi


Re: Visual C++ style inline asms

2005-06-14 Thread Andi Kleen
Mike Stump <[EMAIL PROTECTED]> writes:

> Any objections to adding Visual C++ style inline asms?

Doesn't that need support for parsing assembly (essentially
a builtin assembler)? How else would the compiler
know which registers are clobbered and where to put input/output
variables?

All compilers I've seen that supported this had builtin
assemblers.

-Andi


Re: GCC-4.1.0 size optimization bug for MIPS architecture...

2005-06-29 Thread Andi Kleen
"Steven J. Hill" <[EMAIL PROTECTED]> writes:

> I have discovered what appears to be an optimization bug with '-Os'
> in GCC-4.1.0 for the MIPS architecture. It appears that functions
> which are declared as 'inline' are being ignored and instead turned
> into to function calls which is breaking the dynamic linker loader
> for uClibc on MIPS.

You should mark the functions that absolutely need to be inlined
with __attribute__((always_inline)). Or just use
-Dinline="__attribute__((always_inline))".

-Andi


Re: 4.2 Project: "@file" support

2005-08-25 Thread Andi Kleen
Mark Mitchell <[EMAIL PROTECTED]> writes:
> 
> I'm not sure what you're saying.  The limitation on command-line
> length can be in either the shell, or the OS.  In Windows 2000, the
> limitation comes primarily from the Windows command shell.

Linux has a similar limit which comes from the OS (normally around 32k),
so it would be useful there for extreme cases too.

-Andi


Re: 4.2 Project: "@file" support

2005-08-25 Thread Andi Kleen
On Thursday 25 August 2005 12:45, Jan-Benedict Glaw wrote:

> Linux uses 32 pages, which results in 128KB on 4K architecture (eg.
> i386, m68k, sparc32, ...) or 256KB on 8K architectures (eg. Alpha,
> UltraSparc, ...).

Yes, you're right. Somehow I only remembered the number 32.
Anyway, it's still too small. I regularly run into the limit, although
normally not with gcc.

-Andi


Re: [cft] aligning main's stack frame

2005-10-17 Thread Andi Kleen
Richard Henderson <[EMAIL PROTECTED]> writes:

> main:
> leal    4(%esp), %ecx    # create argument pointer
> andl    $-16, %esp       # align stack
> pushl   -4(%ecx)         # copy return address

This will misalign the call/ret stack in the CPU, leading to branch
mispredictions on many of the following RETs. For main it's probably
not a big issue, but for other functions it might be.

-Andi


Re: GCC-generated code and i386 condition codes behavior

2005-11-04 Thread Andi Kleen
Robert Dewar <[EMAIL PROTECTED]> writes:
> 
> I must say I am a bit surprised that gcc never takes advantage
> of the fact that inc and dec do not destroy the carry flag, this
> is quite significant for some loops.

A lot of this old wisdom is no longer true.

inc and dec are generally to be avoided these days, because their
partial update of EFLAGS causes overly strict dependencies on Intel P4
cores and therefore slower execution. It's better to use add/sub
$1,... instead, except when optimizing for code size.

-Andi


Re: Mapping C code to generated asm code

2006-01-11 Thread Andi Kleen
Ron McCall <[EMAIL PROTECTED]> writes:

> On Wed, Jan 11, 2006 at 12:03:42PM -0600, Perry Smith wrote:
> > Is there a way to get some type of debugging output that tells me  
> > what line of C code produced what lines of asm code?
> 
> How about $TARGET-objdump --disassemble --source?

That's been totally useless since -funit-at-a-time.

-Andi


Re: Builtin expansion versus headers optimization: Reductions

2015-06-04 Thread Andi Kleen
Ondřej Bílka  writes:

> As I commented on libc-alpha list that for string functions a header
> expansion is better than builtins from maintainance perspective and also
> that a header is lot easier to write and review than doing that in gcc
> Jeff said that it belongs to gcc. When I asked about benefits he
> couldn't say any so I ask again where are benefits of that?

It definitely has disadvantages in headers. Just today I was prevented
from setting a conditional breakpoint on strcmp in gdb, because
it expanded to a macro including __extension__, which gdb doesn't
understand.

The compiler has much more information than the headers.

- It can do alias analysis, so as to avoid needing to handle overlap
and similar cases.
- It can (sometimes) determine alignment, which is important
information for tuning.
- With profile feedback it can use value histograms to determine the
best code.

It may not use all of this today, but it could.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Andi Kleen
Ondřej Bílka  writes:
>
> On ivy bridge I got that Using rep stosq for memset(x,0,4096) is 20%
> slower than libcall for L1 cache resident data while 50% faster for data
> outside cache. How do you teach compiler that?

It would be in theory possible with autofdo. Profile with a cache miss
event. Correlate. Maintain the information in addition to the basic
block frequencies.

Probably not simple, but definitely possible.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC] Kernel livepatching support in GCC

2015-06-09 Thread Andi Kleen
> > As I am bit concerned with performance why require nops there? Add a
> > byte count number >= requested thats boundary of next instruction. When
> > lifepatching for return you need to copy this followed by jump back to next
> > instruction. Then gcc could fill that with instructions that don't
> > depend on address, fill with nops as trivial first implementation.
> > 
> > Would that be possible?
> 
> So instead of placing NOPs to be overwritten you intend to simply overwrite 
> the existing code after
> making a backup of it? 

This is how Linux k/uprobes work. But it only works for a subset of
instructions, and it is fairly complicated because you need a complete
decoder that is able to adjust any program-counter-relative data offsets.
Having a patch area is far easier and more reliable.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: Moving to git

2015-08-26 Thread Andi Kleen
Jason Merrill  writes:
>
> You don't even need to worry about the hash code, you can use the
> timestamp by itself.  Given the timestamp,
>
>   git log -1 --until 1440153969

Consider tree merges. There's no guarantee a time stamp maps to
monotonically increasing commit numbers.

But I don't really understand the problem. Just run git log
and generate your own list of numbers. Should be straightforward
to script.
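A sketch of the kind of script meant here (my own illustration): numbering commits along first-parent order stays monotonic even across merges, which avoids the timestamp problem above.

```shell
# Sketch: generate stable, monotonically increasing commit numbers by
# walking first-parent history oldest-first.  A throwaway repo with two
# empty commits stands in for the real tree.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m first
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m second
git log --first-parent --reverse --format='%H' | nl -ba > commit-numbers.txt
head -1 commit-numbers.txt    # line 1 pairs number 1 with the oldest commit
```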

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: Thread-safety of a profiled binary (and GCOV runtime library)

2016-07-25 Thread Andi Kleen
Martin Liška  writes:
>
> I'm also surprised about it :) Let's start without invention of a new flag, 
> I'll work on that.

You definitely need a new flag: atomic or per-thread instrumentation
will almost certainly have significant overhead (either at run time
or in memory). Just making an existing facility a lot slower
without a way around it is not a good idea.

BTW, IIRC there were patches from Google on this a few years back.
They may still be in some branch.

-Andi


Re: question about -ffast-math implementation

2014-06-02 Thread Andi Kleen
Mike Izbicki  writes:

> Right, but I've never taken a look at the gcc codebase.  Where would I
> start looking for the relevant files?  Is there a general introduction
> to the codebase anywhere that I should start with?

grep for all the flags set in the two functions below (from gcc/opts.c),
without the x_ prefix, to catch all uses:

/* The following routines are useful in setting all the flags that
   -ffast-math and -fno-fast-math imply.  */
static void
set_fast_math_flags (struct gcc_options *opts, int set)
{
  if (!opts->frontend_set_flag_unsafe_math_optimizations)
{
  opts->x_flag_unsafe_math_optimizations = set;
  set_unsafe_math_optimizations_flags (opts, set);
}
  if (!opts->frontend_set_flag_finite_math_only)
opts->x_flag_finite_math_only = set;
  if (!opts->frontend_set_flag_errno_math)
opts->x_flag_errno_math = !set;
  if (set)
{
  if (!opts->frontend_set_flag_signaling_nans)
opts->x_flag_signaling_nans = 0;
  if (!opts->frontend_set_flag_rounding_math)
opts->x_flag_rounding_math = 0;
  if (!opts->frontend_set_flag_cx_limited_range)
opts->x_flag_cx_limited_range = 1;
}
}

/* When -funsafe-math-optimizations is set the following
   flags are set as well.  */
static void
set_unsafe_math_optimizations_flags (struct gcc_options *opts, int set)
{
  if (!opts->frontend_set_flag_trapping_math)
opts->x_flag_trapping_math = !set;
  if (!opts->frontend_set_flag_signed_zeros)
opts->x_flag_signed_zeros = !set;
  if (!opts->frontend_set_flag_associative_math)
opts->x_flag_associative_math = set;
  if (!opts->frontend_set_flag_reciprocal_math)
opts->x_flag_reciprocal_math = set;
}

-- 
a...@linux.intel.com -- Speaking for myself only


Re: GCC version bikeshedding

2014-07-20 Thread Andi Kleen
Paulo Matos  writes:
>
> That's what I understood as well. Someone mentioned to leave the patch
> level number to the distros to use which sounded like a good idea.

Sounds like a bad idea, as then there would be non-unique gcc versions:
Red Hat gcc 5.0.2 potentially being completely different from SUSE gcc
5.0.2.

-Andi



Re: Skipping assembler when producing slim LTO files

2014-09-24 Thread Andi Kleen
Jan Hubicka  writes:

Nice patch.

> The implementation is pretty straighforward except for -fbypass-asm requiring
> one existing OBJ file to fetch target's file attributes from.  This is
> definitly not optimal, but libiberty currently can't build output files from
> scratch. As Ian suggested, I plan to simply arrange the driver to pass 
> crtbegin
> around at least to start with. We may want to bypass this later and storing
> proper attributes into the binary.

I wonder how hard it would be to fix simple-object to be able to create
files from scratch. From a quick look, it would mostly be adding the right
values into the header? That would need some defines per target.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Autotuning parameters/heuristics within gcc - best place to start?

2014-09-26 Thread Andi Kleen
Robert Stevenson <9goli...@gmail.com> writes:

> I am planning to begin a project to explore the space of tuning gcc
> internals in an effort to increase performance. I am wondering if
> anyone can point to me any parameterizations, heuristics, or
> priorities functions within gcc that can be tuned/adjusted. Some
> general areas I am considering include register allocation and data
> prefetching.

gcc has lots of params (see gcc -v --help=params) for different
areas that can be tuned. If you need more, there's a generic
parameter framework for adding them. And of course there are lots
of command line flags (see --help)

One example of an existing autotuner is the gccflags tuner in opentuner.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


graphite in -O3

2014-11-16 Thread Andi Kleen

Is there any specific reason why none of the graphite loop optimizations
(loop-block, loop-interchange, loop-strip-mine, loop-jam)
are enabled with -O3 or -Ofast?

I assume doing so would make them much more widely used.

Perhaps this would be something to consider for 5.0?

-Andi



Re: Missed optimization case

2014-12-23 Thread Andi Kleen
Matt Godbolt  writes:
>
> I'll happily file a bug if necessary but I'm not clear in what phase
> the optimization opportunity has been missed.

Please file a bug with a test case. No need to worry about the phase
too much initially, just fill in a reasonable component.

-Andi


Re: Why not implementation of interrupt attribute on IA32/x86-64

2015-03-13 Thread Andi Kleen
Didier Garcin  writes:

> many OS hobbyist developpers would be pleased GCC implements the
> interrupt or interrupt_handler attribute for Intel architecture.
>
> Would it be so difficult to implement for this architecture ?

There are lots of different ways to implement interrupts on x86
(e.g. what state to save, what registers to set up). It would
be unlikely that gcc would select a subset that worked for most
people.

You're better off with an assembler wrapper that does exactly the
setup you want.

-Andi


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Andi Kleen
Ilya Palachev  writes:
>
> But why create_gcov does not inform about that (no branch events were
> found)? It creates empty gcov file and says nothing :(
>
> Moreover, in the mentioned README it is said that perf should also be
> executed with option -e BR_INST_RETIRED:TAKEN.

Standard perf doesn't have a full event list;
this assumes a perf patched with the libpfm patch.

Also, I suspect it really wants to use PEBS events, so :pp should be added.

Alternatively you can use ocperf (from
http://github.com/andikleen/pmu-tools) which is just a wrapper:

ocperf.py record -e br_inst_retired.near_taken:pp -b ... 

or specify the event manually (depending on your CPU, like)

perf record -e
cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=49/pp
-b ...

BTW the biggest problem with autofdo currently is that it is
quite bitrotten and supports only a several-years-old perf.
So all of the above will only work with old distributions,
unless you compile an old perf utility first.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Andi Kleen
> > BTW the biggest problem with autofdo currently is that it is
> > quite bitrotten and supports only several years old perf.
> > So all of this above will only work with old distributions,
> > unless you compile an old perf utility first.
> 
> Do you mean newer perf does not support LBR (-b) any more?

No.

perf extended its perf.data output format, and quipper cannot parse
any of the extensions, so it just bombs out with assertion
failures.

I have a patch to hack around some of this, but still
couldn't get it actually to work so far.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Andi Kleen
On Tue, Apr 21, 2015 at 10:27:49AM -0700, Dehao Chen wrote:
> In that case, we should get quipper fixed upstream to accommodate new
> format. (Maybe they already fixed it, I will do a batch sync to make
> quipper up-to-date).

From a quick look at

http://git.chromium.org/gitweb/?p=chromiumos/platform/chromiumos-wide-profiling.git;a=summary

(I assume that is what you mean by upstream)

it hasn't been updated. It is still stuck in 2013.

I'm attaching what patches I have so far.

-Andi


autofdo-newer-perf-0.tgz
Description: application/gtar-compressed


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Andi Kleen
On Tue, Apr 21, 2015 at 01:52:18PM -0700, Dehao Chen wrote:
> Andi,
> 
> Thanks for the patches. Turns out that the first 3 patches are already
> in, the correct upstream quipper repository is:
> 
> https://chromium.googlesource.com/chromiumos/platform2/+/master/chromiumos-wide-profiling/
> 
> The last 3 patches seem to be local hacks. Do you want any of them in?
> 
> I just did a batch sync with quipper head. Please let me know if this
> solves the perf problem.

Still outdated:

F0421 20:13:16.221422 22297 perf_reader.cc:1614] Check failed: attr_size <= 
sizeof(perf_event_attr) (104 vs. 96) 

-Andi


Re: AutoFDO profile toolchain is open-sourced

2015-04-21 Thread Andi Kleen
On Wed, Apr 22, 2015 at 05:15:47AM +0200, Andi Kleen wrote:
> On Tue, Apr 21, 2015 at 01:52:18PM -0700, Dehao Chen wrote:
> > Andi,
> > 
> > Thanks for the patches. Turns out that the first 3 patches are already
> > in, the correct upstream quipper repository is:
> > 
> > https://chromium.googlesource.com/chromiumos/platform2/+/master/chromiumos-wide-profiling/
> > 
> > The last 3 patches seem to be local hacks. Do you want any of them in?
> > 
> > I just did a batch sync with quipper head. Please let me know if this
> > solves the perf problem.
> 
> Still outdated:
> 
> F0421 20:13:16.221422 22297 perf_reader.cc:1614] Check failed: attr_size <= 
> sizeof(perf_event_attr) (104 vs. 96) 

It converts with the attached patches, but there's still some problem
parsing the data:

% ./create_gcov  -binary loop -gcov_version 1 -gcov loop.gcda -gcov_version 
0x500e
% gcc50 -O2 -fprofile-use loop.c 
loop.c:1:0: warning: '/home/andi/src/autofdo/loop.gcda' is version ',
expected version '500e'
% 

-Andi



autofdo-patches-2.tgz
Description: application/gtar-compressed


Re: [RFC] Kernel livepatching support in GCC

2015-05-28 Thread Andi Kleen
> Our proposal is that instead of adding -mfentry/-mnop-count/-mrecord-mcount 
> options to other architectures, we should implement a target-independent 
> option -fprolog-pad=N, which will generate a pad of N nops at the beginning 
> of each function and add a section entry describing the pad similar to 
> -mrecord-mcount [1].

Sounds fine to me.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC] Kernel livepatching support in GCC

2015-06-04 Thread Andi Kleen
> Rather than just a sequence of NOP's, should the first NOP be a
> unconditional branch to the beginning of the real function?  I don't
> know if this applies to AArch64 cpus, but I believe some cpus can handle
> such branches already in the decode unit, thus avoiding any extra cycles
> for skipping the NOPs.

nops are very cheap. Typically they are already discarded in the frontend.
It's unlikely all of this is worth it.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Proposal to add FDO profile quality related diagnostics

2018-11-20 Thread Andi Kleen
Indu Bhagat  writes:

> Proposal to add diagnostics to know which functions were not run in the
> training run in FDO.

Don't you think the warning will be very noisy? I assume most programs
have a lot of cold error-handling functions etc. that are never executed
in a normal run.

How does it look for a gcc build, for example?

I suspect it would need some heuristics to cut it down at least: e.g. if
the function ends with exit (perhaps propagated through the call tree),
it's not flagged. Or if there is a warning for a function, don't warn
about callees that are not called from somewhere else.

-Andi


Re: Proposal to add FDO profile quality related diagnostics

2018-11-27 Thread Andi Kleen
> 
>  Regarding the function level detail being too noisy : I sort of agree with 
> that
>  comment. But I am of the opinion that I would rather leave it to the user to
>  infer the profile quality as per the application characteristics.

Makes sense, I guess. But I would keep the drill-down as opt-info;
opt-info is better designed for really high-volume data like this.

-Andi


Re: Source code coverage of gcc

2018-12-06 Thread Andi Kleen
sameeran joshi  writes:

> Hi,
> I have a random C program as a test case, for which I need to do
> source code coverage on gcc.
> I have used the gcov tool and further the lcov tool. The percentage of
> source code coverage which I get after using gcov, Is that the final %
> which I need to do gcc source code coverage?
>
> What does it mean to build gcc with -pg option, how does that help in
> source code coverage?

lcov/gcov can only work if the code is instrumented.
So this requires building gcc with BOOT_CFLAGS="-O2 -g -pg"
(see the chapter on building gcc in the manual for more information)

You would then run the instrumented gcc, which writes coverage
data to disk. Then lcov can be used to get a summary of coverage.

-Andi


Re: Testing compiler reliability using Csmith

2018-12-06 Thread Andi Kleen
Radu Ometita  writes:

> Hello everyone!
>
> We are working on writing a paper about testing the reliability of C 
> compilers by using Csmith (a random C99 program generator).
>
> A previous testing effort, using Csmith, found 79 GCC bugs, and 25 of
> those have been marked by developers as P1
> (https://www.flux.utah.edu/download?uid=114). However, after this
> paper was published we are unaware of any further testing using
> Csmith, and we would like to ask you, if you are aware of any such
> efforts or further results.

Sameeran has been doing some additional testing with a modified csmith.
There's currently no systematic effort to run csmith regularly
to find regressions.

-Andi


Re: Support for AVX512 ternary logic instruction

2019-01-20 Thread Andi Kleen
Wojciech Muła  writes:
>
> The main concern is if it's a proper approach? Seems that to match
> other logic functions, like "a & b | c", a separate pattern is required.
> Since an argument can be either negated or not, and we can use three 
> logic ops (or, and, xor) there would be 72 patterns. So maybe a new
> optimization pass would be easier to create and maintain? (Just a silly
> guess.)

Yes that's not scalable.
>
> I'd be grateful for any comments and advice.

Maybe you could write it in the simplifier pattern language
and then generate a suitable builtin.

See https://gcc.gnu.org/onlinedocs/gccint/Match-and-Simplify.html

However, the problem is that this may affect other optimizations,
because it happens too early. E.g. the compiler would also need
to learn to constant-propagate the new builtin and understand
its side effects, which might affect a lot of places.

So a custom compiler patch that runs late may be better.
Or perhaps some extension of the simplifier that does it.

I looked at this at some point for PCMP*STR* which are similarly
powerful instructions that could potentially replace a lot of
others.

-Andi


Re: [GSoC 2019] [extending Csmith for fuzzing OpenMp extensions]

2019-03-23 Thread Andi Kleen
On Sat, Mar 23, 2019 at 11:49:11PM +0530, sameeran joshi wrote:
> 1) check_structured_block_conditions()
>   checks for the conditions related to a structured block
>   1.no returns in block

returns should be allowed inside statement expressions.

>   2.no gotos
>   3.no breaks
> and accordingly labels the blocks as structured block, for example
>   for (){
>   //unstructured
> //block 1
> if(){
>   //unstructured
> //block 2
> if(){
> //block 3
>   //structured
>1.no gotos
>   2.no breaks
>   3.no return
>   4.Do I need to check continue as well?

Yes check for continue too.

> This applies mostly when the break,goto,return statements have some
> probability of generation.
> Another workaround I think(which would increase the generation of more
> OpenMP constructs)is to make probabilities of above statements to '0'

Sure, of course only within the structured expressions.

> For the following code:
> struct S1 {
> int f0;
> };
> global variable:
> static struct S1 g_371 = {0x54L};
> 
> void main ( ){
>   #pragma omp parallel for
>   for (g_371.f0 = (-3); (g_371.f0 >= (-27)) ; g_371.f0 =
> safe_sub_func_int32_t_s_s(g_371.f0, 2))
>   {/* block id: 1 */
>   structured block
>   }
> }
> I have following 3 questions in regards to usage of openmp.
> 
> 0.Can't I use a '(test)' a 'bracket' for a 'test' expression?
> error:invalid controlling predicate
> If I try removing the brackets the program works fine.


You mean it errors for
 for (g_371.f0 = (-3); (g_371.f0 >= (-27)) ; g_371.f0 = 
safe_sub_func_int32_t_s_s(g_371.f0, 2))

and works for

 for (g_371.f0 = (-3); g_371.f0 >= (-27) ; g_371.f0 = 
safe_sub_func_int32_t_s_s(g_371.f0, 2))

?

If yes, that would seem like a gcc bug. I would file it in bugzilla with a simple 
test case.


> 
> 
> 1.Can I use a struct field variable for initialization?:
> 
> Whereas the 5.0 specification states:
> var operator lb
> var := variable of signed or unsigned type | variable of pointer type
> which obeys the specification rules.
> 
> error:expected iteration declaration/initialization before 'g_371'

Ok I guess it's not supported, so you can't.

> 
> 
> 2.Consider a case where I need to increment the variable:
> 
> Considering the grammar of Csmith, in increment expression I need to
> use function safe_sub_func_int32_t_s_s(g_371.f0, 2)
> the function decrements the variable(same as g_371.f0 = g_371.f0-1 )
> but it does some checks for the bounds, which are essential for
> checking the undefined conditions.

Maybe you could make the index unsigned only; then ++ would work.
We definitely need increment expressions to be useful.

And perhaps have a command-line option to allow unchecked signed increment
for this.
> 
> What do you think about adding command lines so as to disable
> generation of such increment expressions so it follows spec 5.0

We need them, otherwise too much useful OpenMP won't be generated.

-Andi


Re: [GSoC 2019] [extending Csmith for fuzzing OpenMp extensions]

2019-03-25 Thread Andi Kleen
sameeran joshi  writes:

> On 3/24/19, Andi Kleen  wrote:
>> On Sat, Mar 23, 2019 at 11:49:11PM +0530, sameeran joshi wrote:
>>> 1) check_structured_block_conditions()
>>> checks for the conditions related to a structured block
>>> 1.no returns in block
>>
>> returns should be allowed inside statement expressions.
> If I am correct, we would create a new branch  "COMPsmith"(C OpenMP
> smith)(name will this work?)  from the master of Csmith and work on
> it, statement expression are in the "gcc c extensions" branch I guess
> which would be a separate branch.
>
> So it shouldn't allow return as well, right?

Yes
>> Yes check for continue too.
> referring this: http://forum.openmp.org/forum/viewtopic.php?f=3&t=458
> continue can be used but with some restriction as:
>
> "innermost associated loop may be curtailed by a continue statement "

Ah, you're right. Better not to generate continue then.

>> If yes that would seem like a gcc bug. I would file in bugzilla with a
>> simple test case.
>>
> Did you file it? can you please send me the bug id?

I didn't. Can you show some simple example that errors first?

Perhaps Jakub or some other OpenMP expert can also chime in.

>>
>>>
>>>
>>> 1.Can I use a struct field variable for initialization?:
>>>
>>> Whereas the 5.0 specification states:
>>> var operator lb
>>> var := variable of signed or unsigned type | variable of pointer type
>>> which obeys the specification rules.
>>>
>>> error:expected iteration declaration/initialization before 'g_371'
>>
>> Ok I guess it's not supported, so you can't.
> Any specific reason for not supporting struct variables as
> initialization, as the spec. doesn't impose any restriction

Yes, it seems like a gcc restriction. I would file a bug (feature
request) for that one. That's good: these are the kind of things
the OpenMP fuzzing project should be finding.


-Andi


Re: [GSoC 2019] [extending Csmith for fuzzing OpenMp extensions]

2019-03-26 Thread Andi Kleen
> That is a correct diagnostic.
> 
> See Canonical loop form.
> 
> test-expr One of the following:
>   var relational-op b
>   b relational-op var
> 
> ( var relational-op b )
> is neither of those.

Still, it seems strange to fail over semantically meaningless parentheses.

But ok.

-Andi


Re: Can LTO minor version be updated in backward compatible way ?

2019-07-17 Thread Andi Kleen
Romain Geissler  writes:
>
> I have no idea of the LTO format and if indeed it can easily be updated
> in a backward compatible way. But I would say it would be nice if it
> could, and would allow adoption for projects spread on many teams
> depending on each others and unable to re-build everything at each
> toolchain update.

Right now any change to a compiler option can break the LTO format
in subtle ways. In fact, even the minor version bumps that are currently
done are not frequent enough to catch all such cases.

So it's unlikely to really work.

-Andi



Re: Does gcc automatically lower optimization level for very large routines?

2019-12-31 Thread Andi Kleen
Qing Zhao  writes:

> (gdb) where 
> #0  0x00ddbcb3 in df_chain_create (src=0x631006480f08, 
> dst=0x63100f306288) at ../../gcc-8.2.1-20180905/gcc/df-problems.c:2267 
> #1  0x001a in df_chain_create_bb_process_use ( 
> local_rd=0x7ffc109bfaf0, use=0x63100f306288, top_flag=0) 
> at ../../gcc-8.2.1-20180905/gcc/df-problems.c:2441 
> #2  0x00dde5a7 in df_chain_create_bb (bb_index=16413) 
> at ../../gcc-8.2.1-20180905/gcc/df-problems.c:2490 
> #3  0x00ddeaa9 in df_chain_finalize (all_blocks=0x63100097ac28) 
> at ../../gcc-8.2.1-20180905/gcc/df-problems.c:2519 
> #4  0x00dbe95e in df_analyze_problem (dflow=0x60600027f740, 
> blocks_to_consider=0x63100097ac28, postorder=0x7f23761f1800, 
> n_blocks=40768) at ../../gcc-8.2.1-20180905/gcc/df-core.c:1179 
> #5  0x00dbedac in df_analyze_1 () 

AFAIK df_analyze is used even at -O1 by various passes.  Probably the bulk of
the memory consumption is somewhere else, and it just happens to hit
there by chance.

Would be useful to figure out in more details where the memory
consumption goes in your test case.

Unfortunately gcc doesn't have a good general heap profiler,
but on Linux I usually use the following: whoever causes the most page
faults likely allocates the most memory.

perf record --call-graph dwarf -e page-faults gcc ...
perf report --no-children --percent-limit 5 --stdio > file.txt

and post file.txt into a bug in bugzilla.

-Andi


Re: comparing parallel test runs

2017-05-17 Thread Andi Kleen
Marek Polacek  writes:

> On Wed, May 17, 2017 at 09:13:40AM -0600, Jeff Law wrote:
>> On 05/17/2017 04:23 AM, Aldy Hernandez wrote:
>> > Hi folks.
>> > 
>> > I've been having troubles comparing the results of different test runs
>> > for quite some time, and have finally decided to whine about it. Perhaps
>> > someone can point out to whatever I may be doing wrong.
>> > 
>> > I generally do "make check -k -j60" on two different trees and compare
>> Make sure you've got Andi's patch installed and report back.  It's
>> supposed to help with smaller -j loads, but it's unclear if it's enough
>> to address the problems with higher loads like you're using.
>
> I'm still seeing spurious tree-prof failures there (with -j48).

Do they go away if you first run (as root)

echo 5000 > /proc/sys/kernel/perf_event_mlock_kb

?

-Andi


Re: Quantitative analysis of -Os vs -O3

2017-08-27 Thread Andi Kleen
Allan Sandfeld Jensen  writes:
>
> Yeah. That is just more problematic in practice. Though I do believe we have 
> support for it. It is good to know it will automatically upgrade 
> optimizations 
> like that. I just wish there was a way to distribute pre-generated arch-
> independent training data.

autofdo supports that in principle (but probably would need some
improvements in the tools  to make it really easily usable, especially
with shared libraries)

-Andi


Re: Byte swapping support

2017-09-13 Thread Andi Kleen
Jürg Billeter  writes:
>
> I don't. The idea is to reverse scalar storage order for the whole
> userspace process and then add byte swapping to the Linux kernel when
> accessing userspace memory. This keeps userspace memory consistent
> with regards to endianness, which should lead to high compatibility
> with big-endian applications. Userspace memory access from the kernel
> always uses a small set of helper functions, which should make it
> easier to insert byte swapping at appropriate places.

I expect you'll find that it isn't that easy. There are a lot of opaque
copies that copy whole structures.

You'll need a whole compat layer, similar to 32<->64bit compat layers,
but actually doing more work because it has to handle all fields larger
than one byte, not just fields which differ in size.

-Andi


Re: Google Summer of Code 2018: Call for mentors and ideas

2018-01-18 Thread Andi Kleen
Martin Jambor  writes:
>
> Therefore I would like to ask all seasoned GCC contributors who would
> like to mentor a GSoC student to send a reply to this thread with their
> idea for a project.  If you have an idea but you do not want to be a
> mentor then I will consider it only if it is really interesting, really
> specific (e.g. improving -O2 -g *somehow* is not specific) and I would
> have to be reasonably confident I'd find a good mentor for it.  So far I
> have the following ideas from the IRC discussion:

Here's an idea:

fuzzers like csmith are fairly good at finding compiler bugs.  But they
only generate standard C, but no extensions. gcc has many extensions,
which are not covered. It would be good to extend a fuzzer like csmith
to fuzz extensions like OpenMP, __attributes__, vector
extensions, etc. Then run the fuzzer and report
compiler bugs.

I'm not a seasoned gcc contributor, but would be willing to mentor
such a project.

-Andi



Re: gcc generated memcpy calls symbol version

2018-01-29 Thread Andi Kleen
Tom Mason  writes:

> Is there any way for me to force the version for these symbols aswell?

It seems pointless because the ABI for these symbols will never change.

-Andi


Re: GSoC

2018-02-26 Thread Andi Kleen
> If the scope of GCC still intimidates you (but we all struggle with it
> sometimes, trust me), consider reaching out to Andi Kleen and discuss
> his fuzzer project idea with him (he may tell you what to check out and
> experiment with to get the feeling about the task at hand).

The fuzzer project is to extend an existing C fuzzer (such as csmith
or yarpgen) with gcc language extensions and use that to test the compiler.

Tasks:
- evaluate existing fuzzer sources for suitable base
- study gcc documentation and other references for suitable language
extensions to implement (e.g. openmp or transactions or vector extensions)
- implement a suitable subset of extensions in one of the fuzzers
- validate that the fuzzer output is valid code (that's the hard part)
- run fuzzer against compiler and report bugs

-Andi


Re: GCC GSOC Participation

2018-03-07 Thread Andi Kleen
On Wed, Mar 07, 2018 at 03:52:15AM +0530, Prathamesh Kulkarni wrote:
> On 3 March 2018 at 16:22, Prateek Kalra  wrote:
> > Hello GCC Community,
> > My name is Prateek Kalra.I am pursuing integrated dual
> > degree(B.tech+M.tech) in Computer Science Software Engineering,from Gautam
> > Buddha University,Greater Noida.I am currently in 8th semester of the
> > programme.
> > I have experience in competitive programming with C++.Here's my linkedin
> > profile:
> > https://www.linkedin.com/in/prateek-kalra-6a40bab3/.
> > I am interested in GSOC project "Implement a fuzzer leveraging GCC
> > extensions".
> > I had opted compiler design as one of the course subjects in the previous
> > semester and was able to secure an 'A' grade at the end of the semester.
> > I have theoretical knowledge of fuzz testing and csmith,that how the random
> > C programs are generated to check the compiler bugs and I am very keen to
> > work under this project.
> > I request you to guide me to progress through the process.I would really
> > appreciate if you could mentor me with the further research of this project
> > idea.

Hi Prateek,

Further research on the project:

- Look at the gcc language extensions in the gcc documentation. Select some
(can be one or a combination of multiple)
of suitable complexity and get familiar with the concepts. Examples are
OpenMP, transactions, vector extensions. For some of them the gcc documentation
is enough, for others (like OpenMP or transactions) you'll need to
download the external specification. Look at the specification
and get familiar with it. Likely you'll also need to do some research
in the underlying concepts (e.g. parallelism for OpenMP or vectorization
for the vector APIs)

- Look at csmith or yarpgen and get familiar with the code

Then you can make a choice which fuzzer you want to use as a base
and make a proposal which gcc extensions you would want to target
with the project.

-Andi


Re: GCC GSOC Participation

2018-03-07 Thread Andi Kleen
> I would suggest that you start with reading through Andi's email to
> another student who expressed interest in that project which you can
> find at: https://gcc.gnu.org/ml/gcc/2018-02/msg00216.html
> 
> Andi, do you have any further suggestions what Prateek should check-out,
> perhaps build, examine and experiment with in order to come up with a
> nice proposal? 

See other mail.

> Do you personally prefer starting with any particular
> existing fuzzer, for example?

Yes, you should start with an existing fuzzer and just add the extensions.
Otherwise too much time will be spent on the base language functionality.
I would suggest either csmith or yarpgen.

-Andi


Re: Fuzzer extension for gcc

2018-06-10 Thread Andi Kleen
On Sun, Jun 10, 2018 at 12:49:44PM +0530, sameeran joshi wrote:
>Hi all,I have been figuring out to work on some project,so while searching
>I found fuzzer implementation project quite interesting,so please can I
>get some information and links about the extension of fuzzer project for
>gcc .
>Can anyone help me please.

Hi,

The deadline for the Google Summer of code project this year has already passed,
so at least for this year it's not possible as a paid project.

However if you're still interested in working on it outside of SoC 
you're welcome of course.

The basic project is to extend an existing C language fuzzer, such
as https://embed.cs.utah.edu/csmith/ or
https://github.com/intel/yarpgen
to cover gcc language extensions,
run it against the compiler and report compiler crashes it finds.

For a description of the gcc language extensions please see 
https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/#toc-Extensions-to-the-C-Language-Family

In addition there are other extensions, such as OpenMP, or the
transactional memory extensions.
https://www.openmp.org/specifications/
https://gcc.gnu.org/wiki/TransactionalMemory

Tasks: 
- Investigate the documentation of some extensions and understand their scope.
Pick a reasonable set to implement. For a short-term project this
could be one or more simple extensions; for a longer project it could be
a subset of a complex extension, such as OpenMP
- Investigate the chosen extensions in the code base of one of the fuzzers
- Run fuzzing against the compiler
- See if it crashes the compiler or generates invalid output
- Investigate bug reports to see if they are not malformed
- Submit bugs

The main challenge of the project is to understand some extensions well
enough that you can implement them in a fuzzer
in a way that the resulting randomly generated code is not malformed.

-Andi


Re: Bootstrap comparison failure! (gcc 4.6.x with -O3)

2011-03-27 Thread Andi Kleen
Witold Baryluk  writes:
>
> make BOOT_CFLAGS="$CFLAGS -flto" CFLAGS_FOR_BUILD="$CFLAGS"
> CXXFLAGS_FOR_BUILD="$CXXFLAGS" bootstrap

Easier is to configure with --with-build-config=bootstrap-lto
then you don't need all the magic CFLAGS lines.
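Concretely, a sketch (source-tree path hypothetical):

```shell
# LTO bootstrap via the build config; BOOT_CFLAGS/-flto are set up for you
../gcc-src/configure --with-build-config=bootstrap-lto
make bootstrap
```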

> And then waited
>
> I actually waited 5 days... (each file compiled about 45minutes on average,
> eating 100% of CPU). Normally whole gcc compiles in 25 minutes on this 
> machine.

It sounds like you don't have enough memory. Did you swap? LTO (or
rather the first phase of it) needs quite a bit more memory than a
normal build. I suspect you don't want to do this with less than 4GB,
better 8GB.

If you have /tmp in shmfs it is also much worse because there will
be large temporary files in memory too (workaround is to use 
TMPDIR=/some/dir/on/disk)

>
> After wait I got this:

You have to use -frandom-seed=1 to avoid LTO bootstrap failures.
If you use the build config line above that is done by default.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: An unusual x86_64 code model

2011-08-11 Thread Andi Kleen
Jed Davis  writes:
>
> But is that the right way to do that, do people think?  Or should I
> look into making this its own -mcmodel option?  (Which would raise the

I would make it a new -mcmodel=... option.

> question of what to call it -- medsmall? smallhigh? altkernel?)  Or is

smallhigh sounds reasonable to me.

-Andi


-- 
a...@linux.intel.com -- Speaking for myself only


Re: Fortran's DO CONCURRENT - make use of it middle-end-wise

2011-09-04 Thread Andi Kleen
Tobias Burnus  writes:
>
> The plan is to translate it as normal loop; however, it would be
> useful if this non-order-dependence could be used by the middle end
> (general optimization or at least for -floop-parallelize-all /
> -ftree-parallelize-loops). Is there a way to tell the middle-end about
> this property?

iirc the cilk plus branch does that for C (it also tells the vectorizer,
which is very useful too). But it needed various changes in the middle
end. Perhaps you could reuse some of that code.

-Andi


Re: Questions Regarding DWARF

2011-09-08 Thread Andi Kleen
Kevin Polulak  writes:
>
> I've tried to gain some knowledge by digging through the GCC source
> but haven't come up with much other than the values of the DW_*
> constants which isn't that important. Are there any files in
> particular I should be looking at?

From the gcc internals manual:

 * Debugging information output

 This is run after final because it must output the stack slot
 offsets for pseudo registers that did not get hard registers.
 Source files are `dbxout.c' for DBX symbol table format,
 `sdbout.c' for SDB symbol table format, `dwarfout.c' for DWARF
 symbol table format, files `dwarf2out.c' and `dwarf2asm.c' for
 DWARF2 symbol table format, and `vmsdbgout.c' for VMS debug symbol
 table format.

-Andi


Re: Merging gdc (GNU D Compiler) into gcc

2011-10-05 Thread Andi Kleen
David Brown  writes:
>
> Some toolchains are configured to have a series of "init" sections at
> startup (technically, that's a matter of the default linker scripts
> and libraries rather than the compiler).  You can get code to run at
> specific times during startup by placing the instructions directly
> within these sections - but it must be "raw" instructions, not
> function definitions (and definitely no "return" statement).

Note that this only works with modern gcc with -fno-toplevel-reorder,
which disables some optimizations, and is currently broken in several
ways with LTO (in the process of being fixed).

Normally you don't get any defined order in these sections; both
unit-at-a-time and LTO reorder them. The linker may also reorder in some
circumstances.

So it's somewhat dangerous to rely on.

-Andi


Re: Vector alignment tracking

2011-10-13 Thread Andi Kleen
Artem Shinkarov  writes:
>
> 1) Currently in C we cannot provide information that an array is
> aligned to a certain number.  The problem is hidden in the fact, that

Have you considered doing it the other way round: when an optimization
needs something to be aligned, make the declaration aligned?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Vector alignment tracking

2011-10-13 Thread Andi Kleen
> Or I am missing someting?

I often see the x86 vectorizer with -mtune=generic generate a lot of
complicated code just to adjust for potential misalignment.

My thought was just that if the alias oracle knows what the original
declaration is, and it's available for changes (e.g. LTO), it would
likely be better to just add an __attribute__((aligned()))
there.

In the general case it's probably harder, you would need some 
cost model to decide when it's worth it.

Your approach of course would still be needed for cases where this
isn't possible. But it sounded like the infrastructure you're building
could in principle do both.

-Andi


Re: RFC: Add --plugin-gcc option to ar/nm

2011-10-15 Thread Andi Kleen
"H.J. Lu"  writes:

> Hi,
>
> --plugin option for ar/nm is very long.  I am proposing to add
> a  --plugin-gcc option.  It can be implemented with
>
> 1. Move LTOPLUGINSONAME from gcc to config/plugins.m4.
> 2. Define LTOPLUGINSONAME for ar/nm.
> 3. For --plugin-gcc, ar/nm call popen using environment variable GCC if set,
> or gcc with -print-prog-name=$LTOPLUGINSONAME to get
> plugin name.
>
> Any comments?

I had some old patches for gcc-ar, gcc-nm etc. That's similar to how
other compilers do it.

http://gcc.gnu.org/ml/gcc-patches/2010-10/msg02471.html

With that it doesn't matter how long the option is.

The original feedback was that shell wrappers were not portable enough.
I actually did C based wrappers recently, but haven't submitted them
yet.

I think wrappers are preferable over shorter options because they
are easier to use and easier to fit into existing makefiles.

The best long term direction probably would be to put a reference
to the plugin into the object files and let BFD find it itself. This
would be most compatible with existing Makefiles. But that would need
more work.
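With such wrappers, a Makefile only needs AR=gcc-ar / NM=gcc-nm and keeps its existing rules; a sketch (archive and object names made up):

```shell
# The wrappers pass --plugin=<liblto_plugin> to the real ar/nm,
# so LTO archives work with unmodified recipes:
gcc-ar rcs libfoo.a foo.o bar.o
gcc-nm foo.o
```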

-andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: asm in inline function invalidating function attributes?

2011-10-16 Thread Andi Kleen
Ulrich Drepper  writes:
>
> It's not guaranteed to work in general.  The problem to solve is that
> I know the function which is called is not clobbering any registers.
> If I leave it with the normal function call gcc has to spill
> registers.  If I can hide the function call the generated code can be
> significantly better.

iirc the new shrink wrapping support that just went in was to address
problems like that. Perhaps try a mainline compiler?

But I agree that such an attribute would be useful. I would use it.

At least the Linux kernel has a couple such cases ("nasty inline asm to
hide register clobbering in calls") and it's always ugly and hard to
maintain.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: asm in inline function invalidating function attributes?

2011-10-17 Thread Andi Kleen
> > At least the Linux kernel has a couple such cases ("nasty inline asm to
> > hide register clobbering in calls") and it's always ugly and hard to
> > maintain.
> 
> It would simply be an alternate ABI that makes all registers callee-saved?

Yes exactly that.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Transactional Memory documentation in extend.texi

2011-11-17 Thread Andi Kleen
On Thu, Nov 17, 2011 at 03:14:57PM +0100, Torvald Riegel wrote:
> We are aware that the TM language constructs should be documented in
> extend.texi.  However, the most recent public version of the C++ TM
> specification document is outdated, and a new version is supposed to be
> released in a few weeks.
> 
> Therefore, we'd like to wait until the release of the new specification
> document so that we can just cite this new document instead of having to
> cite the old one and list all the changes.

Why can't you just describe the extensions yourself?

Shipping something undocumented is just pointless. And even if there's
a specification somewhere, users typically need an end-user-oriented
manual too, which a specification is not.

On the other hand documenting the ABI like you currently do 
doesn't seem very interesting to me, I don't know who ever would
need that. If someone wants to implement their own STM they
can as well read the spec themselves or read some header files.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: LTO multiple definition failures

2012-01-01 Thread Andi Kleen
Sandra Loosemore  writes:
>
> I'm still finding my way around LTO; can anyone who's more familiar
> with this help narrow down where to look for the cause of this?  I
> don't even know if this is a compiler or ld bug at this point.  I'm

I would look into the interaction between the LTO plugin and your ld
(and also try gold if you can)

Generally there are still various issues in these areas which need
workarounds in the LTOed programs, for some things (like ld -r and some
ar) you also need the latest version of HJ Lu's binutils which implement
http://gcc.gnu.org/ml/gcc/2010-12/msg00229.html

A lot of older binutils lds also tended to mishandle mixed LTOed ar
archives.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: LTO multiple definition failures

2012-01-02 Thread Andi Kleen
> Anyway, the problem here isn't that I particularly care about coming up 
> with some workaround to make LTO work, but rather that tests from the 
> gcc testsuite are failing on this target because of what looks like 
> buggy LTO behavior instead of bugs in the target support, and I wanted 
> to be sure this was being tracked somewhere.  I didn't see a relevant 
> issue in either the gcc or binutils bugzillas, but if it's a known 
> consequence of the ld -r problem, I'll shut up and go away again.  ;-)

AFAIK nothing in the test suite exercises the ld -r problem, at least not on x86-linux.
So it may be something else and still worth tracking down.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: lto pseudo-object files and fixed registers

2012-02-07 Thread Andi Kleen
Richard Guenther  writes:
>
> You then can do
>
>  gcc $OPTIONS -flto a.c -o a.o
>  gcc $OPTIONS -flto b.c -o b.o
>  gcc $OPTIONS -ffixed-r9 -ffixed-r10 -flto d.c -o d.o
>  gcc $OPTIONS -ffixed-r9 -ffixed-r10 -flto e.c -o e.o
>  gcc $OPTIONS -flto a.o b.o -o non-fixed-reg-part.o -r -nostdlib
>  gcc $OPTIONS -flto -ffixed-r9 -ffixed-r10 d.o e.o -o fixed-reg-part.o
> -r -nostdlib
>  gcc non-fixed-reg-part.o fixed-reg-part.o
>
> thus, optimize both pieces via partial LTO linking (-r, maybe the -nostdlib is
> not needed) and then do the final link separately.

Note that ld -r with LTO has some problems in 4.6; it will not work fully
reliably.  This has been fixed in 4.7.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: Why no strings in error messages?

2009-08-27 Thread Andi Kleen
Bradley Lucier  writes:

> and RBX is used by XLAT, XLATB. 

XLAT* is generally not used anymore, certainly not in gcc generated code.

> Are 12 registers not enough, in principle, to do scheduling before
> register allocation? 

You want to limit gcc to only 12 registers?

> I was getting a 15% speedup on some numerical

That would assume that using only 12 registers doesn't cause
slowdowns elsewhere, which in fact it likely does.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Prague GCC folks meeting summary report

2009-10-01 Thread Andi Kleen
Richard Guenther  writes:
>
> The wish for more granular and thus smaller debug information (things like
> -gfunction-arguments which would properly show parameter values
> for backtraces) was brought up.  We agree that this should be addressed at a
> tools level, like in strip, not in the compiler.

Is that really the right level? In my experience (very roughly) -g can turn
gcc from CPU-bound to IO-bound (especially considering distributed compiling
approaches), and dropping unnecessary information in external tools would
make the IO penalty even worse.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Prague GCC folks meeting summary report

2009-10-02 Thread Andi Kleen
> To make the above work first the external tools have to add the 
> capabilities, just implementing it in GCC doesn't work for us.

Ok, but there's more to building software than just building rpms.

For example I can definitely see a common use case where a program
is built with line number and unwinding information, but not with types.
Or types are only generated partially using a special "debug file".

And for that it would be best if gcc simply didn't generate unnecessary
information and minimizes IO and disk space requirements.

iirc there was even a patch in this direction from Frank Ch. Eigler

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Andi Kleen
Richard Guenther  writes:
>
> It's likely because you have long long vars on the stack which is
> faster when they are aligned.

It's not faster for 32bit.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


[PATCH] gcc mcount-nofp was Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-20 Thread Andi Kleen
Steven Rostedt  writes:
>
> And frame pointers do add a little overhead as well. Too bad the mcount
> ABI wasn't something like this:
>
>
>   <function>:
>   call    mcount
>   [...]
>
> This way, the function address for mcount would have been (%esp) and the
> parent address would be 4(%esp). Mcount would work without frame
> pointers and this whole mess would also become moot.

I did a patch to do this in x86 gcc some time ago. The motivation
was indeed the frame pointer overhead on Atom with tracing.

Unfortunately it also requires glibc changes (I did those too). For
compatibility and catching mistakes the new function was called
__mcount_nofp.

I haven't tried it with current gcc and last time I missed the 
gcc feature merge window with this.

But perhaps you find it useful. Of course it would need more
kernel changes to probe for the new option and handle it.

Here's the old patch. I haven't retested it with a current
gcc version, but I think it still applies at least.

If there's interest I can polish it up and submit formally.

-Andi


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi (revision 149140)
+++ gcc/doc/tm.texi (working copy)
@@ -1884,6 +1884,12 @@
 of words in each data entry.
 @end defmac
 
+@defmac TARGET_FUNCTION_PROFILE
+Define if the target has a custom function_profiler function.
+The target should not set this macro; it is implicitly set from
+the PROFILE_BEFORE_PROLOGUE macro.
+@end defmac
+
 @node Registers
 @section Register Usage
 @cindex register usage
Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi (revision 149140)
+++ gcc/doc/invoke.texi (working copy)
@@ -593,7 +593,7 @@
 -momit-leaf-frame-pointer  -mno-red-zone -mno-tls-direct-seg-refs @gol
-mcmodel=@var{code-model} -ma...@var{name} @gol
-m32  -m64 -mlarge-data-threshold=@var{num} @gol
--mfused-madd -mno-fused-madd -msse2avx}
+-mfused-madd -mno-fused-madd -msse2avx -mmcount-nofp}
 
 @emph{IA-64 Options}
 @gccoptlist{-mbig-endian  -mlittle-endian  -mgnu-as  -mgnu-ld  -mno-pic @gol
@@ -11749,6 +11749,11 @@
 @opindex msse2avx
 Specify that the assembler should encode SSE instructions with VEX
 prefix.  The option @option{-mavx} turns this on by default.
+
+@item -mmcount-nofp
+Don't force the frame pointer with @option{-pg} function profiling.
+Instead call a new @code{__mcount_nofp} function before a stack 
+frame is set up.
 @end table
 
 These @samp{-m} switches are supported in addition to the above
Index: gcc/target.h
===
--- gcc/target.h(revision 149140)
+++ gcc/target.h(working copy)
@@ -1132,6 +1132,9 @@
*/
   bool arm_eabi_unwinder;
 
+  /* True when the function profiler code is outputted before the prologue. */
+  bool profile_before_prologue;
+
   /* Leave the boolean fields at the end.  */
 };
 
Index: gcc/final.c
===
--- gcc/final.c (revision 149140)
+++ gcc/final.c (working copy)
@@ -1520,10 +1520,8 @@
 
   /* The Sun386i and perhaps other machines don't work right
  if the profiling code comes after the prologue.  */
-#ifdef PROFILE_BEFORE_PROLOGUE
-  if (crtl->profile)
+  if (targetm.profile_before_prologue && crtl->profile)
 profile_function (file);
-#endif /* PROFILE_BEFORE_PROLOGUE */
 
 #if defined (DWARF2_UNWIND_INFO) && defined (HAVE_prologue)
   if (dwarf2out_do_frame ())
@@ -1565,10 +1563,8 @@
 static void
 profile_after_prologue (FILE *file ATTRIBUTE_UNUSED)
 {
-#ifndef PROFILE_BEFORE_PROLOGUE
-  if (crtl->profile)
+  if (!targetm.profile_before_prologue && crtl->profile)
 profile_function (file);
-#endif /* not PROFILE_BEFORE_PROLOGUE */
 }
 
 static void
Index: gcc/gcc.c
===
--- gcc/gcc.c   (revision 149140)
+++ gcc/gcc.c   (working copy)
@@ -797,6 +797,12 @@
 # define SYSROOT_HEADERS_SUFFIX_SPEC ""
 #endif
 
+/* Target can override this to allow -pg/-fomit-frame-pointer together */
+#ifndef TARGET_PG_OPTION_SPEC
+#define TARGET_PG_OPTION_SPEC \
+"%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}}"
+#endif
+
 static const char *asm_debug;
 static const char *cpp_spec = CPP_SPEC;
 static const char *cc1_spec = CC1_SPEC;
@@ -866,8 +872,8 @@
 
 /* NB: This is shared amongst all front-ends, except for Ada.  */
 static const char *cc1_options =
-"%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}}\
- %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\
+ TARGET_PG_OPTION_SPEC
+" %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\
  %{fcompare-debug-second:%:compare-debug-auxbase-opt(%b)} \
  %{!fcompare-debug-second:%{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase 
%b}}}%{!c:%{!S:-auxbase %b}} \
  %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs}\
Index: gcc/target-def.h

Re: detailed comparison of generated code size for GCC and other compilers

2009-12-14 Thread Andi Kleen
John Regehr  writes:

> See here:
>
>   http://embed.cs.utah.edu/embarrassing/
>
> There is a lot of data there.  Please excuse bugs and other
> problems. Feedback would be appreciated.
>

I was a bit surprised by the icc results, because traditionally icc doesn't
have a reputation for good code size (and in my own testing
icc output is usually larger). So I took a look at some of these.

Some of the test cases seem very broken. For example the first 1000%
entry here

http://embed.cs.utah.edu/embarrassing/dec_09/harvest/gcc-head_icc-11.1/

int
fetchBlock (INDATA * in, char *where, int many)
{
  int copy;
  int advance;
  int i;
  int tmp;

  {
advance = 1;
i = 0;
while (i < copy)

"copy" is clearly uninitialized. So how can this function ever have
worked? Depending what's on the stack or in a register it'll corrupt
random memory.

icc wins because it assumes uninitialized variables are 0 and optimizes
away everything.

I wonder if the original program was already broken or was this 
something your conversion introduced?

It might be a good idea to check for uninitialized variable warnings
and remove completely broken examples (unfortunately the gcc uninitialized
variable warnings are not 100% accurate).

Looking further down the table a lot of the differences on
empty-after-optimization functions (lots of 5 vs 2 bytes) seem to be
that gcc-head uses frame pointers and the other compiler
doesn't. Clearly for a fair comparison these settings should be the
same.

-Andi


Re: detailed comparison of generated code size for GCC and other compilers

2009-12-14 Thread Andi Kleen
On Mon, Dec 14, 2009 at 09:30:57AM -0700, John Regehr wrote:
> I also noticed these testcases but decided to leave them in for now. 
> Obviously the code is useless, but it can still be interpreted according to 
> the C standard, and code can be generated.  Once you start going down the 
> road of exploiting undefined behavior to create better code -- and gcc 
> already does this pretty aggressively -- why not keep going?

I'm not sure relying on uninitialized variables for optimization is 
a good idea.

> That said, if there's a clear sentiment that this kind of test case is 
> undesirable, I'll make an effort to get rid of these for subsequent runs. 
> The bottom line is that these results are supposed to provide you folks 
> with useful directions for improvement.

I personally feel that test cases that get optimized away are not 
very interesting.

>
>> Looking further down the table a lot of the differences on 
>> empty-after-optimization functions (lots of 5 vs 2 bytes) seem to be that 
>> gcc-head uses frame pointers and the other compiler doesn't. Clearly for a 
>> fair comparison these settings should be the same.
>
> I wanted to avoid playing flag games and go with -Os (or nearest 
> equivalent) for all compilers.  Maybe that isn't right.

At least for small functions this skews the results badly, I think.
On larger functions it would be less of a problem.

On my gccs on linux actually no frame pointer is the default, but 
that might depend on the actual target configuration it was built for.
I think that default was changed at some point and it also
depends on the OS.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: detailed comparison of generated code size for GCC and other compilers

2009-12-15 Thread Andi Kleen
John Regehr  writes:

>> I would only be worried for cases where no warning is issued *and*
>> unitialized accesses are eliminated.
>
> Yeah, it would be excellent if GCC maintained the invariant that for
> all uses of uninitialized storage, either the compiler or else
> valgrind will issue a warning.

My understanding was that valgrind's detection of uninitialized
local variables is not 100% reliable because it cannot track
all updates of the frames (it's difficult to distinguish stack
reuse from an uninitialized stack)

e.g. 

int f1() { int x; return x; } 
int f2() { int x; return x; } 

int main(void)
{
f1();
f2();
return 0;
}

compiled without optimization so that the variables stay around
still gives no warning in valgrind:

==22573== Memcheck, a memory error detector
==22573== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==22573== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==22573== Command: ./a.out
==22573== 
==22573== 
==22573== HEAP SUMMARY:
==22573== in use at exit: 0 bytes in 0 blocks
==22573==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==22573== 
==22573== All heap blocks were freed -- no leaks are possible
==22573== 
==22573== For counts of detected and suppressed errors, rerun with: -v
==22573== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 5)

On the other hand the compiler tends to warn too much for
uninitialized variables, typically because it cannot handle something
like that:

void f(int flag)
{
    int local;
    if (flag)
        ... /* initialize local */

    if (flag)
        ... /* use local */
}

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: detailed comparison of generated code size for GCC and other compilers

2009-12-15 Thread Andi Kleen
> I am not a valgrind expert so, take the following with a grain of salt
> but I think that the above statement is wrong: valgrind reliably detects
> use of uninitialized variables if you define 'use' as meaning 'affects
> control flow of your program' in valgrind.

It works in some cases for the stack, but not in all. Consider the red zone
in the x86-64 ABI. How should valgrind distinguish an uninitialized red-zone
variable from an initialized one if the stack has been used before? I didn't 
even think it worked in all cases for variables in the real frame.

You're right my example was bogus because it didn't test the control flow.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: GSoC 2010 Project Idea

2010-03-29 Thread Andi Kleen
Artem Shinkarov  writes:

> Hi,
>
> I have a project in mind which I'm going to propose to the GCC in terms of
> Google Summer of Code. My project is not on the list of project ideas
> (http://gcc.gnu.org/wiki/SummerOfCode) that is why it would be very 
> interesting
> for me to hear any opinions and maybe even to find a mentor.

My guess is that the project is a bit too ambitious for a single 
summer. Perhaps try to scale it down to make it more manageable?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: dragonegg in FSF gcc?

2010-04-11 Thread Andi Kleen
Robert Dewar  writes:
>
> Sure you can run some benchmarks and look for missed optimization
> opportunities, that's always worth while, for instance people
> regularly compare gcc and icc to look for cases where the gcc
> optimization can be improved

OT, but there's lots of cool data on all of this on 

http://embed.cs.utah.edu/embarrassing/dec_09/

I spent some time some time ago to file a few "missed optimizations"
bugzillas based on examples there and some got addressed. I'm sure
there are lots more nuggets in there (as in easy cases to improve
gcc), but analyzing each example takes time.

The reason each sample takes time to analyze is that it is sometimes
a bug in the other compiler or different target options or so.
Sometimes it's also undefined code.

Yes there are plenty of bugs in there, but I didn't find any for gcc
at least (but only looked at a relatively limited set)

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: gcc and the kernel

2008-12-30 Thread Andi Kleen
"Brian O'Mahoney"  writes:
>
> Last, a word to the wise, compiler developers, are by nature fairly agressive 
> but, unless you want to work on gcc itself, it is wise to stay a bit behind 
> the bleeding edge, and, unless your systems are excellently backed up,
> ___DONT_BUILD_THE_KERNEL___ with and untested gcc. I mean a kernel you are 
> going to try to run. For an example of why not see the 'ix68 string 
> direction' fiasco noting that an interrupt can happen anywhere and that the 
> CPU x flag is a non-deterministic (to the interrupt code) part of the 
> interrupted context. Oh what fun!

This particular problem has nothing to do with the compiler that built the
kernel.

-Andi
-- 
a...@linux.intel.com


Re: x86-64 and large code model questions/bugs

2009-01-27 Thread Andi Kleen
Steve Ellcey  writes:

> The first one is: would it be reasonable to compile crt files with
> -mcmodel=large by default on the x86-64 linux platform?  Currently the
> crt files are not compiled with this option and that makes it impossible
> to compile a program where the text segment is above 4GB (for example)

Over 2GB.

> because the crt files can't handle the large code model if they aren't
> compiled with this option.

My understanding is that this would likely break old linkers which
didn't do all large model relocations correctly. Right now they
work as long as you don't need the large model.

BTW the standard way to work around large model issues, if your
code isn't really that large but you just want to move it, is to move 
parts of the program into a .so and compile with -fPIC. The large model is
only really needed if you have truly gigantic programs (but then gcc
tends to be quite slow on them anyway)

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: x86-64 and large code model questions/bugs

2009-01-27 Thread Andi Kleen
On Tue, Jan 27, 2009 at 09:33:17AM -0800, Steve Ellcey wrote:
> On Tue, 2009-01-27 at 12:31 +0100, Andi Kleen wrote:
> > Steve Ellcey  writes:
> >
> > > because the crt files can't handle the large code model if they aren't
> > > compiled with this option.
> > 
> > My understanding is that this would likely break old linkers which
> > didn't do all large model relocations correctly. Right now they
> > work as long as you don't need the large model.
> 
> Are these old GNU linkers or old system/non-GNU linkers?  We often

(not that old actually) GNU linkers.

> require a certain minimum version of binutils for GCC to work.

That would be pretty extreme for something as obscure as large model.

> 
> > BTW the standard way to work around large model issues if your
> > code isn't really that large but you just want to move it is to move 
> > parts of the programs into a .so and compile -fPIC. large is only
> > really needed if you really have gigantic programs (but then gcc
> > tends to be also quite slow on them)
> 
> It's not really a gigantic program, it is a program where he wants to
> put all his text in the upper address range (using the --section-start
> option of the GNU linker).

He'll get much better code by putting the program into a -fPIC .so,
loading it from a small stub and then unmap the stub.
large model generates really very bad code because all jumps
will be indirect.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: x86-64 and large code model questions/bugs

2009-01-28 Thread Andi Kleen
On Wed, Jan 28, 2009 at 09:39:39AM +0100, Paolo Bonzini wrote:
> 
> > He'll get much better code by putting the program into a -fPIC .so,
> > loading it from a small stub and then unmap the stub.
> > large model generates really very bad code because all jumps
> > will be indirect.
> 
> Is it also true with -fpie?

Not sure what you mean?

AFAIK -fpie code is the same quality as -fPIC. Both
are much much better than large model.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: x86-64 and large code model questions/bugs

2009-01-28 Thread Andi Kleen
On Wed, Jan 28, 2009 at 10:21:16AM +0100, Paolo Bonzini wrote:
> Andi Kleen wrote:
> > On Wed, Jan 28, 2009 at 09:39:39AM +0100, Paolo Bonzini wrote:
> >>> He'll get much better code by putting the program into a -fPIC .so,
> >>> loading it from a small stub and then unmap the stub.
> >>> large model generates really very bad code because all jumps
> >>> will be indirect.
> >> Is it also true with -fpie?
> > 
> > Not sure what you mean?
> 
> Right, sorry.  I meant "can you also use -fpie and use a linker script
> to relocate the text section", if you want to place it high but the code
> is not gigantic?

That would require -fpie crt*

Probably doable, but needs some more effort than just 
using a classical main() stub.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: GCC & OpenCL ?

2009-02-04 Thread Andi Kleen
"Joseph S. Myers"  writes:
>
> The AMD instruction sets now have public documentation.  Other 
> manufacturers may well follow suit, especially if enough people caring 
> about free software and open documentation choose which GPU to buy on such 
> a basis (which is obviously a personal choice for them).

Intel current generation GPU documentation is available at 
http://intellinuxgraphics.org/documentation.html

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Split Stacks proposal

2009-02-26 Thread Andi Kleen
Ian Lance Taylor  writes:

> I've put a project proposal for split stacks on the wiki at
> http://gcc.gnu.org/wiki/SplitStacks .  The idea is to permit the stack
> of a single thread to be split into discontiguous segments, thus
> permitting many more threads to be active at one time without worrying
> about stack overflow or about wasting lots of stack space for inactive
> threads.  The compiler would have to generate code to support detecting
> when new stack space is needed, and to deal with some of the
> consequences of moving to a new stack.

This is mainly for 32bit systems with tight address space right? On a
64bit system you could just do it through the MMU by reserving enough
free VM space for the stack in advance and then handling the page fault.

> I would be interested in hearing comments about this.

In the wiki: Possible strategies 1. 

Should that be "Each function with a _large_ stack frame" ?

Or perhaps for all functions. After all, even functions with 
a small frame can be nested deeply.

How about alloca() in your scheme?

-Andi



Re: Deprecating Itanium1 for GCC 4.4

2009-03-20 Thread Andi Kleen
Steven Bosscher  writes:

>  case OPT_mfixed_range_:
> @@ -5245,6 +5247,13 @@ ia64_handle_option (size_t code, const char *arg,
> if (!strcmp (arg, processor_alias_table[i].name))
>   {
> ia64_tune = processor_alias_table[i].processor;
> +   if (ia64_tune == PROCESSOR_ITANIUM
> +   && ! warned_merced_deprecated)
> + {
> +   inform ("value %<%s%> for -mtune= switch is deprecated", arg);
> +   inform ("GCC 4.4 is the last release with Itanium1 support");

"... with Itanium1 tuning support"? 

I assume it will still be possible to generate executables that work
there, just not tuned ones. IIRC there are new instructions in Itanium 2, but I
assume you don't plan to drop the code that avoids generating those, just the
scheduling descriptions.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: -O3 and new optimizations in 4.4.0

2009-04-24 Thread Andi Kleen
Robert Dewar  writes:

> Sebastian Pop wrote:
>> On Fri, Apr 24, 2009 at 08:12, Robert Dewar  wrote:
 What would we have to do to make PPL and CLooG required to build GCC?
>>> Why would that be desirable? Seems to me the current situation is
>>> clearly preferable.
>> To enable loop transforms in -O3.
>
> To me, you would have to show very clearly a significant performance
> gain for typical applications to justify the impact of adding
> PPL and CLooG. I don't see it. If you want these transformations
> you can get them, why go to all this disruptive effort for the
> default optimization case?

I think his point was that they would only be widely used if they
were part of -O3, because most users are likely not willing to 
set individual -f optimization flags.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Making CFLAGS=-g1 bigger but more useful

2009-04-25 Thread Andi Kleen
On Fri, Apr 24, 2009 at 08:24:56PM -0400, Frank Ch. Eigler wrote:
> 
> I'm working on a little patch that extends the data produced for the
> little-used (?)  -g1 mode.  Normally, this produces very little DWARF
> data (basically just function declaration locus, PC range, and basic
> backtrace-enabling data).  Compared to normal -g (== -g2) mode, this
> is very small.

I think what I would like to have is a mode that generates line numbers
(so that objdump -S works nicely) but nothing else. That would be useful
for crash dump analysis etc.

-Andi


Re: Making CFLAGS=-g1 bigger but more useful

2009-04-25 Thread Andi Kleen
> Another possibility, though a much bigger amount of work, would be to
> introduce -g options like -f.  The presence of such an option would
> imply -g1 or higher, and then you could add -gparameters,
> -gline-numbers, -gvar-tracking, -gmacros, etc.

I would like to have that.
-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Making CFLAGS=-g1 bigger but more useful

2009-04-25 Thread Andi Kleen
On Sat, Apr 25, 2009 at 10:30:51AM -0700, Ian Lance Taylor wrote:
> Andi Kleen  writes:
> 
> > On Fri, Apr 24, 2009 at 08:24:56PM -0400, Frank Ch. Eigler wrote:
> >> 
> >> I'm working on a little patch that extends the data produced for the
> >> little-used (?)  -g1 mode.  Normally, this produces very little DWARF
> >> data (basically just function declaration locus, PC range, and basic
> >> backtrace-enabling data).  Compared to normal -g (== -g2) mode, this
> >> is very small.
> >
> > I think what I would like to have is a modus to generate line numbers
> > (so that objdump -S works nicely) but nothing else. That would be useful
> > for crash dump analysis etc.
> 
> It's not quite that, but the gold linker has a --strip-debug-non-line
> option which discards all the debugging information except what is
> needed to map addresses to lines.

The reason I would like to have it is that generating so much data
slows down gcc compilation a lot, and there are cases where I don't
need the full data. Stripping it in the linker is a start, but
doesn't address the size of the object files. So doing it in the compiler
would be better.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [Announcement] Creating lightweight IPO branch

2009-05-05 Thread Andi Kleen
Xinliang David Li  writes:
>
> If the idea is generally accepted, I will prepare a series of patches
> and submit them to gcc trunk.

I was reading your wiki page. Interesting idea. 

One aspect that wasn't clear to me on reading it was how different
compiler arguments for different files are handled. How would the
compiler compiling another source file know what special arguments
(like -I -D or special -f options) it needs? Or in what directory it
was compiled in for -I paths? Or do you assume that is always all the
same?

The other thing that wasn't clear to me was the worst case
behaviour. If there's a single large file x.c that has a function that
is called from 100 other small files with hot call sites then x.c
would be compiled 100 times, right? Is there something to mitigate
such worst case behaviour?

Thanks,
-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [Announcement] Creating lightweight IPO branch

2009-05-05 Thread Andi Kleen
On Tue, May 05, 2009 at 10:25:13AM -0700, Xinliang David Li wrote:
> Andi,
> 
> On Tue, May 5, 2009 at 1:49 AM, Andi Kleen  wrote:
> > Xinliang David Li  writes:
> >>
> >> If the idea is generally accepted, I will prepare a series of patches
> >> and submit them to gcc trunk.
> >
> > I was reading your wiki page. Interesting idea.
> >
> > One aspect that wasn't clear to me on reading it was how different
> > compiler arguments for different files are handled. How would the
> > compiler compiling another source file know what special arguments
> > (like -I -D or special -f options) it needs? Or in what directory it
> > was compiled in for -I paths? Or do you assume that is always all the
> > same?
> 
> This is actually mentioned in the document. Please see the section
> about the module info record format. In each module info, the
> information about -I paths, -D/-Us are recorded. Those will be passed
> to the preprocessor when each module is processed.

I found it after the post :/. But it seems to be only a small subset of
options.  How about -m or -f flags?

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: New GCC releases comparison and comparison of GCC4.4 and LLVM2.5 on SPEC2000

2009-05-13 Thread Andi Kleen
"Joseph S. Myers"  writes:

> On Tue, 12 May 2009, Chris Lattner wrote:
>
>> 1. I have a hard time understanding the code size numbers.  Does 10% mean 
>> that
>> GCC is generating 10% bigger or 10% smaller code than llvm?
>
> I have a different comment on the code size numbers: could we have 
> comparisons of code size for -Os rather than (or in addition to) -O2 and 
> -O3?  If someone is particularly concerned with code size, -Os is what 
> they are expected to use.

It's a slippery slope that -O2 is getting so bad regarding
code size. What should people do who need performance but cannot
completely disregard code size (and can't use profile feedback for
some reason)?

Also, with limited caches, the code size of large programs typically tends
to affect performance too.

In my experience -Os has some significant drawbacks because it
prioritizes code size over everything else. For example, one
-Os problem case we ran into on x86: it always uses the fully generic
hardware division when dividing by a constant, even in cases where a
much faster sequence can be generated by spending a few more bytes.
Also it disables a lot of other useful optimizations.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: New GCC releases comparison and comparison of GCC4.4 and LLVM2.5 on SPEC2000

2009-05-13 Thread Andi Kleen
> From looking http://vmakarov.fedorapeople.org/spec/I2Size32.png it does
> not look that bad at all.  For SpecFP it is different, but code size is

The code size seems to be much worse than LLVM at least, unless
I misread the graphs. 

Also my comment was in regard to the suggestion to try -Os -- -Os 
is not the answer to -O2 code size regressions.

> Rather, we should seriously understand what caused the compilation time
> jump in 4.2, and whether those are still a problem.  We made a good job
> in 4.0 and 4.3 offsetting the slowdowns from infrastructure changes with
> speedups from other changes; and 4.4 while slower than 4.3 at least
> stays below 4.2.  But, 4.2 was a disaster for compilation time.

Yes that would be useful, although I admit for me personally
make -j and icecream do a pretty good job at hiding that pain.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Using C++ in GCC is OK

2010-05-31 Thread Andi Kleen
Mark Mitchell  writes:
>
> I think virtual functions are on the edge; quite useful, but do result
> in the compiler adding a pointer to data objects and in uninlinable
> indirect calls at run-time.  Therefore, I would avoid them in the

Is that still true given profile feedback and the recent
devirtualization work? I would assume the common case
gets inlined then anyway.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


hot/cold pointer annotation

2010-06-10 Thread Andi Kleen
Hi Honza,

Here's an idea to make it easier to manually annotate
large C code bases for hot/cold functions where
it's too difficult to use profile feedback.

It's fairly common here to call functions through
function pointers in manual method tables.

A lot of code is targeted by a few function pointers
(think backends or drivers).

Some of these function pointers always point to cold
code (e.g. init/exit code) while others are usually
hot.

Now, as an alternative to manually annotating the hot/cold 
functions, it would be much simpler to annotate the function
pointers and let the functions that get assigned to them
inherit that.

So for example 

struct ops {
void (*init)() __attribute__((cold));
void (*exit)() __attribute__((cold));
void (*hot_op)() __attribute__((hot));
};

void init_a(void) {} 
void exit_a(void) {}
void hot_op_a(void) {} 

const struct ops objecta = {
.init = init_a,
.exit = exit_a,
.hot_op = hot_op_a
};

/* lots of similar objects with struct ops method tables */

init_a, exit_a and their callees (if they are not
called by anything else) would all automatically become cold,
and hot_op_a (and its unique callees) hot, because they
are assigned to a cold or hot function pointer.

Basically the hot/coldness would be inherited from
a function pointer assignment too.

Do you think a scheme like this would be possible to implement?

Thanks,
-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: hot/cold pointer annotation

2010-06-11 Thread Andi Kleen
Richard Guenther  writes:
>
> Iff the inheritence is restricted to initializers like in the above example
> it should be straight-forward to implement in the frontends.

Great.

For best results it would need some inheritance to callees too, if the
callees are not called by other non-annotated code (that would probably only
work in whole-program mode, but that's fine).

> If it should handle general assignments in code-fragments it would
> be much harder (and a lot more fragile).  So - would restricting
> this to initializers be good enough?

Yes I think only handling it in initializers would be enough.

There are some exceptions where function pointers
are assigned directly, but they are relatively rare.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.

