Re: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM
Dear all,

Do you have any results of performance comparisons between different versions of GCC and LLVM for the *ARM* architecture?

Such comparisons are not easy to find on the Web, since Phoronix usually publishes comparisons for x86 and x86_64, and the last comparisons for ARM date from 2012:

LLVM/Clang vs. GCC On The ARM Cortex-A15 Preview (1 December 2012):
http://www.phoronix.com/scan.php?page=article&item=llvm_gcc_a15&num=1

GCC vs. LLVM/Clang Compilers On ARMv7 Linux (9 May 2012):
http://www.phoronix.com/scan.php?page=news_item&px=MTA5OTM

Has anybody ever tried to measure the dynamics of performance changes of GCC and LLVM (i.e. two comparative graphs, from version to version) for the ARM architecture?

Best regards,
Ilya Palachev
Re: What are open tasks about GIMPLE loop optimizations?
Dear Evgeniya,

Maybe the missed optimizations in the vectorizer will be interesting to you:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

It tracks a lot of open tasks that can strongly influence performance, but many of them have remained unsolved for years. Currently the GCC vectorizer recognizes a certain number of patterns, but there are many that are implemented in ICC or LLVM and not implemented in GCC. (As a quick illustration of the kind of pattern involved, see the example at the very end of this message.)

Best regards,
Ilya

*From:* Evgeniya Maenkova
*Sent:* Friday, August 15, 2014 4:45 PM
*To:* gcc@gcc.gnu.org
*Subject:* What are open tasks about GIMPLE loop optimizations?

Dear GCC Developers,

Nobody has answered my question below, so perhaps something is wrong with my email :) So let me clarify in more detail what I'm asking about.

I've made a very, very basic evaluation of the GCC code ([1]) and started to think about a concrete task to contribute to GCC (language and machine optimization would be interesting to me, in particular loop optimization). I cannot come up with such a task myself, because my knowledge of GCC and of compilers in general is not sufficient for this. And even if I could think of something, GCC developers probably have their own understanding of what is needed.

I then looked at the GCC site to answer my question. What I could find about loop optimizations is the information from GNU Tools Cauldron 2012, "Status of High-level Loop Optimizations". So perhaps this is out of date in 2014.

Unfortunately, I do not have enough time, so I would not commit to managing a task which is on the critical path. (Are you interested only in full-time developers?) So it would be great if you could advise some tasks which could be useful to GCC at some point in the future, but which nobody will miss if I cannot finish them (as you had no time/people for these tasks anyway :) ).

What do you think?

Thanks,
Evgeniya

[1] Used GDB to look inside GCC. Wrote some notes in my blog which could be useful to other newbies (http://perfstories.wordpress.com/2013/11/17/compiler-internals-introduction-to-a-new-post-series/).

-- Forwarded message --
From: Evgeniya Maenkova
Date: Fri, Aug 8, 2014 at 6:50 PM
Subject: GIMPLE optimization passes: any Road Map?
To: gcc@gcc.gnu.org

Dear GCC Developers!

Could you please clarify about the GIMPLE loop passes? Where can I find the latest changes in these passes? Is it trunk or one of the branches? May I look at some road map for GIMPLE loop optimizations?

Actually, I ask these questions because I would like to contribute to GCC. GIMPLE optimizations would be interesting to me (in particular, loop optimizations). However, I'm a newbie at GCC and do not have enough time, so I would not commit to managing a task which is on the critical path. So it would be great if you could advise some tasks which could be useful to GCC at some point in the future, but which nobody will miss if I can't finish them (as you had no time/people for these tasks anyway :) ).

Thank you!
Evgeniya
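P.S. A hypothetical illustration of the kind of pattern involved (my own example, not one of the cases actually tracked in PR53947): a conditionally guarded reduction, whose vectorization support has historically varied between compilers.

// Illustrative only: a sum guarded by a condition. Vectorizing this
// requires support for conditional reductions (e.g. via masked loads
// or select-based idioms), which different compilers gained at
// different times.
int conditional_sum(const int *a, const int *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] > 0)
            sum += b[i];
    return sum;
}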
Performance for AArch64 in ILP32 ABI
Hi,

According to this mail thread
https://gcc.gnu.org/ml/gcc-patches/2013-12/msg00282.html
GCC has ILP32 GNU/Linux support.

1. The question is: how reasonable would it be to use ILP32 mode for building the *whole* Linux distribution, from the point of view of performance?

IIRC, gcc built for i686 can work faster than gcc built for the x86_64 architecture on the same hardware, because there are a lot of data structures with fields of pointer type, and if 32-bit pointers are used, less memory is allocated for these structures. As a result, the smaller structures are loaded from memory faster and fewer cache misses happen (see the small example at the end of this message). Is this the same case for the AArch64 ILP32 ABI?

A second idea is that if integers are 32 bits wide, then twice as many integers can be kept in CPU registers as when they are 64 bits wide, and thus fewer loads/stores to memory are needed.

2. What's the current status of the ILP32 support implementation in GCC?

3. Did anybody try to benchmark AArch64 binaries, ILP32 vs. LP64 builds? Is it possible to compare the performance of these modes?

Best regards,
Ilya Palachev
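P.S. A minimal sketch of the structure-size effect mentioned in question 1 (a hypothetical structure, invented here for illustration):

// Under LP64 each pointer takes 8 bytes; under ILP32 only 4.
// Pointer-heavy structures therefore shrink, and more of them
// fit into each cache line.
#include <cstdio>

struct node {
    struct node *next;
    struct node *prev;
    int key;
};

int main()
{
    // Typically 24 bytes under LP64 (8 + 8 + 4, padded to 8-byte
    // alignment) versus 12 bytes under ILP32.
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}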
[RFC] cortex-a{53,57}-simd.md missing?
Hi all,

This is a question related to the current development of the AArch64 backend.

In the latest trunk revision of GCC 5.0, the directory gcc/config/arm contains the following files:

cortex-a{8,9,15,17}.md
cortex-a{8,9,15,17}-neon.md

These files contain constructions like

(define_insn_reservation insn-name default_latency condition regexp)

for both scalar and vector (NEON) instructions. But for the AdvSIMD AArch64 instructions, only the following lines can be found in cortex-a53.md:

;; Crude Advanced SIMD approximation.
(define_insn_reservation "cortex_53_advsimd" 4
  (and (eq_attr "tune" "cortexa53")
       (eq_attr "is_neon_type" "yes"))
  "cortex_a53_simd0")

Does this mean that all AdvSIMD instructions for Cortex-A53 are supposed to have a latency of 4?

In cortex-a57.md the description of the "neon" instructions is more complete: it contains many statements for different SIMD instructions. It appeared in trunk just a month ago.

Are there any plans to release detailed pipeline descriptions of the SIMD instructions for Cortex-A53? How much can this influence the performance of the generated code?

--
Best regards,
Ilya Palachev
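P.S. For illustration only, a finer-grained model would presumably split the AdvSIMD instructions into separate reservations per instruction class, along these lines (a hypothetical sketch with invented latencies, not actual Cortex-A53 timings, reusing the cortex_a53_simd0 unit from the snippet above):

;; Hypothetical sketch, made-up latencies:
(define_insn_reservation "cortex_a53_advsimd_alu" 3
  (and (eq_attr "tune" "cortexa53")
       (eq_attr "type" "neon_add, neon_logic"))
  "cortex_a53_simd0")

(define_insn_reservation "cortex_a53_advsimd_mul" 5
  (and (eq_attr "tune" "cortexa53")
       (eq_attr "type" "neon_mul_b, neon_mul_h, neon_mul_s"))
  "cortex_a53_simd0")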
Re: AutoFDO profile toolchain is open-sourced
Hi,

Here are some questions about AutoFDO.

On 08.05.2014 02:55, Dehao Chen wrote:
> We have open-sourced AutoFDO profile toolchain in:
> https://github.com/google/autofdo
>
> For GCC developers, the most important tool is create_gcov, which
> converts sampling based profile to GCC-readable profile. Please refer
> to the readme file
> (https://raw.githubusercontent.com/google/autofdo/master/README)
> for more details.

In the mentioned README file it is said that "In order to collect this profile, you will need to have an Intel CPU that has last branch record (LBR) support."

Is this information obsolete? Chrome Canary builds use AutoFDO for ARMv7l (https://code.google.com/p/chromium/issues/detail?id=434587). What about AArch64? Is it supported?

> To use the profile, one need to checkout
> https://gcc.gnu.org/svn/gcc/branches/google/gcc-4_8. We are working
> on porting AutoFDO to trunk
> (http://gcc.gnu.org/ml/gcc-patches/2014-05/msg00438.html).

By now AutoFDO has been merged into the gcc-5.0 (trunk) branch. Is it possible to backport it to the 4.9 branch? Can you estimate the required effort for that?

> We have limited doc inside the open-sourced package, and we are
> planning to add more content to the wiki page
> (https://github.com/google/autofdo/wiki).
>
> Feel free to send me emails or discuss on github if you have any
> questions.
>
> Cheers,
> Dehao

--
Best regards,
Ilya
Re: AutoFDO profile toolchain is open-sourced
Hi,

One more question.

On 10.04.2015 23:39, Jan Hubicka wrote:
> I must say I did not even try running AutoFDO myself (so I am happy
> to hear it works).

I tried to use the create_gcov executable built from the AutoFDO repository on GitHub. The problem is that the data generated by this program is 1600 bytes in size, regardless of the profile data given to it.

Steps to reproduce the issue:

1. Build AutoFDO under x86_64.

2. Build, for example, the benchmark ytest.c (see attachment):

   g++ -O2 -o ytest ytest.c -g2

   (I used a g++ built just now from the gcc-5-branch branch of git://gcc.gnu.org/git/gcc.git)

3. Run it under perf to collect the profile data:

   sudo perf record ./ytest

   perf reports no error and says:

   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.125 MB perf.data (~5442 samples) ]

   perf generates perf.data.

4. Run create_gcov on the obtained data:

   create_gcov --binary ytest --profile perf.data --gcov ytest.gcov --debug_dump

   It creates 2 files:

   * ytest.gcov, which is 1600 bytes in size
   * ytest.gcov.imports, which is empty

   Also there is no debug output from the program.

If I run create_llvm_prof on the data:

create_llvm_prof --binary ytest --profile perf.data --out ytest.out --debug_dump

it reports the following log:

Length of symbol map: 1
Number of functions: 0

and creates an empty file ytest.out. This is not true: all functions in the benchmark are marked with __attribute__((noinline)), and readelf says that they stay in the binary:

readelf -s ytest | grep px_cycle
56: 00400640   111 FUNC  GLOBAL DEFAULT  12 _Z8px_cyclei

readelf -s ytest | grep py_cycle
60: 004006b0    36 FUNC  GLOBAL DEFAULT  12 _Z8py_cyclev

The size of the resulting gcov data is the same (1600 bytes) for different levels of debug information (-g0, -g1, -g2) and for different input source files.

What am I doing wrong?

--
Best regards,
Ilya Palachev

ytest.c:

#define DX (480*4)
#define DY (640*4)

int* src = new int[DX*DY];
int* dst = new int[DX*DY];

int pxm = DX;
int pym = DY;

void px_cycle(int py) __attribute__((noinline));
void px_cycle(int py)
{
    int *p1 = dst + (py*pxm);
    int *p2 = src + (pym - py - 1);
    for (int px = 0; px < pxm; px++) {
        if (px < pym && py < pxm) {
            *p1 = *p2;
        }
        p1++;
        p2 += pym;
    }
}

void py_cycle() __attribute__((noinline));
void py_cycle()
{
    for (int py = 0; py < pym; py++) {
        px_cycle(py);
    }
}

int main()
{
    int i;
    for (i = 0; i < 100; i++) {
        py_cycle();
    }
    return 0;
}
Re: AutoFDO profile toolchain is open-sourced
ping?

On 15.04.2015 10:41, Ilya Palachev wrote:

Hi,

One more question. Does anybody know which options perf should be run with in order to collect data suitable for the AutoFDO converter?

I obtain the same data for different programs, and it seems to be empty (1600 bytes). The files have the same md5sum for different programs:

# Data for a simple program with 30 lines of code:
$ md5sum ytest.gcov
d85481c9154aa606ce4893b64fe109e7  ytest.gcov

# Data for a program constructing a 3D Delaunay triangulation of 100 points:
$ md5sum experimentCGAL_convexHullDynamic.gcov
d85481c9154aa606ce4893b64fe109e7  experimentCGAL_convexHullDynamic.gcov

We tried to collect the perf data using the option --call-graph fp, but it does not help: the output gcov data is still the same.

Sometimes create_gcov reports the following error:

E0421 13:10:37.125629  8732 perf_parser.cc:209] Mapped 50% of samples, expected at least 95%

But this does not mean that there are not enough samples collected in the profile, because 99% of samples are mapped in the case of a very simple program (with 1 function).

I have been trying to find a working case for more than a week, but have not succeeded. Can anybody show me that create_gcov works in at least one case?

--
Best regards,
Ilya Palachev
Re: AutoFDO profile toolchain is open-sourced
On 21.04.2015 14:57, Diego Novillo wrote:
> From the autofdo page: https://github.com/google/autofdo
>
> [ ... ]
>
> Inputs:
>   --profile: PERF_PROFILE collected using linux perf (with last branch
>     record). In order to collect this profile, you will need to have
>     an Intel CPU that has last branch record (LBR) support. You also
>     need to have your linux kernel configured with LBR support. To
>     profile:
>     # perf record -c PERIOD -e EVENT -b -o perf.data -- ./command
>     EVENT is referring to BR_INST_RETIRED:TAKEN if available. For some
>     architectures, BR_INST_EXEC:TAKEN also works.
>
> [ ... ]
>
> The important one for autofdo is -b. It asks perf to use LBR registers
> for branch tracking (assuming your architecture supports it).

Thanks! It worked. Now big programs produce big gcov files. Sorry for this confusing message.

But why does create_gcov not report that no branch events were found? It creates an empty gcov file and says nothing :(

Moreover, the mentioned README says that perf should also be executed with the option -e BR_INST_RETIRED:TAKEN. I tried to add it, but perf said:

invalid or unsupported event: 'BR_INST_RETIRED:TAKEN'
Run 'perf list' for a list of valid events

For my architecture (x86_64), perf list contains:

$ sudo perf list | grep -i br
  branch-instructions OR branches                  [Hardware event]
  branch-misses                                    [Hardware event]
  branch-loads                                     [Hardware cache event]
  branch-load-misses                               [Hardware cache event]
  branch-instructions OR cpu/branch-instructions/  [Kernel PMU event]
  branch-misses OR cpu/branch-misses/              [Kernel PMU event]
  mem:<addr>[:access]                              [Hardware breakpoint]
  syscalls:sys_enter_brk                           [Tracepoint event]
  syscalls:sys_exit_brk                            [Tracepoint event]

There is no BR_INST_RETIRED:TAKEN there. Do you use some specific configuration of perf for that?

However, I tried to use the option "-e branch-instructions". Before that, the following error was obtained:

E0421 15:57:39.308374 11551 perf_parser.cc:210] Mapped 50% of samples, expected at least 95%

and now it has disappeared (because of the option "-e branch-instructions").

Still, the performance decreases after adding the option "-fauto-profile=file.gcov" or "-fprofile-use=file.gcov" to the list of compiler options. The program becomes 10% slower than before. Can you explain that?

Maybe I should configure perf so that it will be able to collect BR_INST_RETIRED:TAKEN events? How can that be done?

--
Best regards,
Ilya Palachev
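P.S. To summarize the combination that at least produces non-empty gcov data here (event name and options as discussed above; the exact event depends on your PMU, and the binary must be built with debug info as in my earlier message):

$ perf record -b -e branch-instructions -o perf.data -- ./ytest
$ create_gcov --binary ytest --profile perf.data --gcov ytest.gcov
$ g++ -O2 -fauto-profile=ytest.gcov -g2 -o ytest ytest.c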
Re: AutoFDO profile toolchain is open-sourced
Hi,

On 21.04.2015 20:25, Dehao Chen wrote:
> OTOH, the most important patch (insn-level discriminator support) is
> not in yet. Cary has just retired. Do you know if anyone would be
> interested in porting insn-level discriminator support to trunk?

Do you mean r210338, r210397, r210523, r214745?

Can you explain why these patches are important for AutoFDO? What work should be done to port them to the current GCC 5 branch? Do you expect them to be applied to the GCC 6 branch?

--
Ilya
Re: AutoFDO profile toolchain is open-sourced
On 11.04.2015 01:49, Xinliang David Li wrote:
> On Fri, Apr 10, 2015 at 3:43 PM, Jan Hubicka wrote:
>>> LBR is used for both cfg edge profiling and indirect call target
>>> value profiling.
>>
>> I see, that makes sense ;) I guess if we want to support profile
>> collection on targets w/o this feature we could still use one of the
>> algorithms that try to guess edge profile from BB profile.
>
> Our experience with sampling cycles or retired instructions to guess
> BB profile has not been great -- the profile quality is significantly
> worse than LBR (which can almost match instrumentation based profile).

Suppose that I have no opportunity to collect the profile on an x86 machine with LBR support, and the only available architecture is ARM/AArch64 (the application code is significantly different when compiled for different architectures, because of manual optimizations and different function names and structure).

Honza has mentioned that it is possible to guess the edge profile from the BB profile. Do you think this could help in the situation described above? Yes, this will be much worse than LBR, but can it give any performance benefit compared with no edge profile at all?

--
Ilya
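P.S. For concreteness, my rough understanding of the simplest form of such an algorithm (my own sketch, not any actual GCC code): distribute each basic block's sample count over its outgoing edges in proportion to statically predicted branch probabilities.

// Hypothetical sketch only: guess edge counts from sampled BB counts
// by splitting each block's count across its outgoing edges according
// to static branch-prediction probabilities. All names are invented
// for illustration.
#include <cstdint>
#include <vector>

struct Edge {
    int src, dst;         // basic block indices
    double static_prob;   // statically predicted edge probability
    uint64_t count = 0;   // guessed execution count (output)
};

struct BasicBlock {
    uint64_t samples;            // sampled execution count
    std::vector<int> succ_edges; // indices into the edge array
};

void guess_edge_counts(const std::vector<BasicBlock> &bbs,
                       std::vector<Edge> &edges)
{
    for (const BasicBlock &bb : bbs) {
        double total = 0;
        for (int e : bb.succ_edges)
            total += edges[e].static_prob;
        if (total <= 0)
            continue;
        for (int e : bb.succ_edges)
            edges[e].count = static_cast<uint64_t>(
                bb.samples * edges[e].static_prob / total);
    }
}

int main()
{
    // One block with 1000 samples and two successor edges predicted
    // at 90%/10%: the guess assigns counts 900 and 100.
    std::vector<Edge> edges = {{0, 1, 0.9}, {0, 2, 0.1}};
    std::vector<BasicBlock> bbs = {{1000, {0, 1}}};
    guess_edge_counts(bbs, edges);
    return 0;
}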