gcc parallel make check
I've noticed that make -j -k check-fortran runs the checks serially, while make -j32 -k check-fortran runs them in parallel. Somehow the explicit 'N' in -jN seems to be needed for the check target, while the other targets do just fine without it. Is that a feature, or should I file a PR for that?

Somewhat related: is there a rule of thumb for how the granularity of the parallel check is decided? E.g. check-fortran seems to be limited to about 5 parallel targets, which is few for a typical server (but of course already a welcome speedup).

Thanks,
Joost
RE: gcc parallel make check
> It is intentional. With -j it is essentially a fork bomb, just don't use it.

Well, silently ignoring it for just this target cost me a lot of time, while an eventual fork bomb would have been dealt with much more quickly.

>> Somewhat related is there a rule of thumb on how is the granularity of
>> parallel check decided ? E.g. check-fortran seems to be limited to about
>> ~5 parallel targets, which is few for a typical server (but of course a
>> welcome speedup already).
>
> The splitting has some cost (e.g. lots of various checks are cached, with
> split jobs they need to be done in each separate goal), and the goal of the
> split is toplevel make check parallelization, not individual directory or
> language testing. For the latter perhaps more fine grained split could be
> useful, but how would one find out if it is a toplevel make check, or say
> make -C gcc check where you test many languages, or check-gfortran?

The cost must be small compared to the possible gain: on a 32-core server, the speedup when testing Fortran FE changes could be about 4x larger. I notice that even on a full check, the Fortran tests are still running when the number of processes is already way below 32. However, the longest running goals (by a few minutes) are these:

expect -- /usr/share/dejagnu/runtest.exp --tool gcc lto.exp weak.exp tls.exp ipa.exp tree-ssa.exp debug.exp dwarf2.exp fixed-point.exp vxworks.exp cilk-plus.exp vmx.exp pch.exp simulate-thread.exp x86_64-costmodel-vect.exp i386-costmodel-vect.exp spu-costmodel-vect.exp ppc-costmodel-vect.exp charset.exp noncompile.exp tsan.exp graphite.exp compat.exp

expect -- /usr/share/dejagnu/runtest.exp --tool g++ lto.exp tls.exp gcov.exp debug.exp dwarf2.exp cilk-plus.exp pch.exp bprob.exp simulate-thread.exp vect.exp charset.exp tsan.exp graphite.exp compat.exp struct-layout-1.exp ubsan.exp tm.exp gomp.exp dfp.exp tree-prof.exp stackalign.exp plugin.exp guality.exp asan.exp ecos.exp

So can those be run more independently?
RE: gcc parallel make check
> What did you expect for -j alone? an error?

No. As is standard in GNU make, a new process for any target that can be processed (i.e. unlimited).

>> ... check-fortran seems to be limited to about ~5 parallel targets ...
>
> Running the make with -j8 gives 7 directories gfortran[1-6]? in gcc/testsuite/.
> Note that the load balancing could be improved: few minutes with a single
> thread over ~20 minutes.

I'd like to have roughly 32 directories (or as many as -jN allows for).
RE: gcc parallel make check
> I have to admit that I don't know why that's the case.

Actually Marc answered that one (I had the wrong mail address for gcc@, so I repeat it here): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53155

> See: gcc/fortran/Make-lang.in, which has:

I'll have a look and do some testing to see what the gains/costs of a further split are.

Joost
RE: gcc parallel make check
>> expect -- /usr/share/dejagnu/runtest.exp --tool gcc lto.exp weak.exp tls.exp
>> ipa.exp tree-ssa.exp debug.exp dwarf2.exp fixed-point.exp vxworks.exp
>> cilk-plus.exp vmx.exp pch.exp simulate-thread.exp x86_64-costmodel-vect.exp
>> i386-costmodel-vect.exp spu-costmodel-vect.exp ppc-costmodel-vect.exp
>> charset.exp noncompile.exp tsan.exp graphite.exp compat.exp
>> expect -- /usr/share/dejagnu/runtest.exp --tool g++ lto.exp tls.exp gcov.exp
>> debug.exp dwarf2.exp cilk-plus.exp pch.exp bprob.exp simulate-thread.exp
>> vect.exp charset.exp tsan.exp graphite.exp compat.exp struct-layout-1.exp
>> ubsan.exp tm.exp gomp.exp dfp.exp tree-prof.exp stackalign.exp plugin.exp
>> guality.exp asan.exp ecos.exp
>>
>> so can those be run more independently ?
>
> It is a moving target, new tests are added every day. I'm trying to adjust
> it during stage3/stage4 occasionally, but it also very much depends on
> which target it is (e.g. i?86/x86_64 has many more tests in i386.exp than
> other targets in their gcc.target), how fast the compiler is on the target
> (e.g. on some targets -g is much slower than on others, etc.).

Could you point me to the right file (or an example commit) for trying to adjust this? I can try to do some testing and come back with some numbers.
[PATCH] RE: gcc parallel make check
> The splits are in the Makefiles, see check_gcc_parallelize

Attached is a patch to improve the parallel performance of 'make -jXX -k check-fortran'. For XX=16 this yields a ~50% speedup, and even with XX=4 we still gain 15%; the measured slowdown at XX=1 (<2%) is within the noise of testing. The patch is a simple update of the 'check_gfortran_parallelize' variable, updating it from its 2008 values to a set that I found roughly optimal based on several tests. Detailed timings are:

# timings/trunk-check-fortran
#cores    average  std. dev.  #tests
     1    2955.32      75.06       3
     2    1735.30     122.26       3
     4     929.51      54.19       3
     8     470.29       7.85       3
    16     468.09       4.29       3
    32     466.06       1.24       3

# timings/patched-check-fortran
#cores    average  std. dev.  #tests
     1    3008.89      16.38       3
     2    1534.17     118.33       3
     4     800.18      31.71       3
     8     418.71       0.20       2
    16     298.29       5.86       3
    32     299.84       1.34       3

There is no effect on a full 'make -j32 -k check', as other goals run for much longer (to be looked at in a follow-up).

A second part of the patch is a new file, 'contrib/generate_tcl_patterns.sh', which generates the regexps needed for the split from a list of the files in the target directory. It groups the initial characters such that each regexp tries not to exceed a maximum number of files, so the number of files is used as a proxy for the runtime. While I don't feel too strongly about adding this (shell/gawk) script, it certainly is convenient, and it makes sure that no characters are missing from the regexp. The maximum number of files per regexp is an input; testing (-j16) with 200, 300 and 400, I found that 300 was optimal for testsuite/gfortran.dg, but this will depend on many things.
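As a rough illustration of the grouping idea described above, here is a minimal Python sketch. It is not the gawk script from the patch: the function names `group_first_chars` and `to_patterns`, the tie-breaking by character, and the unescaped bracket syntax are assumptions made for this example, and unlike the script it only considers first characters that actually occur in the input.

```python
from collections import Counter


def group_first_chars(names, maxcount):
    """Greedily group first characters, largest bucket first.

    A group is closed as soon as adding the next (smaller) bucket would
    exceed `maxcount`, so a single very common character may still form
    a group that is larger than `maxcount` on its own.
    """
    counts = Counter(name[0] for name in names if name)
    # Largest buckets first; ties broken by character (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    groups, label, total = [], "", 0
    for idx, (char, count) in enumerate(ordered):
        label += char
        total += count
        is_last = idx == len(ordered) - 1
        if is_last or total + ordered[idx + 1][1] > maxcount:
            groups.append((label, total))
            label, total = "", 0
    # Report the biggest groups first, as a proxy for the slowest goals.
    return sorted(groups, key=lambda g: -g[1])


def to_patterns(groups, prefix):
    """Turn group labels into glob-style patterns (brackets left unescaped)."""
    return [
        f"{prefix}{label}*" if len(label) == 1
        else f"{prefix}[{''.join(sorted(label))}]*"
        for label, _ in groups
    ]


if __name__ == "__main__":
    demo = [f"pr{i}.f90" for i in range(5)]
    demo += ["a1.f90", "a2.f90", "b1.f90", "b2.f90", "c1.f90"]
    for pattern in to_patterns(group_first_chars(demo, 4), "dg.exp=gfortran.dg/"):
        print(pattern)
```

With real input the file list would come from the test directory, and the brackets would still need the backslash escaping that the Makefile variables use.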
A sample run would look like:

gcc/gcc/testsuite/gfortran.dg> ls -1 | ../../../contrib/generate_tcl_patterns.sh 300 "dg.exp=gfortran.dg/"
Adding label: p matching files:499
Adding label: c matching files:497
Adding label: a matching files:448
Adding label: i matching files:350
Adding label: d matching files:245
Adding label: s matching files:211
Adding label: b matching files:206
Adding label: t matching files:180
Adding label: f matching files:173
Adding label: e matching files:166
Adding label: r matching files:165
Adding label: n matching files:162
Adding label: mu matching files:278
Adding label: wlgo matching files:284
Adding label: vhzPkqWx_-9876543210ZYXVUTSRQONMLKJIHGFEDCBAyj matching files:94
patterns:
dg.exp=gfortran.dg/p* \
dg.exp=gfortran.dg/c* \
dg.exp=gfortran.dg/a* \
dg.exp=gfortran.dg/i* \
dg.exp=gfortran.dg/\[wlgo\]* \
dg.exp=gfortran.dg/\[mu\]* \
dg.exp=gfortran.dg/d* \
dg.exp=gfortran.dg/s* \
dg.exp=gfortran.dg/b* \
dg.exp=gfortran.dg/t* \
dg.exp=gfortran.dg/f* \
dg.exp=gfortran.dg/e* \
dg.exp=gfortran.dg/r* \
dg.exp=gfortran.dg/n* \
dg.exp=gfortran.dg/\[vhzPkqWx_-9876543210ZYXVUTSRQONMLKJIHGFEDCBAyj\]* \

Is the current attached patch OK for trunk?

contrib/ChangeLog

2014-09-05  Joost VandeVondele

        * generate_tcl_patterns.sh: New file.

gcc/fortran/ChangeLog

2014-09-05  Joost VandeVondele

        * Make-lang.in (check_gfortran_parallelize): improved parallelism.

Index: contrib/generate_tcl_patterns.sh
===
--- contrib/generate_tcl_patterns.sh (revision 0)
+++ contrib/generate_tcl_patterns.sh (revision 0)
@@ -0,0 +1,86 @@
+#! /bin/sh
+
+#
+# based on a list of filenames as input,
+# generate regexps that match subsets trying to not exceed a
+# 'maxcount' parameter. Most useful to generate the
+# check_LANG_parallelize assignments needed to split
+# testsuite directories, defining prefix appropriately.
+#
+# Example usage:
+# cd gcc/gcc/testsuite/gfortran.dg
+# ls -1 | ../../../contrib/generate_tcl_patterns.sh 300 "dg.exp=gfortran.dg/"
+#
+# the first parameter is the maximum number of files.
+# the second parameter the prefix used for printing.
+#
+
+# Copyright (C) 2014 Free Software Foundation
+# Contributed by Joost VandeVondele
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3, or (at your option)
+# any later version.
+#
+# GCC is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING. If not, write to
+# the Free Software Foundation, 51 Franklin Street, Fifth Floor,
+# Boston, MA 02110-1301, USA.
+
+gawk -v maxcount=$1 -v prefix=$2 '
+BEGIN{
+
RE: [PATCH] RE: gcc parallel make check
>> > Please sort the letters (LC_ALL=C sort) and where consecutive, use ranges.
>> > Thus \[0-9A-Zhjqvx-z\]*

OK, works fine with the attached patch, and looks cleaner in Make-lang.in.

Now with the proper email address for gcc-patches... I wonder how many times I'll be punished for typos.

The ChangeLog is unmodified.

Joost

Index: contrib/generate_tcl_patterns.sh
===
--- contrib/generate_tcl_patterns.sh (revision 0)
+++ contrib/generate_tcl_patterns.sh (revision 0)
@@ -0,0 +1,108 @@
+#! /bin/sh
+
+#
+# based on a list of filenames as input,
+# generate regexps that match subsets trying to not exceed a
+# 'maxcount' parameter. Most useful to generate the
+# check_LANG_parallelize assignments needed to split
+# testsuite directories, defining prefix appropriately.
+#
+# Example usage:
+# cd gcc/gcc/testsuite/gfortran.dg
+# ls -1 | ../../../contrib/generate_tcl_patterns.sh 300 "dg.exp=gfortran.dg/"
+#
+# the first parameter is the maximum number of files.
+# the second parameter the prefix used for printing.
+#
+
+# Copyright (C) 2014 Free Software Foundation
+# Contributed by Joost VandeVondele
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3, or (at your option)
+# any later version.
+#
+# GCC is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING. If not, write to
+# the Free Software Foundation, 51 Franklin Street, Fifth Floor,
+# Boston, MA 02110-1301, USA.
+
+gawk -v maxcount=$1 -v prefix=$2 '
+BEGIN{
+  # list of allowed starting chars for a file name in a dir to split
+  achars="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
+  ranget="112233"
+}
+{
+  nfiles++ ; files[nfiles]=$1
+}
+END{
+  for(i=1; i<=length(achars); i++) count[substr(achars,i,1)]=0
+  for(i=1; i<=nfiles; i++) {
+    if (length(files[i]>0)) { count[substr(files[i],1,1)]++ }
+  };
+  asort(count,ordered)
+  countsingle=0
+  groups=0
+  label=""
+  for(i=length(achars);i>=1;i--) {
+    countsingle=countsingle+ordered[i]
+    for(j=1;j<=length(achars);j++) {
+      if(count[substr(achars,j,1)]==ordered[i]) found=substr(achars,j,1)
+    }
+    count[found]=-1
+    label=label found
+    if(i==1) { val=maxcount+1 } else { val=ordered[i-1] }
+    if(countsingle+val>maxcount) {
+      subset[label]=countsingle
+      print "Adding label: ", label, "matching files:" countsingle
+      groups++
+      countsingle=0
+      label=""
+    }
+  }
+  print "patterns:"
+  asort(subset,ordered)
+  for(i=groups;i>=1;i--) {
+    for(j in subset){
+      if(subset[j]==ordered[i]) found=j
+    }
+    subset[found]=-1
+    if (length(found)==1) {
+      printf("%s%s* \\\n",prefix,found)
+    } else {
+      sortandcompress()
+      printf("%s\\[%s\\]* \\\n",prefix,found)
+    }
+  }
+}
+function sortandcompress(i,n,tmp,bestj)
+{
+  n=length(found)
+  for(i=1; i<=n; i++) tmp[i]=substr(found,i,1)
+  asort(tmp)
+  for(i=1;i<=n;i++){
+    ipos=index(achars,tmp[i])
+    for(j=i;j<=n;j++){
+      jpos=index(achars,tmp[j])
+      if (jpos-ipos==j-i && substr(ranget,ipos,1)==substr(ranget,jpos,1)) bestj=j
+    }
+    if (bestj-i>3) {
+      tmp[i+1]="-"
+      for(j=i+2;j
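The review request in this message (sort the letters in C-locale order and use ranges where they are consecutive) can be sketched in a few lines of Python. This is only an illustration, not a port of `sortandcompress`: the name `compress_ranges`, the `min_run` threshold of 3 and the rule that ranges never cross the digit/uppercase/lowercase boundaries are assumptions chosen for the example.

```python
import string

# Ranges are only formed inside one of these classes (an assumption
# of this sketch), so something like "9-A" is never produced.
CLASSES = (string.digits, string.ascii_uppercase, string.ascii_lowercase)


def char_class(char):
    """Return the index of the class `char` belongs to, or None."""
    for idx, members in enumerate(CLASSES):
        if char in members:
            return idx
    return None


def compress_ranges(chars, min_run=3):
    """Sort `chars` by code point and collapse consecutive runs.

    Runs of at least `min_run` consecutive characters of the same class
    become "first-last"; everything else is kept as single characters.
    """
    ordered = sorted(set(chars))
    out = []
    i = 0
    while i < len(ordered):
        j = i
        while (j + 1 < len(ordered)
               and char_class(ordered[i]) is not None
               and char_class(ordered[j + 1]) == char_class(ordered[i])
               and ord(ordered[j + 1]) == ord(ordered[j]) + 1):
            j += 1
        run = ordered[i:j + 1]
        if len(run) >= min_run:
            out.append(f"{run[0]}-{run[-1]}")
        else:
            out.extend(run)
        i = j + 1
    return "".join(out)


if __name__ == "__main__":
    print(compress_ranges("zyxvqjh" + string.ascii_uppercase + string.digits))
```

Because a literal '-' sorts before the digits in code point order, it ends up first in the result, where a bracket expression usually treats it as a plain character.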
RE: [PATCH] RE: gcc parallel make check
Attached is an extended version of the patch. It roughly halves the time for make -j32 -k check-gcc (down from 20min to <10min) by modifying check_gcc_parallelize. It includes one non-trivial part, namely a split of the target exps. They are now all split using a common choice (based on i386), which I believe is reasonable as it is the target with the most tests, and the patterns will be somewhat similar for other targets (e.g. the split of p(rxxx)). The implementation of this in the makefile uses an odd-looking technique to substitute spaces with commas in a variable; if this can be done more elegantly, I'm happy to make the change.

Bootstrap and testing revealed one issue: i386.exp hard-codes a loop for the testcase 'vect-args.c' in order to test 10 different combinations of options. With the current split (i.e. target x4) this test will thus be executed 4 times. There are two easy options:

1) keep the current setup, the overhead is small
2) keep the .exp file simple and just replicate this test 10x

I've selected 1), but I can update the patch to do 2). Ideally dg-options in the testcase file itself could be repeated, but I haven't found an example of this.

The script now includes sorting and compression of the ranges, and an additional sanity check on the input, i.e. that file names start with [0-9A-Za-z]. A few files seem to start with _ or # (in ./gcc.dg/cpp/).

I'll follow up with a separate patch to improve check_g++_parallelize.

A full 'make -j32 -k check' is now dominated by libstdc++ testing, which contains single goals that run for ~1100s (e.g. the regex-related tests). These use a slightly different syntax (see gcc/libstdc++-v3/testsuite/Makefile.am) and I'm not yet sure how to deal with the .am files.

Is the current patch OK for trunk?
Joost

Attachment: patch-speedup-checkfortran-v05.CL

Index: contrib/generate_tcl_patterns.sh
===
--- contrib/generate_tcl_patterns.sh (revision 0)
+++ contrib/generate_tcl_patterns.sh (revision 0)
@@ -0,0 +1,114 @@
+#! /bin/sh
+
+#
+# based on a list of filenames as input, starting with [0-9A-Za-z],
+# generate regexps that match subsets trying to not exceed a
+# 'maxcount' parameter. Most useful to generate the
+# check_LANG_parallelize assignments needed to split
+# testsuite directories, defining prefix appropriately.
+#
+# Example usage:
+# cd gcc/gcc/testsuite/gfortran.dg
+# ls -1 | ../../../contrib/generate_tcl_patterns.sh 300 "dg.exp=gfortran.dg/"
+#
+# the first parameter is the maximum number of files.
+# the second parameter the prefix used for printing.
+#
+
+# Copyright (C) 2014 Free Software Foundation
+# Contributed by Joost VandeVondele
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3, or (at your option)
+# any later version.
+#
+# GCC is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING. If not, write to
+# the Free Software Foundation, 51 Franklin Street, Fifth Floor,
+# Boston, MA 02110-1301, USA.
+
+gawk -v maxcount=$1 -v prefix=$2 '
+BEGIN{
+  # list of allowed starting chars for a file name in a dir to split
+  achars="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
+  ranget="112233"
+}
+{
+  if (index(achars,substr($1,1,1))==0){
+    print "file : " $1 " does not start with an allowed character."
+    _assert_exit = 1
+    exit 1
+  }
+  nfiles++ ; files[nfiles]=$1
+}
+END{
+  if (_assert_exit) exit 1
+  for(i=1; i<=length(achars); i++) count[substr(achars,i,1)]=0
+  for(i=1; i<=nfiles; i++) {
+    if (length(files[i]>0)) { count[substr(files[i],1,1)]++ }
+  };
+  asort(count,ordered)
+  countsingle=0
+  groups=0
+  label=""
+  for(i=length(achars);i>=1;i--) {
+    countsingle=countsingle+ordered[i]
+    for(j=1;j<=length(achars);j++) {
+      if(count[substr(achars,j,1)]==ordered[i]) found=substr(achars,j,1)
+    }
+    count[found]=-1
+    label=label found
+    if(i==1) { val=maxcount+1 } else { val=ordered[i-1] }
+    if(countsingle+val>maxcount) {
+      subset[label]=countsingle
+      print "Adding label: ", label, "matching files:" countsingle
+      groups++
+      countsingle=0
+      label=""
+    }
+  }
+  print "patterns:"
+  asort(subset,ordered)
+  for(i=groups;i>=1;i--) {
+    for(j in subset){
+      if(subset[j]==ordered[i]) found=j
+    }
+    subset[found]=-1
+    if (length(found)==1) {
+      printf("%s%s* \\\n",prefix,found)
+    } else {
+      sortandcompress()
+      pri
RE: [PATCH] RE: gcc parallel make check
> +# ls -1 | ../../../contrib/generate_tcl_patterns.sh 300 "dg.exp=gfortran.dg/"
>
> How does this work with subdirectories? Can we replace ls with find?

The input to the script is general, and you can use this to your advantage. For example, I've been using:

ls -1 g++.*/* | cut -c5- | ../../../contrib/generate_tcl_patterns.sh 700 old-deja.exp=g++.old-deja/g++.

to split at a deeper level, or

find . -name "[0-9A-Za-z]*" -type f -printf "%f\n" | ../../../../contrib/generate_tcl_patterns.sh 300 dg-torture.exp=torture/

to collect statistics from subdirs as well.

> + if (_assert_exit) exit 1
>
> Haven't you already exited above?

Yes, but in awk an exit in a main rule still runs the END{} block, so the block has to be protected as above.
RE: [PATCH] RE: gcc parallel make check
> No. As I wrote earlier, splitting on filenames and test counts only is only
> very rough split, all the splits really need to be backed out by real timing
> data from popular targets.

I'm actually doing quite some testing to get a reasonable balance, checking 'completed in' in all *.log.sep files. However, it is important that the procedure is semi-automatic, otherwise few people will be interested in doing so. Furthermore, for parallel performance it is not so important that the times are distributed evenly (it is anyway unlikely that the number of goals is exactly divisible by the N of -jN), but rather that the goals are ordered (executed) from slow to fast (similar to omp schedule guided). Most of the real bottlenecks are single-letter patterns (e.g. p*, since pr is such a common filename prefix), and this is ultimately limiting.

In the project I'm working on (CP2K), we also parallelize testing over directories, but we keep a list of approximate runtimes per directory, and keep that (global) list sorted. Testing follows that list. As a result, we have near perfect parallel speedup, despite (or because of) timings per directory ranging from a few 100s to 1s.

> Also, I'm afraid of some tests being left out
> unintentionally (e.g. the wildcards created at some point, then a new test
> is added with a weird starting character that hasn't been used before and
> suddenly it will not be tested with make -j?).

I agree this is an issue. It is partially addressed by not having to write patterns by hand anymore (i.e. a script does this), and by having the script check its input. There are something like 10 test names that do not fall in [0-9A-Za-z], as mentioned in a previous email.
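To make the slow-to-fast argument concrete, here is a small Python simulation. It is a sketch under simplifying assumptions: jobs are started strictly in list order on whichever slot becomes free first, and the durations are made-up numbers rather than measurements from the GCC testsuite. The helper name `makespan` is hypothetical.

```python
import heapq


def makespan(durations, workers):
    """Total wall time when `workers` slots take jobs in list order.

    Each job is started on the slot that becomes free first, which is a
    simplified model of how a job server hands out goals.
    """
    slots = [0.0] * workers
    heapq.heapify(slots)
    for duration in durations:
        start = heapq.heappop(slots)        # earliest free slot
        heapq.heappush(slots, start + duration)
    return max(slots)


if __name__ == "__main__":
    jobs = [1, 1, 1, 1, 1, 1, 8]            # one slow goal, listed last
    print("as listed:   ", makespan(jobs, 2))
    print("slowest first:", makespan(sorted(jobs, reverse=True), 2))
```

In this toy case the slow goal started last stretches the run to 11 time units on two slots, while starting it first gives 8, even though the goals themselves are far from evenly sized.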
RE: [PATCH] RE: gcc parallel make check
> If you get whitespace right, one can provide multiple different wildcards to
> a single *.exp file, e.g.
> make check-gcc RUNTESTFLAGS="dg.exp='p[0-9A-Za-qs-z]* pr[9A-Za-z]*'" should
> cover all tests starting with p other than pr[0-8]*.c (where you could split
> say pr[0-2]* into another job, pr[3-5]* into another and pr[6-8]* into
> another.

I think this confirms that it becomes very delicate to try and write these more complex patterns. The above would miss p_test.c, p-1.c, etc.? For other classes of files the difference is even further down the filename (e.g. using dates as in 20020508-3.c, going from 2000 to 2014, or avx*), making the automatic generation of the patterns more complicated.

I certainly don't want to claim that the patch I have now is perfect; it is rather an incremental improvement on the current setup.
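A quick way to check such hand-written wildcards is to run them over a list of file names and report what no pattern picks up. The sketch below uses Python's `fnmatch`, which is not the Tcl matcher used by DejaGnu; for simple bracket patterns like these the two should behave alike, but that is an assumption. The helper name `uncovered` and all file names except p_test.c and p-1.c are made up for the example.

```python
from fnmatch import fnmatchcase


def uncovered(names, patterns):
    """Return the names that are not matched by any of the patterns."""
    return [name for name in names
            if not any(fnmatchcase(name, pattern) for pattern in patterns)]


# The two wildcards from the quoted suggestion, plus the three extra
# jobs proposed for pr[0-8]*.
PATTERNS = [
    "p[0-9A-Za-qs-z]*",
    "pr[9A-Za-z]*",
    "pr[0-2]*",
    "pr[3-5]*",
    "pr[6-8]*",
]

if __name__ == "__main__":
    names = ["pointer_1.c", "pr12345.c", "pr9.c", "p_test.c", "p-1.c"]
    print(uncovered(names, PATTERNS))
```

Running the same check with the real file list of a directory before committing a new split would flag any test that silently drops out of the parallel run.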
RE: [PATCH] RE: gcc parallel make check
Now with gzipped figure.. why do these bounce ?

> But if there are jobs that just take 1s to complete, then clearly it doesn't
> make sense to split them off as separate job. I think we don't need 100%
> even split, but at least roughly is highly desirable.

Let me add some data: attached is a graph (logscale y) showing the runtime of tests before and after my changes (including a new patch for c++). There is virtually no change for tests running shorter than 50s; only slowly running tests have been split. Now there are only very few slow tests remaining:

gcc_trunk/obj.new> find . -name "*.log" | xargs grep " completed in " | sort -n -k 5 | tail -n 10
./gcc/testsuite/gcc/gcc.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gcc.dg/torture/dg-torture.exp completed in 521 seconds
./x86_64-unknown-linux-gnu/libstdc++-v3/testsuite/libstdc++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp completed in 530 seconds
./x86_64-unknown-linux-gnu/libstdc++-v3/testsuite/libstdc++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp completed in 553 seconds
./x86_64-unknown-linux-gnu/libgomp/testsuite/libgomp.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libgomp/testsuite/libgomp.fortran/fortran.exp completed in 561 seconds
./gcc/testsuite/gcc/gcc.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gcc.c-torture/compile/compile.exp completed in 625 seconds
./x86_64-unknown-linux-gnu/libstdc++-v3/testsuite/libstdc++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp completed in 683 seconds
./gcc/testsuite/g++/g++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/g++.dg/dg.exp completed in 702 seconds
./x86_64-unknown-linux-gnu/libstdc++-v3/testsuite/libstdc++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp completed in 726 seconds
./gcc/testsuite/gcc/gcc.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gcc.c-torture/execute/execute.exp completed in 752 seconds
./x86_64-unknown-linux-gnu/libstdc++-v3/testsuite/libstdc++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp completed in 904 seconds

They, of course, limit the ultimate speedup.

Attachment: timings.png.gz
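The find/grep/sort pipeline above can also be expressed as a small Python helper, which makes it easier to feed the timings into further processing. This is a sketch only: the function name `slowest` and the regular expression are assumptions based on the log lines quoted in this message, and the sample lines are copied from that listing.

```python
import re

# Matches lines such as "...:testcase /path/to/foo.exp completed in 521 seconds".
COMPLETED = re.compile(r"testcase (\S+) completed in (\d+) seconds")


def slowest(lines, top=10):
    """Return up to `top` (seconds, exp file) pairs, slowest first."""
    timings = []
    for line in lines:
        match = COMPLETED.search(line)
        if match:
            timings.append((int(match.group(2)), match.group(1)))
    return sorted(timings, reverse=True)[:top]


SAMPLE = [
    "./gcc/testsuite/gcc/gcc.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gcc.dg/torture/dg-torture.exp completed in 521 seconds",
    "./gcc/testsuite/g++/g++.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/g++.dg/dg.exp completed in 702 seconds",
    "./gcc/testsuite/gcc/gcc.log:testcase /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gcc.c-torture/execute/execute.exp completed in 752 seconds",
    "some unrelated log line",
]

if __name__ == "__main__":
    for seconds, exp_file in slowest(SAMPLE, top=2):
        print(seconds, exp_file)
```

In practice the lines would be read from the *.log or *.log.sep files of a build tree instead of a hard-coded sample.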
RE: [PATCH] RE: gcc parallel make check
Attached is a further revision of the patch, now also dealing with check-c++. Roughly a 50% speedup here at '-j32' (18m vs 12m). For my setup (--enable-languages=c,c++,fortran) I have now improved all targets called in 'make -j32 -k check'. The latter is now 30% faster (15m vs 20m). Note that these numbers easily fluctuate by +- 1m. I currently have no plans to work on other check targets before this patch is committed.

OK for trunk?

Joost

contrib/ChangeLog

2014-09-09  Joost VandeVondele

        * generate_tcl_patterns.sh: New file.

gcc/fortran/ChangeLog

2014-09-09  Joost VandeVondele

        * Make-lang.in (check_gfortran_parallelize): Improved parallelism.

gcc/ChangeLog

2014-09-09  Joost VandeVondele

        * Makefile.in (check_gcc_parallelize): Improved parallelism.
        (check_p_numbers): Increase maximum value.
        (dg_target_exps): Mention targets as separate words only.
        (null,space,comma,dg_target_exps_p1,dg_target_exps_p2,
        dg_target_exps_p3,dg_target_exps_p4): New variables.

gcc/cp/ChangeLog

2014-09-09  Joost VandeVondele

        * Make-lang.in (check_g++_parallelize): Improved parallelism.

libstdc++-v3/ChangeLog

2014-09-09  Joost VandeVondele

        * testsuite/Makefile.am (check_DEJAGNU_normal_targets): Add
        check-DEJAGNUnormal[11-15].
        (check-DEJAGNU): Split into 15 jobs for parallel testing.
        * testsuite/Makefile.in: Regenerated.

Index: libstdc++-v3/testsuite/Makefile.am
===
--- libstdc++-v3/testsuite/Makefile.am (revision 215017)
+++ libstdc++-v3/testsuite/Makefile.am (working copy)
@@ -101,7 +101,7 @@ new-abi-baseline:
        @test ! -f $*/site.exp || mv $*/site.exp $*/site.bak
        @mv $*/site.exp.tmp $*/site.exp

-check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10)
+check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
 $(check_DEJAGNU_normal_targets): check-DEJAGNUnormal%: normal%/site.exp

 # Run the testsuite in normal mode.
@@ -111,7 +111,7 @@ check-DEJAGNU $(check_DEJAGNU_normal_tar
        if [ -z "$*$(filter-out --target_board=%, $(RUNTESTFLAGS))" ] \
            && [ "$(filter -j, $(MFLAGS))" = "-j" ]; then \
          $(MAKE) $(AM_MAKEFLAGS) $(check_DEJAGNU_normal_targets); \
-         for idx in 0 1 2 3 4 5 6 7 8 9 10; do \
+         for idx in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do \
            mv -f normal$$idx/libstdc++.sum normal$$idx/libstdc++.sum.sep; \
            mv -f normal$$idx/libstdc++.log normal$$idx/libstdc++.log.sep; \
          done; \
@@ -138,25 +138,35 @@ check-DEJAGNU $(check_DEJAGNU_normal_tar
            fi; \
            dirs="`cd $$srcdir; echo [013-9][0-9]_*/*`";; \
          normal1) \
-           dirs="`cd $$srcdir; echo [ab]* de* [ep]*/*`";; \
+           dirs="`cd $$srcdir; echo e*/*`";; \
          normal2) \
-           dirs="`cd $$srcdir; echo 2[01]_*/*`";; \
+           dirs="`cd $$srcdir; echo 28_*/a*`";; \
          normal3) \
-           dirs="`cd $$srcdir; echo 22_*/*`";; \
+           dirs="`cd $$srcdir; echo 23_*/[lu]*`";; \
          normal4) \
-           dirs="`cd $$srcdir; echo 23_*/[a-km-tw-z]*`";; \
+           dirs="`cd $$srcdir; echo 2[459]_*/*`";; \
          normal5) \
-           dirs="`cd $$srcdir; echo 23_*/[luv]*`";; \
+           dirs="`cd $$srcdir; echo 2[01]_*/*`";; \
          normal6) \
-           dirs="`cd $$srcdir; echo 2[459]_*/*`";; \
+           dirs="`cd $$srcdir; echo 23_*/[m-tw-z]*`";; \
          normal7) \
-           dirs="`cd $$srcdir; echo 26_*/* 28_*/[c-z]*`";; \
+           dirs="`cd $$srcdir; echo 26_*/*`";; \
          normal8) \
            dirs="`cd $$srcdir; echo 27_*/*`";; \
          normal9) \
-           dirs="`cd $$srcdir; echo 28_*/[ab]*`";; \
+           dirs="`cd $$srcdir; echo 22_*/*`";; \
          normal10) \
            dirs="`cd $$srcdir; echo t*/*`";; \
+         normal11) \
+           dirs="`cd $$srcdir; echo 28_*/b*`";; \
+         normal12) \
+           dirs="`cd $$srcdir; echo 28_*/[c-z]*`";; \
+         normal13) \
+           dirs="`cd $$srcdir; echo de* p*/*`";; \
+         normal14) \
+           dirs="`cd $$srcdir; echo [ab]* 23_*/v*`";; \
+         normal15) \
+           dirs="`cd $$srcdir; echo 23_*/[a-k]*`";; \
        esac; \
        if [ -n "$*" ]; then cd "$*"; fi; \
        if $(SHELL) -c "$$runtest --version" > /dev/null 2>&1; then \
Index: libstdc++-v3/testsuite/Makefile.in
===
--- libstdc++-v3/testsuite/Makefile.in (revision 215017)
+++ libstdc++-v3/testsuite/Makefile.in (working copy)
@@ -301,7 +301,7 @@ lists_of_files = \
 extract_symvers = $(glibcxx_builddir)/scripts/extract_symvers
 baseline_subdir := $(shell $(CXX) $(baseline_subdir_switch))
-check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10)
+check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)

 # Runs the testsuite, but in compile only mode.
 # Can be used to test sources with non-GNU FE's at various warning
@@ -562,7 +562,7 @@ check-DEJAGNU $(check_DEJAGNU_normal_tar
        if [ -z "$*$(filter-out --target_board=%, $(RUNTES
RE: [PATCH] RE: gcc parallel make check
Thanks for testing. I explained the vect-args.c issue earlier; it is indeed due to i386.exp hardcoding those. The libstdc++ double counts didn't appear in my testing, but I'll have a look. Note that these patterns are handwritten, and thus error prone. The long tests in libstdc++ come from (in timing order, on my machine):

normal1) \
dirs="`cd $$srcdir; echo e*/*`";; \
normal2) \
dirs="`cd $$srcdir; echo 28_*/a*`";; \
normal3) \
dirs="`cd $$srcdir; echo 23_*/[lu]*`";; \
normal4) \
dirs="`cd $$srcdir; echo 2[459]_*/*`";; \
RE: [PATCH] RE: gcc parallel make check
> You mean enhancing the script to split across arbitrarily long prefixes?
> That would be great.

I now have a script that does something like that:

~/test$ find /data/vjoost/gnu/gcc_trunk/gcc/gcc/testsuite/gfortran.dg/ -maxdepth 1 -type f -printf "%f\n" | ./generate_patterns.py 500 foo
All 3947 files matched the pattern ^[0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+ without exception
Final 12 patterns and match count:
(^[j-z_#+-][p-z_#+-][0-9A-Za-i][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[j-z_#+-][0-9A-Za-o][0-9A-Za-m]([.][0-9A-Za-z_#+-]+)+) matching 469 files
(^[0-9A-Za-i][0-9A-Za-n][0-9A-Za-n][0-9A-Za-o][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^([.][0-9A-Za-z_#+-]+)+) matching 433 files
(^[j-z_#+-][0-9A-Za-o][n-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i][0-9A-Za-n][o-z_#+-]([.][0-9A-Za-z_#+-]+)+) matching 400 files
(^[j-z_#+-][p-z_#+-][j-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i]([.][0-9A-Za-z_#+-]+)+) matching 371 files
(^[0-9A-Za-i][o-z_#+-][s-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i][0-9A-Za-n][0-9A-Za-n]([.][0-9A-Za-z_#+-]+)+) matching 323 files
(^[0-9A-Za-i][o-z_#+-][0-9A-Za-r][o-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[j-z_#+-][p-z_#+-]([.][0-9A-Za-z_#+-]+)+) matching 314 files
(^[0-9A-Za-i][o-z_#+-][0-9A-Za-r][0-9A-Za-n][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[j-z_#+-][0-9A-Za-o]([.][0-9A-Za-z_#+-]+)+) matching 314 files
(^[j-z_#+-][0-9A-Za-o][0-9A-Za-m][0-9A-Za-i][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[j-z_#+-]([.][0-9A-Za-z_#+-]+)+) matching 272 files
(^[0-9A-Za-i][0-9A-Za-n][0-9A-Za-n][p-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i][o-z_#+-]([.][0-9A-Za-z_#+-]+)+) matching 270 files
(^[0-9A-Za-i][0-9A-Za-n][o-z_#+-][0-9A-Za-l][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i][0-9A-Za-n]([.][0-9A-Za-z_#+-]+)+) matching 265 files
(^[0-9A-Za-i][0-9A-Za-n][o-z_#+-][m-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+|^[0-9A-Za-i][o-z_#+-][0-9A-Za-r]([.][0-9A-Za-z_#+-]+)+) matching 260 files
^[j-z_#+-][0-9A-Za-o][0-9A-Za-m][j-z_#+-][0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+ matching 256 files

It is a set of patterns that will match any file of the form '^[0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+', but such that it splits a list of input files into roughly equal chunks (e.g. between 500 and 500/2 in this example), even if the files have long overlapping prefixes. However, I'm unsure if and how this can be integrated, i.e. what precisely is allowed for testsuite filenames, and whether this regexp format can be employed in the gcc makefiles / tcl / expect harness. Suggestions and help are appreciated.
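Whatever generates such patterns, two properties are worth checking mechanically: every file name has the allowed form, and every file is matched by exactly one pattern. The Python sketch below illustrates both checks. It is not generate_patterns.py; the helper names, the two toy patterns and the sample file names are assumptions, and anchoring the allowed form at the end of the name (via `fullmatch`) is an extra assumption that the pattern quoted above does not state.

```python
import re

# The overall form quoted in the message, applied to the whole name.
ALLOWED = re.compile(r"[0-9A-Za-z_#+-]*([.][0-9A-Za-z_#+-]+)+")


def invalid_names(names):
    """Names that do not have the allowed overall form."""
    return [name for name in names if not ALLOWED.fullmatch(name)]


def partition_problems(names, patterns):
    """Map each name that is not matched exactly once to its match count."""
    compiled = [re.compile(pattern) for pattern in patterns]
    problems = {}
    for name in names:
        hits = sum(1 for regex in compiled if regex.match(name))
        if hits != 1:
            problems[name] = hits
    return problems


# Two hypothetical patterns splitting on the first character only.
TOY_PATTERNS = [r"^[0-9A-Za-i]", r"^[j-z_#+-]"]

if __name__ == "__main__":
    files = ["alloc_comp_1.f90", "pr12345.f90", "_odd.f90", "~tmp.f90"]
    print(invalid_names(files))
    print(partition_problems(files, TOY_PATTERNS))
```

A count of 0 means a test would be skipped by the split, and a count above 1 means it would run more than once.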
RE: [PATCH] RE: gcc parallel make check
Jakub,

> First of all, the -j2 testing shows more tests tested in gcc and libstdc++:
>
> -# of expected passes 10133
> +# of expected passes 10152
>
> +PASS: 23_containers/set/modifiers/erase/abi_tag.cc (test for excess errors)
> [...]
>
> Not sure where the bug is, could be e.g. in i386.exp for gcc, but for
> libstdc++ less likely to be there rather than in the split.

I looked into this, and I believe this problem is already present in current trunk and is not due to my patch. I.e. unmodified trunk also executes these tests several times:

libstdc++-v3/testsuite/normal4/libstdc++.log.sep:PASS: 23_containers/map/modifiers/erase/abi_tag.cc
libstdc++-v3/testsuite/normal1/libstdc++.log.sep:PASS: 23_containers/map/modifiers/erase/abi_tag.cc

I believe the current trunk patterns could indeed match those twice (Makefile.in in trunk):

normal1) \
dirs="`cd $$srcdir; echo [ab]* de* [ep]*/*`";; \
normal4) \
dirs="`cd $$srcdir; echo 23_*/[a-km-tw-z]*`";; \

Could it be that the pattern in normal1 should have been '[ab]*/ de*/ [ep]*/*'?

Joost
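Duplicates like the ones shown above can be found mechanically by comparing the result lines of the per-split logs. The following Python sketch assumes that each split's log is available as a list of lines keyed by the split name; the function name `duplicated_results` and the list of result keywords are assumptions for this example, and reading the actual normalN/libstdc++.log.sep files is left out.

```python
import re
from collections import defaultdict

# Result lines as they appear in DejaGnu logs, e.g. "PASS: some/test.cc ...".
RESULT = re.compile(r"^(PASS|FAIL|XPASS|XFAIL|UNSUPPORTED|UNRESOLVED): ")


def duplicated_results(logs):
    """Return result lines that show up in more than one split.

    `logs` maps a split name (e.g. "normal1") to the lines of its log.
    """
    seen = defaultdict(set)
    for split, lines in logs.items():
        for line in lines:
            line = line.strip()
            if RESULT.match(line):
                seen[line].add(split)
    return {line: sorted(splits)
            for line, splits in seen.items() if len(splits) > 1}


if __name__ == "__main__":
    logs = {
        "normal1": ["PASS: 23_containers/map/modifiers/erase/abi_tag.cc"],
        "normal4": ["PASS: 23_containers/map/modifiers/erase/abi_tag.cc",
                    "PASS: 23_containers/map/modifiers/erase/other.cc"],
    }
    for line, splits in duplicated_results(logs).items():
        print(line, "->", ", ".join(splits))
```

The second file name in the demo is invented; only the abi_tag.cc line is taken from the logs quoted above.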
RE: [PATCH] RE: gcc parallel make check
> could it be that the pattern in normal1 should have been '[ab]*/ de*/
> [ep]*/*' ?

I've checked that this fixes the bug in the current trunk split, i.e. the files are still tested, but now only once. Consider this change added to the previously submitted patch.
RE: [PATCH] RE: gcc parallel make check
>> could it be that the pattern in normal1 should have been '[ab]*/ de*/
>> [ep]*/*' ?
>
>Yes, we are running these tests multiple times:
>
>PASS: 23_containers/map/modifiers/erase/abi_tag.cc (test for excess errors)
>PASS: 23_containers/multimap/modifiers/erase/abi_tag.cc (test for excess
>errors)
>PASS: 23_containers/multiset/modifiers/erase/abi_tag.cc (test for excess
>errors)
>PASS: 23_containers/set/modifiers/erase/abi_tag.cc (test for excess errors)
>PASS: 26_numerics/complex/abi_tag.cc (test for excess errors)
>
>I'll fix that.

Actually, the proper pattern should presumably be '[ab]*/* de*/* [ep]*/*', even though it seems to make no difference in testing. I'll include this in yet another version of the parallel make check patch (plus some further reshuffling as requested by Jakub), so I think there is no need for you to fix this now.
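A quick way to catch this kind of double testing is to check that a set of shard patterns really partitions the test directories. The following Python sketch does that with fnmatch; the helper names and the directory list are illustrative assumptions, and fnmatch does not treat '/' specially, so it only approximates what the shell glob plus runtest do in the real Makefile.

```python
from fnmatch import fnmatchcase


def shard_hits(paths, shards):
    """Map each path to the names of the shards whose patterns match it."""
    hits = {}
    for path in paths:
        hits[path] = [name for name, patterns in shards.items()
                      if any(fnmatchcase(path, p) for p in patterns)]
    return hits


def report(paths, shards):
    """Return (paths matched by several shards, paths matched by none)."""
    hits = shard_hits(paths, shards)
    duplicated = sorted(p for p, s in hits.items() if len(s) > 1)
    missed = sorted(p for p, s in hits.items() if not s)
    return duplicated, missed


# Sample input only; not a complete listing of the testsuite.
DIRS = ["23_containers/array", "23_containers/list", "23_containers/map",
        "23_containers/set", "23_containers/unordered_map",
        "23_containers/vector"]

# The 23_* patterns as they appear in the v8 patch in this thread.
V8_SHARDS = {
    "normal3": ["23_*/[lu]*"],
    "normal6": ["23_*/[m-tw-z]*"],
    "normal14": ["23_*/v*"],
    "normal15": ["23_*/[a-k]*"],
}

if __name__ == "__main__":
    print(report(DIRS, V8_SHARDS))
```

With the sample directories every entry lands in exactly one shard, so both returned lists are empty; an overlapping or incomplete pattern set shows up immediately in one of the two lists.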
RE: [PATCH] gcc parallel make check
> Here is a patch I'm testing now:

Hi Jakub,

I also tested your patch to compare timings against a newer patch (v8) that I'll send soon:

== patch v8 == make -j32 -k ==
check-fortran 4m58.178s
check-c++     ~10m
check-c       ~10m
check         15m29.873s

== patch Jakub
check-c++     ~20m
check-fortran 3m31.237s
check-c       8m8

On the positive side, your patch provides a further speedup for e.g. fortran and c testing (where it splits things nicely). The libstdc++ bottleneck is not solved, but I guess that is expected.

As you have presumably found as well, your patch introduces a number of failures, because some tests seem to have additional dependencies, either explicit or implicit, e.g.

in gfortran.dg/binding_label_tests_10_main.f03

! { dg-do compile }
! This file must be compiled AFTER binding_label_tests_10.f03, which it
! should be because dejagnu will sort the files.
module binding_label_tests_10_main

in gfortran.dg/class_45b.f03

! { dg-do link }
! { dg-additional-sources class_45a.f03 }

This could clearly trigger as well in the current scheme of splitting, only we have been lucky that dependencies seem to be 'well behaved' in having the same initial letter in the filename.

Joost
RE: [PATCH] gcc parallel make check
> And these Fortran inter-test dependencies, which Tobias told me is
> PR56408.
> For PR56408 we need some fix.

BTW, is there anything special about Fortran? There are at least 180 test files that contain 'dg-additional-sources', some in a very non-local way:

./objc.dg/foreach-2.m: /* { dg-additional-sources "../objc-obj-c++-shared/nsconstantstring-class-impl.m" } */

Joost
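For reference, tests with explicit extra sources can be listed mechanically. The Python sketch below extracts the file names from dg-additional-sources directives; it is an illustration only, assumes the simple directive form seen in the two examples from this thread (no nested target selectors), and the function names are made up.

```python
import re

# Matches "{ dg-additional-sources <args> }" in its simple form only.
DIRECTIVE = re.compile(r"\{\s*dg-additional-sources\s+(.*?)\s*\}")


def additional_sources(text):
    """Return the file names listed in dg-additional-sources directives."""
    names = []
    for match in DIRECTIVE.finditer(text):
        # The argument may be quoted and may list several files.
        arg = match.group(1).strip().strip('"')
        names.extend(arg.split())
    return names


def is_non_local(name):
    """A source living outside the test's own directory."""
    return "/" in name


if __name__ == "__main__":
    fortran = "! { dg-do link }\n! { dg-additional-sources class_45a.f03 }\n"
    print(additional_sources(fortran))
```

Implicit dependencies, such as the "must be compiled AFTER" comment in binding_label_tests_10_main.f03, are not visible to a scan like this.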
RE: [PATCH] gcc parallel make check
>>> >For PR56408 we need some fix.
>> BTW, is there anything special about Fortran ? There are at least 180 test
>> files that contain 'dg-additional-sources'
>some in a very non-local way:
>The current scheme comes at its limits in that case. See the files listed in
>the PR for issues.

So, what about a pragmatic solution: move the tests that rely on being serialized to a subdirectory serialized/ where, as now, we rely on the implicit ordering? At least that makes the assumption somewhat explicit.

Joost
RE: [PATCH] gcc parallel make check
> a newer patch (v8) I'll send soon

Attached, with updated changelog. Compared to the previously posted v6, only libstdc++-v3/testsuite/Makefile.am has been refined to split the e*/* pattern a little more, and two quickly running goals have been merged, in addition to fixing the pre-existing error in some of the patterns in that file. Checked by comparing testsuite results before and after.

Obviously, if Jakub's patch can be made to work around the testsuite special cases, I believe it should be superior. If not, the attached patch is working as far as I can tell, and provides a significant improvement over current trunk.

Joost

contrib/ChangeLog

2014-09-12 Joost VandeVondele

        * generate_tcl_patterns.sh: New file.

gcc/fortran/ChangeLog

2014-09-12 Joost VandeVondele

        * Make-lang.in (check_gfortran_parallelize): Improved parallelism.

gcc/ChangeLog

2014-09-12 Joost VandeVondele

        * Makefile.in (check_gcc_parallelize): Improved parallelism.
        (check_p_numbers): Increase maximum value.
        (dg_target_exps): Mention targets as separate words only.
        (null,space,comma,dg_target_exps_p1,dg_target_exps_p2,
        dg_target_exps_p3,dg_target_exps_p4): New variables.

gcc/cp/ChangeLog

2014-09-12 Joost VandeVondele

        * Make-lang.in (check_g++_parallelize): Improved parallelism.

libstdc++-v3/ChangeLog

2014-09-12 Joost VandeVondele

        * testsuite/Makefile.am (check_DEJAGNU_normal_targets): Add
        check-DEJAGNUnormal[11-15].
        (check-DEJAGNU): Split into 15 jobs for parallel testing, correct
        pattern.
        * testsuite/Makefile.in: Regenerated.
Index: libstdc++-v3/testsuite/Makefile.in
===
--- libstdc++-v3/testsuite/Makefile.in (revision 215147)
+++ libstdc++-v3/testsuite/Makefile.in (working copy)
@@ -301,7 +301,7 @@ lists_of_files = \
 extract_symvers = $(glibcxx_builddir)/scripts/extract_symvers
 baseline_subdir := $(shell $(CXX) $(baseline_subdir_switch))

-check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10)
+check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)

 # Runs the testsuite, but in compile only mode.
 # Can be used to test sources with non-GNU FE's at various warning
@@ -562,7 +562,7 @@ check-DEJAGNU $(check_DEJAGNU_normal_tar
    if [ -z "$*$(filter-out --target_board=%, $(RUNTESTFLAGS))" ] \
        && [ "$(filter -j, $(MFLAGS))" = "-j" ]; then \
      $(MAKE) $(AM_MAKEFLAGS) $(check_DEJAGNU_normal_targets); \
-     for idx in 0 1 2 3 4 5 6 7 8 9 10; do \
+     for idx in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do \
        mv -f normal$$idx/libstdc++.sum normal$$idx/libstdc++.sum.sep; \
        mv -f normal$$idx/libstdc++.log normal$$idx/libstdc++.log.sep; \
      done; \
@@ -589,25 +589,35 @@ check-DEJAGNU $(check_DEJAGNU_normal_tar
      fi; \
      dirs="`cd $$srcdir; echo [013-9][0-9]_*/*`";; \
    normal1) \
-     dirs="`cd $$srcdir; echo [ab]* de* [ep]*/*`";; \
+     dirs="`cd $$srcdir; echo experimental/* ext/[a-m]*`";; \
    normal2) \
-     dirs="`cd $$srcdir; echo 2[01]_*/*`";; \
+     dirs="`cd $$srcdir; echo 28_*/a*`";; \
    normal3) \
-     dirs="`cd $$srcdir; echo 22_*/*`";; \
+     dirs="`cd $$srcdir; echo 23_*/[lu]*`";; \
    normal4) \
-     dirs="`cd $$srcdir; echo 23_*/[a-km-tw-z]*`";; \
+     dirs="`cd $$srcdir; echo 2[459]_*/*`";; \
    normal5) \
-     dirs="`cd $$srcdir; echo 23_*/[luv]*`";; \
+     dirs="`cd $$srcdir; echo 2[01]_*/*`";; \
    normal6) \
-     dirs="`cd $$srcdir; echo 2[459]_*/*`";; \
+     dirs="`cd $$srcdir; echo 23_*/[m-tw-z]*`";; \
    normal7) \
-     dirs="`cd $$srcdir; echo 26_*/* 28_*/[c-z]*`";; \
+     dirs="`cd $$srcdir; echo 26_*/*`";; \
    normal8) \
      dirs="`cd $$srcdir; echo 27_*/*`";; \
    normal9) \
-     dirs="`cd $$srcdir; echo 28_*/[ab]*`";; \
+     dirs="`cd $$srcdir; echo 22_*/*`";; \
    normal10) \
      dirs="`cd $$srcdir; echo t*/*`";; \
+   normal11) \
+     dirs="`cd $$srcdir; echo 28_*/b*`";; \
+   normal12) \
+     dirs="`cd $$srcdir; echo 28_*/[c-z]*`";; \
+   normal13) \
+     dirs="`cd $$srcdir; echo ext/[n-z]*`";; \
+   normal14) \
+     dirs="`cd $$srcdir; echo de*/* p*/* [ab]*/* 23_*/v*`";; \
+   normal15) \
+     dirs="`cd $$srcdir; echo 23_*/[a-k]*`";; \
    esac; \
    if [ -n "$*" ]; then cd "$*"; fi; \
    if $(SHELL) -c "$$runtest --version" > /dev/null 2>&1; then \
Index: libstdc++-v3/testsuite/Makefile.am
===
--- libstdc++-v3/testsuite/Makefile.am (revision 215147)
+++ libstdc++-v3/testsuite/Makefile.am (working copy)
@@ -101,7 +101,7 @@ new-abi-baseline:
    @test ! -f $*/site.exp || mv $*/site.exp $*/site.bak
    @mv $*/site.exp.tmp $*/site.exp

-check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10)
+check_DEJAGNU_normal_targets = $(patsubst %,check-DEJAGNUnormal%,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1
RE: [PATCH] gcc parallel make check
>> Regtested on x86_64-linux, ok for trunk?
>
>Oh, forgot to say, PR56408 isn't fixed by this patch, but given the
>higher granularity (10 tests instead of 1) we don't happen to trigger it
>right now.

Which means that any commit to that dir could trigger it, right?
RE: [PATCH] gcc parallel make check
> So, I’d love to see the numbers for 5 and 20 to double check that 10 is the
> right number to pick. This sort of refinement is trivial post checkin.

So, some timings with the patch; I think this is great. Doing the testing you suggest, changing the variable doesn't influence things much (at least for Fortran, and on this system).

make -j32 -k check-fortran

real 3m27.875s -> gcc_runtest_parallelize_counter_minor == 02 (several testsuite errors: binding_label_tests_10_main.f03, binding_label_tests_11_main.f03, class_45b.f03, class_4b.f03, class_4c.f03, coarray_29_2.f90, test_common_binding_labels_3_main.f03)
real 3m26.234s -> gcc_runtest_parallelize_counter_minor == 05 (one additional testsuite error: whole_file_31.f90)
real 3m36.405s -> gcc_runtest_parallelize_counter_minor == 10
real 3m38.736s -> gcc_runtest_parallelize_counter_minor == 20

check-c    real 8m26.935s
check-c++  real 7m4.165s
check      real 17m45.185s
RE: [PATCH] gcc parallel make check
>> > These numbers are useful to try and ensure the overhead (scaling factor)
>> > is reasonable, thanks.
>>
>> A nice improvement indeed. The patched result is 15 times faster
>> than the serial unpatched run. So there is room for improvement
>
> Note, the box used was oldish AMD 16-core, no ht, box, haven't tried it on
> anything

on a 32 core box, no ht, I see these timings:

time make -j32 -k check >& log.check32 ; time make -j8 -k check >& log.check8

real 18m14.562s
user 260m21.578s
sys  264m26.042s

real 41m33.210s
user 233m4.563s
sys  72m11.429s

so it is not quite reaching the ideal 4x speedup. Counting the number of 'expect' processes, they are nicely at around 32 and 8 for the full test, with only a very short tail near the end. So, there might be some overhead somewhere. Total user time is similar, but time in sys goes up.
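The gap to the ideal 4x can be put in numbers directly from the timings quoted above. This small Python sketch only redoes that arithmetic; it assumes the first timing block belongs to the -j32 run and the second to the -j8 run, following the order of the two commands.

```python
def seconds(stamp):
    """Convert an 'XmY.Zs' string as printed by `time` into seconds."""
    minutes, rest = stamp.rstrip("s").split("m")
    return 60 * int(minutes) + float(rest)


# Wall-clock and system times copied from the message above.
real_j32, sys_j32 = seconds("18m14.562s"), seconds("264m26.042s")
real_j8, sys_j8 = seconds("41m33.210s"), seconds("72m11.429s")

speedup = real_j8 / real_j32        # observed gain from -j8 to -j32
efficiency = speedup / (32 / 8)     # fraction of the ideal 4x
sys_growth = sys_j32 / sys_j8       # how much the time in sys increased

if __name__ == "__main__":
    print("speedup %.2f, efficiency %.2f, sys growth %.2f"
          % (speedup, efficiency, sys_growth))
```

The observed speedup comes out at roughly 2.3x, a bit under 60% of the ideal, while the time spent in sys grows by more than a factor of three, which fits the remark that the overhead sits on the system side.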
msan and gcc ?
Hi, I've noticed that gcc includes a msan_interface.h file, and I'm wondering if this implies that memory sanitizer is already part of gcc. If not, are there plans to port this useful looking tool to gcc during the current stage 1 ? Cheers, Joost
RE: msan and gcc ?
> it was certainly worth it.

Since I see msan as a kind of valgrind replacement (similar functionality, but ~10x the speed, partially at the cost of more difficult deployment), I did a quick search in gcc bugzilla. 982 PRs mention valgrind, so such functionality is clearly heavily used.
lto and gold
I'd like to test lto on a project where objects first go through an archive, and so wanted to follow http://gcc.gnu.org/wiki/LinkTimeOptimization using 'gcc -use-linker-plugin' However, I can't get this to work. gfortran -use-linker-plugin -flto main.f90 test.f90 /data03/vondele/binutils-2.19.1/build/bin/ld: -plugin: unknown option /data03/vondele/binutils-2.19.1/build/bin/ld: use the --help option for usage information collect2: ld returned 1 exit status /data03/vondele/binutils-2.19.1/build/bin/ld -v GNU gold (GNU Binutils 2.19.1) 1.7 I guess this is some configure flag missing, does anybody have a clue? gcc configured as: /data03/vondele/gcc_lto/gcc/configure --prefix=/data03/vondele/gcc_lto/build --with-libelf=/data03/vondele/libelf-0.8.10/build/ --enable-gold --enable-languages=c,c++,fortran --disable-multilib -disable-bootstrap binutils as: ./configure --prefix=/data03/vondele/binutils-2.19.1/build --enable-gold This is what collect2 sees: /data03/vondele/gcc_lto/build/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/collect2 -plugin /data03/vondele/gcc_lto/build/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/liblto_plugin.so -plugin-opt=/data03/vondele/gcc_lto/build/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper -plugin-opt=gfortran -plugin-opt=-flto -flto --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -use-linker-plugin /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o /data03/vondele/gcc_lto/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/crtbegin.o -L/data03/vondele/gcc_lto/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0 -L/data03/vondele/gcc_lto/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/data03/vondele/gcc_lto/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/../../.. /tmp/ccUQ7wr3.o /tmp/cczQrSMz.o -lgfortran -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /data03/vondele/gcc_lto/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/crtend.o /usr/lib/../lib64/crtn.o Thanks, Joost
Re: lto and gold
I guess this is some configure flag missing, does anybody have a clue?

Yes, you must build with --enable-gold --enable-plugins :-)

Is that for gcc or for binutils (neither documents this in ./configure --help)? I used it for both, but only get this to work with binutils CVS; is that correct?

Now, however, I get the following error:

gfortran -flto -use-linker-plugin main.f90 test1.f90 test2.f90
collect2: ld terminated with signal 6 [Aborted]
ld: /data03/vondele/gcc_lto/gcc/lto-plugin/lto-plugin.c:142: parse_table_entry: Assertion `t <= 4' failed.

with

==> main.f90 <==
CALL S1
CALL S2
END
==> test1.f90 <==
SUBROUTINE S1
END SUBROUTINE
==> test2.f90 <==
SUBROUTINE S2
END SUBROUTINE S2

and similar for C based sources.

Thanks,

Joost
Re: Reducing fortran testcase with delta.
Hi Li,

I've attached a 'Fortran-aware' delta. It tries to cut a Fortran file in more reasonable places (e.g. between subroutine boundaries, after enddos). It works reasonably well, but is a hack. Especially with Fortran90 and modules, iterated delta runs can help a lot (i.e. the first run removes 'public/use' module statements, the next round cleans more efficiently). It also features 'randomized' bisection. That helps to reduce towards a minimized testcase when iterating delta runs. I usually call it with the following script:

cat do_many
for i in `seq 1 30`
do
  ~/delta-2006.08.03/delta -suffix=.f90 -test=delta.script -cp_minimal=small.f90 bug.f90
  cp small.f90 small.f90.$i
  cp small.f90 bug.f90
done

Cheers,

Joost

#!/usr/bin/perl -w
# delta; see License.txt for copyright and terms of use
use strict;

#
# Implementation of the delta debugging algorithm:
# http://www.st.cs.uni-sb.de/dd/
# Daniel S. Wilkerson d...@cs.berkeley.edu

# Notes:
# The test script should not depend on the current directory to work.
# Note that 1-minimality does not imply idempotency, so we could
# re-run once it is stuck, perhaps with some randomization.

# Global State
my @chunks = ();                # Once input, is read only.
my @markers = ();               # Delimits a dynamic subsequence of @chunks being considered.
my %test_cache = ();            # Cached test results.
# Mark boundaries that uniquely determine the marked contents. This
# is used as a shorter key to hash on than the contents themselves.
# Since Perl hashes retain their keys if you don't do this you get a
# horrible memory leak in the test_cache.
my $mark_signature;
# End of the last marker rendered to the tmp file. Used to figure out
# if the next one abuts it or not.
my $last_mark_stop;
my @current_markers;            # Markers to be rendered to $tmpinput if answer not in cache.
my $tmpinput;                   # Temporary file to render marked subsequence to.
my $last_successful_tmpinput;   # Last one to pass the test.
my $tmp_index = 0;              # Cache the last index used to make a tmp file.
my $tmpdir_index = 0;           # Cache the last index used to make a tmp directory.
my $tmpdir;                     # Temporary directory for external programs.
my $logfile = "log";            # File in $tmpdir where log of successful runs is written.
chomp (my $this_dir = `pwd`);   # The current directory.
my $starttime = time;           # The time we started.
my $granularity = "line";       # What is the size of an input chunk?
my $dump_input = 0;             # Dump out the input after reading it in.
my $cp_minimal;                 # Copy the minimal successful test to the current dir.
my $verbose = 0;                # Be more verbose.
my $quiet = 0;                  # Prints go to /dev/null.
my $suffix = ".c";              # For now, our input files are .c files.
my $test;                       # The script to run as the test.

# when true, all operations on input file are in-place:
# - don't make a new directory
# - overwrite the original input file with our constructed inputs
my $in_place = 0;
my $start_file;                 # name of input/output file for in_place

my $help_message = <<"END"
Delta version 2003.7.14
delta implements the delta-debugging algorithm: http://www.st.cs.uni-sb.de/dd/
Implemented by Daniel Wilkerson.

usage: $0 [options] start-file
  -test=                 Specify the test script.
  -suffix=               Candidate filename suffix [$suffix]
  -dump_input            Dump input after reading
  -cp_minimal=           Copy the minimal successful test to the current directory
  -granularity=line      Use lines as the granularity (default)
  -granularity=top_form  Use C top-level forms as the granularity
                         (currently only works with CIL output)
  -log=                  Log file for main events
  -quiet                 Say nothing
  -verbose               Get more verbose output
  -in_place              Overwrite start-file with inputs
  -help                  Get help

The test program accepts a single argument, the name of the candidate
file to test. It is run within a directory containing only that file,
and it can make temporary files/directories in that directory. It
should return zero for a candidate that exhibits the desired property,
and nonzero for one that does not.

Example test program (delta will retain a line containing "foo"):
#!/bin/sh
grep 'foo' <"\$1" >/dev/null
END
;

# Functions
sub output(@) {
    print @_ unless $quiet;
}

# Return true if the current_markers pass the interesting test.
sub test {
    if (-f "DELTA-STOP") {
        output "Stopping because DELTA-STOP file exists\n";
        exit 1;
    }
    my $cached_result = $test_cache{$mark_signature};
    if (defined $cached_result) {
        output
trunk bootstrap failure?
Current trunk fails for me with /data04/vondele/gcc_trunk/obj/./gcc/xgcc -B/data04/vondele/gcc_trunk/obj/./gcc/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/bin/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/lib/ -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/include -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/sys-include -g -O2 -O2 -g -O2 -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -I. -I. -I../.././gcc -I/data04/vondele/gcc_trunk/gcc/libgcc -I/data04/vondele/gcc_trunk/gcc/libgcc/. -I/data04/vondele/gcc_trunk/gcc/libgcc/../gcc -I/data04/vondele/gcc_trunk/gcc/libgcc/../include -I/data04/vondele/gcc_trunk/gcc/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _trampoline.o -MT _trampoline.o -MD -MP -MF _trampoline.dep -DL_trampoline -c /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c \ -fvisibility=hidden -DHIDE_EXPORTS In file included from /usr/include/features.h:354, from /usr/include/stdio.h:28, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/tsystem.h:90, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c:33: /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory In file included from /usr/include/features.h:354, from /usr/include/stdio.h:28, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/tsystem.h:90, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c:33: /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory In file included from /usr/include/features.h:354, from /usr/include/stdio.h:28, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/tsystem.h:90, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c:33: /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory In file included from /usr/include/features.h:354, from 
/usr/include/stdio.h:28, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/tsystem.h:90, from /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c:33: /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory make[5]: *** [_muldi3.o] Error 1 make[5]: *** Waiting for unfinished jobs /data04/vondele/gcc_trunk/obj/./gcc/xgcc -B/data04/vondele/gcc_trunk/obj/./gcc/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/bin/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/lib/ -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/include -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/sys-include -g -O2 -O2 -g -O2 -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -I. -I. -I../.././gcc -I/data04/vondele/gcc_trunk/gcc/libgcc -I/data04/vondele/gcc_trunk/gcc/libgcc/. -I/data04/vondele/gcc_trunk/gcc/libgcc/../gcc -I/data04/vondele/gcc_trunk/gcc/libgcc/../include -I/data04/vondele/gcc_trunk/gcc/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o __main.o -MT __main.o -MD -MP -MF __main.dep -DL__main -c /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c \ -fvisibility=hidden -DHIDE_EXPORTS make[5]: *** [_negdi2.o] Error 1 /data04/vondele/gcc_trunk/obj/./gcc/xgcc -B/data04/vondele/gcc_trunk/obj/./gcc/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/bin/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/lib/ -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/include -isystem /data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/sys-include -g -O2 -O2 -g -O2 -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -I. -I. 
-I../.././gcc -I/data04/vondele/gcc_trunk/gcc/libgcc -I/data04/vondele/gcc_trunk/gcc/libgcc/. -I/data04/vondele/gcc_trunk/gcc/libgcc/../gcc -I/data04/vondele/gcc_trunk/gcc/libgcc/../include -I/data04/vondele/gcc_trunk/gcc/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _absvsi2.o -MT _absvsi2.o -MD -MP -MF _absvsi2.dep -DL_absvsi2 -c /data04/vondele/gcc_trunk/gcc/libgcc/../gcc/libgcc2.c \ -fvisibility=hidden -DHIDE_EXPORTS make[5]: *** [_lshrdi3.o] Error 1 /data04/vondele/gcc_trunk/obj/./gcc/xgcc -B/data04/vondele/gcc_trunk/obj/./gcc/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/bin/ -B/data04/vondele/gcc_trunk/build/x86_64-unknown-linux-gnu/lib/ -isystem /data04/von
Re: trunk bootstrap failure?
That is on a standard Linux (x86_64) box running openSUSE 11.0, and a clean checkout. Is this a known problem?

You haven't installed the 32-bit glibc devel package.

Many thanks, that fixed it. It would be great if such a thing could be detected at configure time (i.e. like missing mpfr.h headers are already detected), with some kind of a gentle error message.
Re: trunk bootstrap failure?
Would be great if such a thing could be detected at configure time (i.e. like missing mpfr.h headers are already detected), with some kind of a gentle error message. It wouldn't be detected until the target libs are built, since that's the first time any 32-bit headers are needed. Patches welcome. Is this useful ? Index: install.texi === --- install.texi(revision 142790) +++ install.texi(working copy) @@ -4070,6 +4070,7 @@ (amd64-*-* is an alias for x86_64-*-*) on GNU/Linux, FreeBSD and net...@. On GNU/Linux the default is a bi-arch compiler which is able to generate both 64-bit x86-64 and 32-bit x86 code (via the @option{-m32} switch). +This requires that both 32 and 64 bit header files are installed on the system. @html also, this likely fixes a typo Index: cvs.html === RCS file: /cvs/gcc/wwwdocs/htdocs/cvs.html,v retrieving revision 1.213 diff -c -p -r1.213 cvs.html *** cvs.html30 Dec 2007 09:01:19 - 1.213 --- cvs.html17 Dec 2008 12:04:09 - *** and SSH installed, you can check out the *** 36,42 Set CVS_RSH in your environment to ssh. Set CVSROOT in your environment to :pserver:c...@gcc.gnu.org:/cvs/gcc. ! Alternately add -d :pserver:c...@gcc.gnu.org:/cvs/gcc immediately after cvs in the commands below. The command cvs -qz9 checkout -P wwwdocs, --- 36,42 Set CVS_RSH in your environment to ssh. Set CVSROOT in your environment to :pserver:c...@gcc.gnu.org:/cvs/gcc. ! Alternatively add -d :pserver:c...@gcc.gnu.org:/cvs/gcc immediately after cvs in the commands below. The command cvs -qz9 checkout -P wwwdocs,
Re: gfortran / gdb question
Actually, I just found out that this seems to be a 4.4 issue; compiled with 4.3, the gdb session goes fine... I also seem to be able to debug small examples with either 4.3 or 4.4, just CP2K seems to cause trouble (as usual ;-)

I've filed PR39073 for this, and somehow hope this can be solved before the release (ugh.. show it is not Fortran (I've made it debug) and declare it P1?)

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39073
optimization question
The attached code (see contract__sparse) is a kernel which I hope gets optimized well. Unfortunately, compiling it (on opteron or core2) as

gfortran -O3 -march=native -ffast-math -funroll-loops -ffree-line-length-200 test.f90
./a.out
 Sparse: time[s] 0.66804099
 New: time[s] 0.20801300
 speedup 3.2115347
 Glfops 3.1151900
 Error: 1.11022302462515654E-016

shows that the hand-optimized version (see contract__test) is about 3x faster. I played around with options, but couldn't get gcc to generate fast code for the original source. I think that this would involve unrolling a loop and scalarizing the scratch arrays buffer1 and buffer2 (as done in the hand-optimized version). So, is there any combination of options to get that effect?

Second question: even the code generated for the hand-optimized version is not quite ideal. The asm of the inner loop appears (like the source) to contain about 4*81 multiplies. However, a 'smarter' way to do the calculation would be to compute the constants used for multiplying work(i) by retaining common subexpressions (i.e. all values of sa_i * sb_j * sc_k * sd_l * work[n] can be computed in 9+9+81+81 multiplies instead of the current scheme, which has 4*81). That could bring another factor of 2 speedup. Is there a chance to have gcc see this, or does this need to be done at the source level?

If considered useful, I can add a PR to bugzilla with the testcase.
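The multiply counts quoted above (4*81 versus 9+9+81+81) can be checked with a few lines of Python. This is only a sketch of the operation count, assuming dense length-3 factors sa, sb, sc, sd and an 81-element work array; it does not reproduce the indexing of the actual Fortran kernel, and the function names are made up.

```python
from itertools import product


def naive(sa, sb, sc, sd, work):
    """Form sa_i*sb_j*sc_k*sd_l*work[n] term by term: 4 multiplies each."""
    out, mults = [], 0
    for n, (i, j, k, l) in enumerate(product(range(3), repeat=4)):
        out.append(sa[i] * sb[j] * sc[k] * sd[l] * work[n])
        mults += 4
    return out, mults


def factored(sa, sb, sc, sd, work):
    """Reuse the pair products sa_i*sb_j and sc_k*sd_l across all terms."""
    ab = [[sa[i] * sb[j] for j in range(3)] for i in range(3)]  # 9 multiplies
    cd = [[sc[k] * sd[l] for l in range(3)] for k in range(3)]  # 9 multiplies
    out, mults = [], 18
    for n, (i, j, k, l) in enumerate(product(range(3), repeat=4)):
        coeff = ab[i][j] * cd[k][l]   # 81 multiplies in total
        out.append(coeff * work[n])   # another 81 multiplies
        mults += 2
    return out, mults


if __name__ == "__main__":
    args = ([1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], list(range(1, 82)))
    print(naive(*args)[1], factored(*args)[1])
```

The two variants agree exactly on integer input; with floating point they can differ in the last bits because the products are associated differently, which is presumably why a compiler would only attempt such a rewrite under options like -ffast-math.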
Joost MODULE TEST IMPLICIT NONE INTEGER :: l INTEGER, PARAMETER :: dp=8 INTEGER, PARAMETER :: nco(0:3)=(/((l+1)*(l+2)/2,l=0,3)/) INTEGER, PARAMETER :: nso(0:3)=(/(2*l+1,l=0,3)/) CONTAINS SUBROUTINE contract__sparse(work, & nl_a, nl_b, nl_c, nl_d,& sphi_a, sphi_b, sphi_c, sphi_d,& primitives,& s_offset_a, s_offset_b, s_offset_c, s_offset_d) REAL(dp), DIMENSION(3*3*3*3), INTENT(IN) :: work INTEGER :: nl_a, nl_b, nl_c, nl_d REAL(dp), DIMENSION(3,3*nl_a), INTENT(IN) :: sphi_a REAL(dp), DIMENSION(3,3*nl_b), INTENT(IN) :: sphi_b REAL(dp), DIMENSION(3,3*nl_c), INTENT(IN) :: sphi_c REAL(dp), DIMENSION(3,3*nl_d), INTENT(IN) :: sphi_d REAL(dp), DIMENSION(3*nl_a, 3*nl_b,3*nl_c,3*nl_d) :: primitives INTEGER, INTENT(IN) :: s_offset_a, s_offset_b, s_offset_c, s_offset_d REAL(dp), DIMENSION(3* 3*3*3) :: buffer1, buffer2 INTEGER :: imax,jmax,kmax, ia, ib, ic, id, s_offset_a1, s_offset_b1, s_offset_c1, s_offset_d1,& i1 ,i2, i3, i, j, k s_offset_a1 = 0 DO ia = 1,nl_a s_offset_b1 = 0 DO ib = 1,nl_b s_offset_c1 = 0 DO ic = 1,nl_c s_offset_d1 = 0 DO id = 1,nl_d buffer1 = 0.0_dp imax=3*3*3 jmax=3 kmax=3 DO i=1,imax buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + work(1+(i-1)*kmax) * sphi_a(1,3+s_offset_a1) buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + work(2+(i-1)*kmax) * sphi_a(2,1+s_offset_a1) buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + work(3+(i-1)*kmax) * sphi_a(3,2+s_offset_a1) ENDDO buffer2 = 0.0_dp imax=3*3*3 jmax=3 kmax=3 DO i=1,imax buffer2(i+imax*(3-1)) = buffer2(i+imax*(3-1)) + buffer1(1+(i-1)*kmax) * sphi_b(1,3+s_offset_b1) buffer2(i+imax*(1-1)) = buffer2(i+imax*(1-1)) + buffer1(2+(i-1)*kmax) * sphi_b(2,1+s_offset_b1) buffer2(i+imax*(2-1)) = buffer2(i+imax*(2-1)) + buffer1(3+(i-1)*kmax) * sphi_b(3,2+s_offset_b1) ENDDO buffer1 = 0.0_dp imax=3*3*3 jmax=3 kmax=3 DO i=1,imax buffer1(i+imax*(3-1)) = buffer1(i+imax*(3-1)) + buffer2(1+(i-1)*kmax) * sphi_c(1,3+s_offset_c1) buffer1(i+imax*(1-1)) = buffer1(i+imax*(1-1)) + buffer2(2+(i-1)*kmax) * sphi_c(2,1+s_offset_c1) 
buffer1(i+imax*(2-1)) = buffer1(i+imax*(2-1)) + buffer2(3+(i-1)*kmax) * sphi_c(3,2+s_offset_c1) ENDDO imax=3*3*3 jmax=3 kmax=3 i = 0 DO i1=1,3 DO i2=1,3 DO i3=1,3 i = i + 1 primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) =& primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+3) & + buffer1(1+(i-1)*kmax) * sphi_d(1,3+s_offset_d1) primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) =& primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+1) & + buffer1(2+(i-1)*kmax) * sphi_d(2,1+s_offset_d1) primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) =& primitives(s_offset_a1+i3, s_offset_b1+i2, s_offset_c1+i1, s_offset_d1+2) & + buffer1(3+(i-1)*kmax) * sphi_d(3,2+s_offset_d1) ENDDO ENDDO ENDDO s_offset_d1 = s_offset_d1 + 3 END DO s_offset_c1 = s_offset_c1 + 3 END DO s_offset_b1 = s_off
Re: optimization question
thanks for the info I think it is useful to have a bugzilla here. will do. I tested 4.4, what did you test? 4.3 4.4 4.5 Joost
Re: optimization question
I think it is useful to have a bugzilla here.

will do. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40168

Btw, complete unrolling is also hindered by the artificial limit of maximally unrolling 16 iterations. Your inner loops iterate 27 times. Also by the artificial limit of the maximal unrolled size. With --param max-completely-peel-times=27 --param max-completely-peeled-insns=666 (values for trunk) the loops are unrolled at -O3.

Hmmm, but leading to slower code.
GCC performance with CP2K
I've just tested gcc/gfortran with CP2K, which some of you might know from PR29975 and other messages to the list, and observed some very pleasing evolution in the runtime of the code. In each case the set of compilation options is '-O2 -ffast-math -funroll-loops -ftree-vectorize -march=native' (-march=k8-sse3); the Intel reference uses '-O2 -xW -heap-arrays 64'.

version              subroutine  time[s]
out.intel:           CP2K        504.52
out.gfortran.4.2.3:  CP2K        601.35
out.gfortran.4.3.0:  CP2K        569.42
out.gfortran.4.4.0:  CP2K        508.12

I hope that this rate of improvement sets a standard up to gcc 4.95.3 ;-)

Thanks for your efforts...

Cheers,

Joost
bootstrap broken?
dwarf2out.c:13496: internal compiler error: in extract_insn, at recog.c:1988 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37045
CP2K gcc nightly benchmark / wwwdocs patch
A nightly tester has been set up to track the performance of the gcc/gfortran compiler (trunk) for typical CP2K runs. Results and code can be found at: http://cp2k.berlios.de/gfortran/ I'll consider your suggestions for improvements. The following patch could be applied to the wwwdocs Index: index.html === RCS file: /cvs/gcc/wwwdocs/htdocs/benchmarks/index.html,v retrieving revision 1.26 diff -r1.26 index.html 83a84,90 Joost VandeVondele runs a CP2K benchmark with mainline GCC. Results can be found at http://cp2k.berlios.de/gfortran/"; >http://cp2k.berlios.de/gfortran/. Cheers, Joost
vectorizer question
The attached testcase yields (on a core2 duo, gcc trunk):

gfortran -O3 -ftree-vectorize -ffast-math -march=native test.f90
time ./a.out
real 0m3.414s

ifort -xT -O3 test.f90
time ./a.out
real 0m1.556s

The assembly contains:

        ifort  gfortran
mulpd   140    0
mulsd   0      280

so the reason seems to be that ifort vectorizes the following code (full testcase attached):

SUBROUTINE collocate_core_6(res,coef_xyz,pol_x,pol_y,pol_z,cmax,kg,jg)
 IMPLICIT NONE
 INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
 integer, PARAMETER :: lp=6
 real(wp), INTENT(OUT):: res
 integer, INTENT(IN) :: cmax,kg,jg
 real(wp), INTENT(IN):: pol_x(0:lp,-cmax:cmax)
 real(wp), INTENT(IN):: pol_y(1:2,0:lp,-cmax:0)
 real(wp), INTENT(IN):: pol_z(1:2,0:lp,-cmax:0)
 real(wp), INTENT(IN):: coef_xyz(((lp+1)*(lp+2)*(lp+3))/6)
 real(wp) :: coef_xy(2,(lp+1)*(lp+2)/2)
 real(wp) :: coef_x(4,0:lp)
[...]
 coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,12)*pol_y(1,1,jg)
 coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,12)*pol_y(2,1,jg)
 coef_x(1:2,5)=coef_x(1:2,5)+coef_xy(1:2,13)*pol_y(1,1,jg)
 coef_x(3:4,5)=coef_x(3:4,5)+coef_xy(1:2,13)*pol_y(2,1,jg)
 coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,14)*pol_y(1,2,jg)
 coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,14)*pol_y(2,2,jg)
 coef_x(1:2,1)=coef_x(1:2,1)+coef_xy(1:2,15)*pol_y(1,2,jg)
 coef_x(3:4,1)=coef_x(3:4,1)+coef_xy(1:2,15)*pol_y(2,2,jg)
 coef_x(1:2,2)=coef_x(1:2,2)+coef_xy(1:2,16)*pol_y(1,2,jg)
 coef_x(3:4,2)=coef_x(3:4,2)+coef_xy(1:2,16)*pol_y(2,2,jg)
 coef_x(1:2,3)=coef_x(1:2,3)+coef_xy(1:2,17)*pol_y(1,2,jg)
 coef_x(3:4,3)=coef_x(3:4,3)+coef_xy(1:2,17)*pol_y(2,2,jg)
 coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,18)*pol_y(1,2,jg)
 coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,18)*pol_y(2,2,jg)
 coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,19)*pol_y(1,3,jg)
 coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,19)*pol_y(2,3,jg)
[...]

Either it is able to interpret the short vectors as such, or it realizes that these very short implicit loops are nevertheless favourable for vectorization.
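Counts like the mulpd/mulsd table above can be gathered from a compiler's assembly listing with a short script. The Python sketch below is one hypothetical way to do it, assuming AT&T-style output with the mnemonic as the first word on a line and '#' starting a comment; the helper name and the sample listing are made up.

```python
from collections import Counter


def count_mnemonics(asm_text, wanted):
    """Count how often each mnemonic in `wanted` starts an instruction line."""
    counts = Counter()
    for line in asm_text.splitlines():
        # Drop trailing comments, then look at the first word only.
        fields = line.split("#")[0].split()
        if fields and fields[0] in wanted:
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Made-up listing, just to show the shape of the input.
    sample = """
.L2:
        movsd   (%rdi), %xmm0
        mulsd   %xmm1, %xmm0    # scalar multiply
        mulpd   %xmm2, %xmm3    # packed multiply
        mulsd   %xmm4, %xmm5
"""
    print(count_mnemonics(sample, {"mulpd", "mulsd"}))
```

An instruction that shares a line with a label would be missed by this simple split, so the numbers are a rough indicator rather than an exact count.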
Is there a trick to get gcc to vectorize these loops, or is some technology missing for this? Should I file a PR for this (this is somewhat similar to PR31079 and PR31021)?

Thanks in advance,

Joost

SUBROUTINE collocate_core_6(res,coef_xyz,pol_x,pol_y,pol_z,cmax,kg,jg)
  IMPLICIT NONE
  INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
  integer, PARAMETER :: lp=6
  real(wp), INTENT(OUT):: res
  integer, INTENT(IN) :: cmax,kg,jg
  real(wp), INTENT(IN):: pol_x(0:lp,-cmax:cmax)
  real(wp), INTENT(IN):: pol_y(1:2,0:lp,-cmax:0)
  real(wp), INTENT(IN):: pol_z(1:2,0:lp,-cmax:0)
  real(wp), INTENT(IN):: coef_xyz(((lp+1)*(lp+2)*(lp+3))/6)
  real(wp) :: coef_xy(2,(lp+1)*(lp+2)/2)
  real(wp) :: coef_x(4,0:lp)
  coef_xy=0.0_wp
  coef_xy(:,1)=coef_xy(:,1)+coef_xyz(1)*pol_z(:,0,kg)
  coef_xy(:,2)=coef_xy(:,2)+coef_xyz(2)*pol_z(:,0,kg)
  coef_xy(:,3)=coef_xy(:,3)+coef_xyz(3)*pol_z(:,0,kg)
  coef_xy(:,4)=coef_xy(:,4)+coef_xyz(4)*pol_z(:,0,kg)
  coef_xy(:,5)=coef_xy(:,5)+coef_xyz(5)*pol_z(:,0,kg)
  coef_xy(:,6)=coef_xy(:,6)+coef_xyz(6)*pol_z(:,0,kg)
  coef_xy(:,7)=coef_xy(:,7)+coef_xyz(7)*pol_z(:,0,kg)
  coef_xy(:,8)=coef_xy(:,8)+coef_xyz(8)*pol_z(:,0,kg)
  coef_xy(:,9)=coef_xy(:,9)+coef_xyz(9)*pol_z(:,0,kg)
  coef_xy(:,10)=coef_xy(:,10)+coef_xyz(10)*pol_z(:,0,kg)
  coef_xy(:,11)=coef_xy(:,11)+coef_xyz(11)*pol_z(:,0,kg)
  coef_xy(:,12)=coef_xy(:,12)+coef_xyz(12)*pol_z(:,0,kg)
  coef_xy(:,13)=coef_xy(:,13)+coef_xyz(13)*pol_z(:,0,kg)
  coef_xy(:,14)=coef_xy(:,14)+coef_xyz(14)*pol_z(:,0,kg)
  coef_xy(:,15)=coef_xy(:,15)+coef_xyz(15)*pol_z(:,0,kg)
  coef_xy(:,16)=coef_xy(:,16)+coef_xyz(16)*pol_z(:,0,kg)
  coef_xy(:,17)=coef_xy(:,17)+coef_xyz(17)*pol_z(:,0,kg)
  coef_xy(:,18)=coef_xy(:,18)+coef_xyz(18)*pol_z(:,0,kg)
  coef_xy(:,19)=coef_xy(:,19)+coef_xyz(19)*pol_z(:,0,kg)
  coef_xy(:,20)=coef_xy(:,20)+coef_xyz(20)*pol_z(:,0,kg)
  coef_xy(:,21)=coef_xy(:,21)+coef_xyz(21)*pol_z(:,0,kg)
  coef_xy(:,22)=coef_xy(:,22)+coef_xyz(22)*pol_z(:,0,kg)
  coef_xy(:,23)=coef_xy(:,23)+coef_xyz(23)*pol_z(:,0,kg)
  coef_xy(:,24)=coef_xy(:,24)+coef_xyz(24)*pol_z(:,0,kg)
  coef_xy(:,25)=coef_xy(:,25)+coef_xyz(25)*pol_z(:,0,kg)
  coef_xy(:,26)=coef_xy(:,26)+coef_xyz(26)*pol_z(:,0,kg)
  coef_xy(:,27)=coef_xy(:,27)+coef_xyz(27)*pol_z(:,0,kg)
  coef_xy(:,28)=coef_xy(:,28)+coef_xyz(28)*pol_z(:,0,kg)
  coef_xy(:,1)=coef_xy(:,1)+coef_xyz(29)*pol_z(:,1,kg)
  coef_xy(:,2)=coef_xy(:,2)+coef_xyz(30)*pol_z(:,1,kg)
  coef_xy(:,3)=coef_xy(:,3)+coef_xyz(31)*pol_z(:,1,kg)
  coef_xy(:,4)=coef_xy(:,4)+coef_xyz(32)*pol_z(:,1,kg)
  coef_xy(:,5)=coef_xy(:,5)+coef_xyz(33)*pol_z(:,1,kg)
  coef_xy(:,6)=coef_xy(:,6)+coef_xyz(34)*pol_z(:,1,kg)
  coef_xy(:,8)=coef_xy(:,8)+coef_xyz(35)*pol_z(:,1,kg)
  coef_xy(:,9)=coef_xy(:,9)+coef_xyz(36)*pol_z(:,1,kg)
Re: vectorizer question
> It would be nice to have a stand-alone testcase for this, so please file a bugreport.

I've opened PR37150 for this.

Thanks,

Joost