Attribute visibility
Hello,

In configuring gcc, there is a check to determine whether the linker supports .hidden (STV_HIDDEN visibility). It determines support by checking for a date in the linker's version string. However, the linker's version string is constructed without a date (see the bfd/Makefile.am recipe for bfdver.h). I am enclosing the lines I am referring to in this email.

It looks like there is a disconnect. This results in a linker that does support .hidden being incorrectly identified as one that doesn't. This is with GCC 8.3.1 and Binutils 2.31.1.

Is this known/intended?

Thanks,
Visda

gcc/configure:

if test $in_tree_ld != yes ; then
  ld_ver=`$gcc_cv_ld --version 2>/dev/null | sed 1q`
  if echo "$ld_ver" | grep GNU > /dev/null; then
    if test x"$ld_is_gold" = xyes; then
      # GNU gold --version looks like this:
      #
      # GNU gold (GNU Binutils 2.21.51.20110225) 1.11
      #
      # We extract the binutils version which is more familiar and specific
      # than the gold version.
      ld_vers=`echo $ld_ver | sed -n \
          -e 's,^[^)]*[ ]\([0-9][0-9]*\.[0-9][0-9]*[^)]*\)) .*$,\1,p'`
    else
      # GNU ld --version looks like this:
      #
      # GNU ld (GNU Binutils) 2.21.51.20110225
      ld_vers=`echo $ld_ver | sed -n \
          -e 's,^.*[ ]\([0-9][0-9]*\.[0-9][0-9]*.*\)$,\1,p'`
    fi
    ld_date=`echo $ld_ver | sed -n 's,^.*\([2-9][0-9][0-9][0-9]\)[-]*\([01][0-9]\)[-]*\([0-3][0-9]\).*$,\1\2\3,p'`   <--- looking for a YYYY-MM-DD date
    ld_vers_major=`expr "$ld_vers" : '\([0-9]*\)'`
    ld_vers_minor=`expr "$ld_vers" : '[0-9]*\.\([0-9]*\)'`
    ld_vers_patch=`expr "$ld_vers" : '[0-9]*\.[0-9]*\.\([0-9]*\)'`
  else
    case "${target}" in
      *-*-solaris2*)
        # Solaris 2 ld -V output looks like this for a regular version:
        ...

{ $as_echo "$as_me:${as_lineno-$LINENO}: checking linker for .hidden support" >&5
$as_echo_n "checking linker for .hidden support... " >&6; }
if test "${gcc_cv_ld_hidden+set}" = set; then :
  $as_echo_n "(cached) " >&6
else
  if test $in_tree_ld = yes ; then
    gcc_cv_ld_hidden=no
    if test "$gcc_cv_gld_major_version" -eq 2 -a "$gcc_cv_gld_minor_version" -ge 13 -o "$gcc_cv_gld_major_version" -gt 2 \
       && test $in_tree_ld_is_elf = yes; then
      gcc_cv_ld_hidden=yes
    fi
  else
    gcc_cv_ld_hidden=yes
    if test x"$ld_is_gold" = xyes; then
      :
    elif echo "$ld_ver" | grep GNU > /dev/null; then
      case "${target}" in
        mmix-knuth-mmixware)
          # The linker emits by default mmo, not ELF, so "no" is appropriate.
          gcc_cv_ld_hidden=no
          ;;
      esac
      if test 0"$ld_date" -lt 20020404; then   <--- a linker released before 2002-04-04 doesn't support .hidden
        if test -n "$ld_date"; then
          # If there was date string, but was earlier than 2002-04-04, fail
          gcc_cv_ld_hidden=no

bfd/Makefile.am (the default in development.sh is development=false):

bfdver.h: $(srcdir)/version.h $(srcdir)/development.sh $(srcdir)/Makefile.in
	@echo "creating $@"
	@bfd_version=`echo "$(VERSION)" | $(SED) -e 's/\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\).*/\1.00\2.00\3.00\4.00\5/' -e 's/\([^\.]*\)\..*\(..\)\..*\(..\)\..*\(..\)\..*\(..\)$$/\1\2\3\4\5/'` ;\
	bfd_version_string="\"$(VERSION)\"" ;\
	bfd_soversion="$(VERSION)" ;\
	bfd_version_package="\"$(PKGVERSION)\"" ;\
	report_bugs_to="\"$(REPORT_BUGS_TO)\"" ;\
	. $(srcdir)/development.sh ;\
	if test "$$development" = true ; then \
	  bfd_version_date=`$(SED) -n -e 's/.*DATE //p' < $(srcdir)/version.h` ;\
	  bfd_version_string="\"$(VERSION).$${bfd_version_date}\"" ;\
	  bfd_soversion="$(VERSION).$${bfd_version_date}" ;\
	fi ;\
	$(SED) -e "s,@bfd_version@,$$bfd_version," \
	    -e "s,@bfd_version_string@,$$bfd_version_string," \
	    -e "s,@bfd_version_package@,$$bfd_version_package," \
	    -e "s,@report_bugs_to@,$$report_bugs_to," \
	    < $(srcdir)/version.h > $@; \
	echo "$${bfd_soversion}" > libtool-soversion
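To make the disconnect concrete, here is a minimal C sketch, not the actual configure code (which uses the sed expression above), of the same date-matching idea: a binutils snapshot version string carries a YYYYMMDD date, while a release-style string such as the one the non-development bfdver.h recipe produces does not, so ld_date comes out empty.

/* Minimal sketch of the ld_date extraction, simplified to an eight-digit
 * YYYYMMDD run (the real sed pattern also accepts '-' separators). */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *extract_ld_date(const char *ver)
{
    static char date[9];

    for (const char *p = ver; *p; p++) {
        int ok = (p[0] >= '2' && p[0] <= '9');      /* year starts with 2-9 */
        for (int i = 1; ok && i < 8; i++)
            ok = isdigit((unsigned char)p[i]) != 0; /* seven more digits    */
        if (ok && (p[4] == '0' || p[4] == '1')      /* month 01..12         */
               && (p[6] >= '0' && p[6] <= '3')) {   /* day   01..31         */
            memcpy(date, p, 8);
            date[8] = '\0';
            return date;
        }
    }
    return NULL;                                    /* no date in string    */
}

int main(void)
{
    const char *snapshot = "GNU ld (GNU Binutils) 2.21.51.20110225";
    const char *release  = "GNU ld (GNU Binutils) 2.31.1";
    const char *d;

    d = extract_ld_date(snapshot);
    printf("%-40s -> ld_date=%s\n", snapshot, d ? d : "(empty)");
    d = extract_ld_date(release);
    printf("%-40s -> ld_date=%s\n", release, d ? d : "(empty)");
    return 0;
}

Running this prints a date only for the snapshot-style string; the release-style string, which is what a non-development binutils build produces, yields an empty ld_date.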
-fstack-usage and -flto the stack usage report is generated
Hello,

With link time optimization, the stack usage information is determined during the local transformation (LTRANS) stage and written to program.ltrans0.ltrans.su; there will be one .su file for each partition.

All ltrans files, including the .su files, are removed unless -save-temps is given.

Although it is not obvious and not documented, I am assuming this is working as designed, and that developers should pass -save-temps along with -fstack-usage and -flto.

Or is this a bug? That is, should an exception be made for the ltrans .su files so that they are kept around regardless?

Thanks,
Visda
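A concrete way to observe this (the file name and the command below are just an example; the per-partition .su naming follows the program.ltrans0.ltrans.su pattern described above):

/* stack_demo.c - hypothetical example for reproducing the behaviour
 * described above.
 *
 * Without LTO, -fstack-usage writes the report next to the object file
 * (stack_demo.su).  With -flto, the stack usage is computed at LTRANS
 * time and written per partition (e.g. a *.ltrans0.ltrans.su file), and
 * those files are deleted along with the other LTRANS temporaries
 * unless -save-temps is also given:
 *
 *   gcc -O2 -flto -fstack-usage -save-temps stack_demo.c -o stack_demo
 */
#include <stdio.h>

static int sum_buffer(void)
{
    int buf[64];                /* gives the frame a visible stack size */
    int total = 0;

    for (int i = 0; i < 64; i++) {
        buf[i] = i * i;
        total += buf[i];
    }
    return total;
}

int main(void)
{
    printf("%d\n", sum_buffer());
    return 0;
}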
Re: -fstack-usage and -flto the stack usage report is generated
On 2021-02-01, 3:19 AM, "Martin Liška" wrote:

> On 1/29/21 3:57 PM, Visda.Vokhshoori--- via Gcc wrote:
> > Hello,
>
> Hi.
>
> > With link time optimization, the stack usage information is determined
> > during the local transformation (LTRANS) stage and written to
> > program.ltrans0.ltrans.su; there will be one .su file for each partition.
> >
> > All ltrans files, including the .su files, are removed unless -save-temps
> > is given.
> >
> > Although it is not obvious and not documented, I am assuming this is
> > working as designed, and that developers should pass -save-temps along
> > with -fstack-usage and -flto.
>
> Thank you for the report. It seems it's a bug, please report it to
> bugzilla.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98922

> Martin
>
> > Or is this a bug? That is, should an exception be made for the ltrans
> > .su files so that they are kept around regardless?
> >
> > Thanks,
> > Visda

Thanks,
Visda
22% degradation seen in embench:matmult-int
Embench is used for benchmarking on embedded devices. One of its projects, matmult-int, has a function Multiply that multiplies two 20 x 20 matrices.

The device is an ATSAME70Q21B, which is a Cortex-M7.
The compiler is the Arm branch, based on GCC 13.
We are compiling with -O3, which enables the loop-interchange pass by default.

When we compile with -fno-loop-interchange we get all 22% back, plus a 5% speed-up.

When we do the loop interchange on the one loop nest that gets interchanged, it is slightly (0.7%) faster.

Has anyone else seen a large degradation as a result of loop interchange?

Thanks
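For context, Multiply is the classical three-deep matrix-multiply loop nest. A rough sketch follows (this is not the verbatim benchmark source; the exact code is linked later in the thread):

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

/* Rough sketch of the benchmark's Multiply: classical Outer/Inner/Index
 * matrix multiply.  The loop-interchange pass swaps the two innermost
 * loops so that Res and B are walked contiguously along rows instead of
 * B being accessed with a stride of UPPERLIMIT. */
void Multiply(matrix A, matrix B, matrix Res)
{
    int Outer, Inner, Index;

    for (Outer = 0; Outer < UPPERLIMIT; Outer++)
        for (Inner = 0; Inner < UPPERLIMIT; Inner++) {
            Res[Outer][Inner] = 0;
            for (Index = 0; Index < UPPERLIMIT; Index++)
                Res[Outer][Inner] += A[Outer][Index] * B[Index][Inner];
        }
}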
Re: 22% degradation seen in embench:matmult-int
* When we do the loop interchange on the one loop nest that gets interchanged in the program source, it is slightly (0.7%) faster.

From: Gcc on behalf of Visda.Vokhshoori--- via Gcc
Date: Wednesday, February 12, 2025 at 10:38 AM
To: gcc@gcc.gnu.org
Subject: 22% degradation seen in embench:matmult-int

Embench is used for benchmarking on embedded devices. One of its projects, matmult-int, has a function Multiply that multiplies two 20 x 20 matrices.

The device is an ATSAME70Q21B, which is a Cortex-M7.
The compiler is the Arm branch, based on GCC 13.
We are compiling with -O3, which enables the loop-interchange pass by default.

When we compile with -fno-loop-interchange we get all 22% back, plus a 5% speed-up.

When we do the loop interchange on the one loop nest that gets interchanged, it is slightly (0.7%) faster.

Has anyone else seen a large degradation as a result of loop interchange?

Thanks
Re: 22% degradation seen in embench:matmult-int
"tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];"

When I compared the assembly of the loop, these extra statements are in the innermost loop:

  400968:  2800   cmp    r0, #0
  400972:  bf08   it     eq
  400974:  2300   moveq  r3, #0

R3 being Res; it is the statement you have above.

"a CPU uarch with caches and HW prefetching where linear accesses are a lot more efficient than strided ones - that might not hold at all for the Cortex-M7."

Yes, that's it. Thanks a lot for your help!

From: Richard Biener
Date: Friday, February 14, 2025 at 2:26 AM
To: Visda Vokhshoori - C51841
Cc: gcc@gcc.gnu.org
Subject: Re: 22% degradation seen in embench:matmult-int

On Thu, Feb 13, 2025 at 9:30 PM wrote:
>
> "the interchanged loop might for example no longer vectorize."
>
> The loops are not vectorized, which is OK because this device doesn't have
> support for it.
>
> I just don't think a pass could single-handedly make code that much slower.
>
> Loop interchange is supposed to swap the loop nest's inner index with the
> outer index to improve cache locality. This is supposed to help - that is,
> on the next iteration we will have the data available in cache.
>
> The benchmark source, and the loop that gets interchanged, is at line 143:
>
> Source:
> https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

Looks like the classical matmul loop, similar to the one in SPEC CPU
bwaves. We do apply interchange here and that looks reasonable to me.
Note interchange assumes a CPU uarch with caches and HW prefetching
where linear accesses are a lot more efficient than strided ones - that
might not hold at all for the Cortex-M7.

Without interchange the store to Res[] can be moved out of the inner
loop. I've tried

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

void Multiply (matrix A, matrix B, long * __restrict Res)
{
  register int Outer, Inner, Index;
  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        (*(matrix *)Res)[Outer][Inner] = 0;
        for (Index = 0; Index < UPPERLIMIT; Index++)
          (*(matrix *)Res)[Outer][Inner] += A[Outer][Index] * B[Index][Inner];
      }
}

and this is interchanged on x86_64 as well. We are implementing a trick
for the zeroing which, when moved into innermost position, is done as

  for (Index = 0; Index < UPPERLIMIT; Index++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];
        tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }

this conditional might kill performance for you. The advantage is that
this loop can now be more efficiently vectorized.

> This loop is where most of the time is spent. But it would have been good
> if I had access to h/w tracing to see whether the interchanged loop reduces
> cache misses, as well as to see what is causing it to run this much slower.
>
> Thanks for your reply!
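For comparison, the "loop interchange in the program source" mentioned earlier in the thread can keep the zeroing in its own pass. The sketch below is hypothetical, based on the variant Richard quotes above, and shows an interchanged form whose innermost body has no Index == 0 conditional:

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

/* Hand-interchanged sketch: zero Res in a separate pass, then run the
 * interchanged accumulation.  Unlike the form the pass generates, the
 * innermost body carries no Index == 0 conditional, so there is no
 * cmp/it/moveq sequence in the hot loop. */
void Multiply_by_hand(matrix A, matrix B, long *__restrict Res)
{
    int Outer, Inner, Index;

    for (Outer = 0; Outer < UPPERLIMIT; Outer++)
        for (Inner = 0; Inner < UPPERLIMIT; Inner++)
            (*(matrix *)Res)[Outer][Inner] = 0;

    for (Outer = 0; Outer < UPPERLIMIT; Outer++)
        for (Index = 0; Index < UPPERLIMIT; Index++)
            for (Inner = 0; Inner < UPPERLIMIT; Inner++)
                (*(matrix *)Res)[Outer][Inner] +=
                    A[Outer][Index] * B[Index][Inner];
}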
Re: 22% degradation seen in embench:matmult-int
"the interchanged loop might for example no longer vectorize."

The loops are not vectorized, which is OK because this device doesn't have support for it.

I just don't think a pass could single-handedly make code that much slower.

Loop interchange is supposed to swap the loop nest's inner index with the outer index to improve cache locality. This is supposed to help - that is, on the next iteration we will have the data available in cache.

The benchmark source, and the loop that gets interchanged, is at line 143:

Source:
https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

This loop is where most of the time is spent. But it would have been good if I had access to h/w tracing to see whether the interchanged loop reduces cache misses, as well as to see what is causing it to run this much slower.

Thanks for your reply!

From: Richard Biener
Date: Thursday, February 13, 2025 at 2:57 AM
To: Visda Vokhshoori - C51841
Cc: gcc@gcc.gnu.org
Subject: Re: 22% degradation seen in embench:matmult-int

On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc wrote:
>
> Embench is used for benchmarking on embedded devices. One of its projects,
> matmult-int, has a function Multiply that multiplies two 20 x 20 matrices.
> The device is an ATSAME70Q21B, which is a Cortex-M7.
> The compiler is the Arm branch, based on GCC 13.
> We are compiling with -O3, which enables the loop-interchange pass by default.
>
> When we compile with -fno-loop-interchange we get all 22% back, plus a 5%
> speed-up.
>
> When we do the loop interchange on the one loop nest that gets interchanged,
> it is slightly (0.7%) faster.
>
> Has anyone else seen a large degradation as a result of loop interchange?

I would suggest to compare the -fopt-info diagnostic output with and
without -fno-loop-interchange; the interchanged loop might for example
no longer vectorize. Other than that - no, loop interchange isn't
applied very often and it has a very conservative cost model.

Are you able to share a testcase?

Richard.

> Thanks