Attribute visibility

2020-03-16 Thread Visda.Vokhshoori--- via Gcc
 Hello,

In configuring gcc, there is a check to determine whether the linker has
support for STV_HIDDEN.  It determines support by looking for a date in the
linker's version string.
On the other hand, the linker's version string is constructed without a date;
see the bfd/Makefile.am recipe for bfdver.h.
I am enclosing the lines I am referring to in this email.
It looks like there's a disconnect.  The result is that a linker with support
for STV_HIDDEN is incorrectly identified as one that does not support it.
This is from GCC 8.3.1 and Binutils 2.31.1.  Is this known/intended?

Thanks,
Visda

gcc/configure

if test $in_tree_ld != yes ; then
  ld_ver=`$gcc_cv_ld --version 2>/dev/null | sed 1q`
  if echo "$ld_ver" | grep GNU > /dev/null; then
    if test x"$ld_is_gold" = xyes; then
      # GNU gold --version looks like this:
      #
      # GNU gold (GNU Binutils 2.21.51.20110225) 1.11
      #
      # We extract the binutils version which is more familiar and specific
      # than the gold version.
      ld_vers=`echo $ld_ver | sed -n \
          -e 's,^[^)]*[  ]\([0-9][0-9]*\.[0-9][0-9]*[^)]*\)) .*$,\1,p'`
    else
      # GNU ld --version looks like this:
      #
      # GNU ld (GNU Binutils) 2.21.51.20110225
      ld_vers=`echo $ld_ver | sed -n \
          -e 's,^.*[ ]\([0-9][0-9]*\.[0-9][0-9]*.*\)$,\1,p'`
    fi
    ld_date=`echo $ld_ver | sed -n \
        's,^.*\([2-9][0-9][0-9][0-9]\)[-]*\([01][0-9]\)[-]*\([0-3][0-9]\).*$,\1\2\3,p'`   <--- looking for a YYYY-MM-DD date
    ld_vers_major=`expr "$ld_vers" : '\([0-9]*\)'`
    ld_vers_minor=`expr "$ld_vers" : '[0-9]*\.\([0-9]*\)'`
    ld_vers_patch=`expr "$ld_vers" : '[0-9]*\.[0-9]*\.\([0-9]*\)'`
  else
    case "${target}" in
      *-*-solaris2*)
        # Solaris 2 ld -V output looks like this for a regular version:

...
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking linker for .hidden support" >&5
$as_echo_n "checking linker for .hidden support... " >&6; }
if test "${gcc_cv_ld_hidden+set}" = set; then :
  $as_echo_n "(cached) " >&6
else
  if test $in_tree_ld = yes ; then
    gcc_cv_ld_hidden=no
    if test "$gcc_cv_gld_major_version" -eq 2 -a "$gcc_cv_gld_minor_version" -ge 13 \
         -o "$gcc_cv_gld_major_version" -gt 2 \
       && test $in_tree_ld_is_elf = yes; then
      gcc_cv_ld_hidden=yes
    fi
  else
    gcc_cv_ld_hidden=yes
    if test x"$ld_is_gold" = xyes; then
      :
    elif echo "$ld_ver" | grep GNU > /dev/null; then
      case "${target}" in
        mmix-knuth-mmixware)
          # The linker emits by default mmo, not ELF, so "no" is appropriate.
          gcc_cv_ld_hidden=no
          ;;
      esac
      if test 0"$ld_date" -lt 20020404; then   <-- a linker released before 20020404 doesn't support .hidden
        if test -n "$ld_date"; then
          # If there was date string, but was earlier than 2002-04-04, fail
          gcc_cv_ld_hidden=no


bfd/Makefile.am
The default in development.sh is development=false.

bfdver.h: $(srcdir)/version.h $(srcdir)/development.sh $(srcdir)/Makefile.in
        @echo "creating $@"
        @bfd_version=`echo "$(VERSION)" | $(SED) -e 's/\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\)\.*\([^\.]*\).*/\1.00\2.00\3.00\4.00\5/' -e 's/\([^\.]*\)\..*\(..\)\..*\(..\)\..*\(..\)\..*\(..\)$$/\1\2\3\4\5/'` ;\
        bfd_version_string="\"$(VERSION)\"" ;\
        bfd_soversion="$(VERSION)" ;\
        bfd_version_package="\"$(PKGVERSION)\"" ;\
        report_bugs_to="\"$(REPORT_BUGS_TO)\"" ;\
        . $(srcdir)/development.sh ;\
        if test "$$development" = true ; then \
          bfd_version_date=`$(SED) -n -e 's/.*DATE //p' < $(srcdir)/version.h` ;\
          bfd_version_string="\"$(VERSION).$${bfd_version_date}\"" ;\
          bfd_soversion="$(VERSION).$${bfd_version_date}" ;\
        fi ;\
        $(SED) -e "s,@bfd_version@,$$bfd_version," \
            -e "s,@bfd_version_string@,$$bfd_version_string," \
            -e "s,@bfd_version_package@,$$bfd_version_package," \
            -e "s,@report_bugs_to@,$$report_bugs_to," \
            < $(srcdir)/version.h > $@; \
        echo "$${bfd_soversion}" > libtool-soversion
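
To illustrate the disconnect (the exact strings below are only examples): with
development=true the recipe appends the DATE from version.h, so ld --version
reports something like "GNU ld (GNU Binutils) 2.31.51.20180720" and configure's
sed extracts ld_date=20180720.  With development=false, as in a release like
2.31.1, the version string is just "GNU ld (GNU Binutils) 2.31.1"; the date
pattern matches nothing and ld_date ends up empty.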



-fstack-usage and -flto the stack usage report is generated

2021-01-29 Thread Visda.Vokhshoori--- via Gcc

Hello,

With link-time optimization the stack usage information is determined during
the local transformations (ltrans) and written to program.ltrans0.ltrans.su.
There will be one .su file for each partition.

All ltrans files, including the .su files, are removed unless -save-temps is
given.

Although this is not obvious and not documented, I am assuming it is working
as designed.  Developers should pass -save-temps along with -fstack-usage and
-flto.
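
For example (the file names are only illustrative): building with
"gcc -O2 -flto -fstack-usage -save-temps -o program a.c b.c" leaves
program.ltrans0.ltrans.su (along with the other ltrans intermediates) in the
build directory, while the same command without -save-temps removes them when
the link finishes.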

Or is this a bug?  That is, should an exception be made for the .ltrans.su
files so that they are kept regardless?

Thanks,
Visda


Re: -fstack-usage and -flto the stack usage report is generated

2021-02-01 Thread Visda.Vokhshoori--- via Gcc
On 2021-02-01, 3:19 AM, "Martin Liška"  wrote:

On 1/29/21 3:57 PM, Visda.Vokhshoori--- via Gcc wrote:
>
> Hello,

Hi.

>
> With link time optimization the stack usage information is determined 
during local transformation and written to program.ltrans0.ltrans.su.  There 
will be one .su file for each partition.
>
> All ltrans files, including the .su, are removed unless –save-temps is 
indicated.
>
> Although not obvious/not documented, but I am assuming this is working as 
designed.  Developers should include –save-temps along with -fstack-suage and 
-flto.

>>Thank you for the report. It seems it's a bug, please report it to 
bugzilla.

 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98922

>>Martin

>
> Or is this a bug?  That is exception should be made for the ltrans.su 
files and they should be kept around regardless.
>
> Thanks,
> Visda
>

Thanks,
Visda



22% degradation seen in embench:matmult-int

2025-02-12 Thread Visda.Vokhshoori--- via Gcc
Embench is used for benchmarking on embedded devices.
One of its projects, matmult-int, has a function Multiply.  It is a matrix
multiplication for 20 x 20 matrices.
The device is an ATSAME70Q21B, which is a Cortex-M7.
The compiler is the Arm branch based on GCC version 13.
We are compiling with -O3, which has the loop-interchange pass on by default.

When we compile with -fno-loop-interchange we get the 22% back, plus a 5%
speed-up.

When we do the loop interchange on the one loop nest that gets interchanged,
it is slightly (0.7%) faster.

Has anyone else seen a large degradation as a result of loop interchange?

Thanks


Re: 22% degradation seen in embench:matmult-int

2025-02-12 Thread Visda.Vokhshoori--- via Gcc
* When we do the loop interchange on the one loop nest that gets interchanged
in the program source, it is slightly (0.7%) faster.


From: Gcc  on behalf of 
Visda.Vokhshoori--- via Gcc 
Date: Wednesday, February 12, 2025 at 10:38 AM
To: gcc@gcc.gnu.org 
Subject: 22% degradation seen in embench:matmult-int

Embench is used for benchmarking on embedded devices.
This one project matmult-int has a function Multiply.  It’s a matrix 
multiplication for 20 x 20 matrix.
The device is a ATSAME70Q21B which is Cortex-M7
The compiler is arm branch based on GCC version 13
We are compiling with O3 which has loop-interchange pass on by default.

When we compile with -fno-loop-interchange we get all 22% back plus 5% speed up.

When we do the loop interchange on the one loop nest that get interchanged it 
is slightly (.7%) faster.

Has anyone else seen large degradation as a result of loop interchange?

Thanks


Re: 22% degradation seen in embench:matmult-int

2025-02-14 Thread Visda.Vokhshoori--- via Gcc
“tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];”

When I compared the assembly for the loop, these extra instructions appear in
the innermost loop:

  400968:   2800        cmp     r0, #0
  400972:   bf08        it      eq
  400974:   2300        moveq   r3, #0

R3 holds the value of Res[Outer][Inner].
These instructions implement the statement you have above.

“a CPU uarch with caches and HW prefetching where linear accesses are a lot more
efficient than strided ones - that might not hold at all for the
Cortex-M7.”

Yes that’s it.

Thanks a lot for your help!

From: Richard Biener 
Date: Friday, February 14, 2025 at 2:26 AM
To: Visda Vokhshoori - C51841 
Cc: gcc@gcc.gnu.org 
Subject: Re: 22% degradation seen in embench:matmult-int

On Thu, Feb 13, 2025 at 9:30 PM  wrote:
>
>
>
> “the interchanged loop might for example no longer vectorize.”
>
>
>
> The loops are not vectorized.  Which is ok, because this device doesn’t have 
> the support for it.
>
> I just don’t think a pass could single handedly make code slower that much.
>
>
>
> Loop interchange is supposed to interchange the loop nest index with outer 
> index to improve cache locality.  This is supposed to help -that is the next 
> iteration we will have the data available in cache.
>
>
>
> The benchmark source –and  the loop that gets interchanged is line 143
>
>
>
> Source: 
> https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

Looks like the classical matmul loop, similar to the one in SPEC CPU
bwaves.  We do
apply interchange here and that looks reasonable to me.  Note
interchange assumes
a CPU uarch with caches and HW prefetching where linear accesses are a lot more
efficient than strided ones - that might not hold at all for the
Cortex-M7.  Without
interchange the store to Res[] can be moved out of the inner loop.

I've tried

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];
void
Multiply (matrix A, matrix B, long * __restrict Res)
{
  register int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
for (Inner = 0; Inner < UPPERLIMIT; Inner++)
  {
(*(matrix *)Res)[Outer][Inner] = 0;
for (Index = 0; Index < UPPERLIMIT; Index++)
  (*(matrix *)Res)[Outer][Inner] += A[Outer][Index] * B[Index][Inner];
  }
}

and this is interchanged on x86_64 as well.  We are implementing a trick
for the zeroing which, when moved into innermost position is done as

  for (Index = 0; Index < UPPERLIMIT; Index++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];
        tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }

this conditional might kill performance for you.  The advantage is that this
loop can now be more efficiently vectorized.



>
>
> This loop is where most of the time is spent. But it would have been good if 
> I had access to h/w tracing to see if the interchanged loop reduces cache 
> misses as well as to see what is causing it to run this much slower.
>
>
>
> Thanks for your reply!
>
>
>
> From: Richard Biener 
> Date: Thursday, February 13, 2025 at 2:57 AM
> To: Visda Vokhshoori - C51841 
> Cc: gcc@gcc.gnu.org 
> Subject: Re: 22% degradation seen in embench:matmult-int
>
> On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc
>  wrote:
> >
> > Embench is used for benchmarking on embedded devices.
> > This one project matmult-int has a function Multiply.  It’s a matrix 
> > multiplication for 20 x 20 matrix.
> > The device is a ATSAME70Q21B which is Cortex-M7
> > The compiler is arm branch based on GCC version 13
> > We are compiling with O3 which has loop-interchange pass on by default.
> >
> > When we compile with -fno-loop-interchange we get all 22% back plus 5% 
> > speed up.
> >
> > When we do the loop interchange on the one loop nest that get interchanged 
> > it is slightly (.7%) faster.
> >
> > Has anyone else seen large degradation as a result of loop interchange?
>
> I would suggest to compare the -fopt-info diagnostic output with and
> without -fno-loop-interchange,
> the interchanged loop might for example no longer vectorize.  Other
> than that - no, loop interchange
> isn't applied very often and it has a very conservative cost model.
>
> Are you able to share a testcase?
>
> Richard.
>
> >
> > Thanks


Re: 22% degradation seen in embench:matmult-int

2025-02-13 Thread Visda.Vokhshoori--- via Gcc

“the interchanged loop might for example no longer vectorize.”

The loops are not vectorized, which is OK because this device does not have
support for it.
I just don't think a pass could single-handedly make the code that much slower.

Loop interchange is supposed to interchange an inner loop's index with an
outer loop's index to improve cache locality.  This is supposed to help: on
the next iteration the data is already available in the cache.
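
My mental model of what the pass does to Multiply is roughly the following
sketch (simplified: the function names are made up, the pointer cast from the
benchmark is dropped, and both versions assume Res has already been zeroed):

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

/* Original nest: B[Index][Inner] is a strided (column-wise) access in the
   innermost loop, but Res[Outer][Inner] can be accumulated in a register.  */
void Multiply_original (matrix A, matrix B, matrix Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      for (Index = 0; Index < UPPERLIMIT; Index++)
        Res[Outer][Inner] += A[Outer][Index] * B[Index][Inner];
}

/* After interchanging the two inner loops: B[Index][Inner] becomes a linear
   (row-wise) access, but Res[Outer][Inner] is loaded and stored on every
   innermost iteration.  */
void Multiply_interchanged (matrix A, matrix B, matrix Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Index = 0; Index < UPPERLIMIT; Index++)
      for (Inner = 0; Inner < UPPERLIMIT; Inner++)
        Res[Outer][Inner] += A[Outer][Index] * B[Index][Inner];
}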

In the benchmark source, the loop that gets interchanged is at line 143.

Source: 
https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

This loop is where most of the time is spent.  It would have been good to
have access to hardware tracing, to see whether the interchanged loop reduces
cache misses and to see what is causing it to run this much slower.

Thanks for your reply!

From: Richard Biener 
Date: Thursday, February 13, 2025 at 2:57 AM
To: Visda Vokhshoori - C51841 
Cc: gcc@gcc.gnu.org 
Subject: Re: 22% degradation seen in embench:matmult-int

On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc
 wrote:
>
> Embench is used for benchmarking on embedded devices.
> This one project matmult-int has a function Multiply.  It’s a matrix 
> multiplication for 20 x 20 matrix.
> The device is a ATSAME70Q21B which is Cortex-M7
> The compiler is arm branch based on GCC version 13
> We are compiling with O3 which has loop-interchange pass on by default.
>
> When we compile with -fno-loop-interchange we get all 22% back plus 5% speed 
> up.
>
> When we do the loop interchange on the one loop nest that get interchanged it 
> is slightly (.7%) faster.
>
> Has anyone else seen large degradation as a result of loop interchange?

I would suggest to compare the -fopt-info diagnostic output with and
without -fno-loop-interchange,
the interchanged loop might for example no longer vectorize.  Other
than that - no, loop interchange
isn't applied very often and it has a very conservative cost model.

Are you able to share a testcase?

Richard.

>
> Thanks