[Bug tree-optimization/47059] compiler fails to coalesce loads/stores

2013-09-18 Thread vda.linux at googlemail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47059

Denis Vlasenko  changed:

   What|Removed |Added

 CC||vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko  ---
I encountered this behavior with 4.8.0:

struct pollfd pfd[3];
...
pfd[2].events = POLLOUT;
pfd[2].revents = 0;

This compiled to:

movw    $4, 44(%rsp)    #, pfd[2].events
movw    $0, 46(%rsp)    #, pfd[2].revents
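
For comparison, the coalesced form being asked for is a single 32-bit store.
A minimal hand-written C sketch (my illustration, not from the report;
set_pfd2 is a hypothetical name, and it assumes the usual little-endian
layout where revents immediately follows the 2-byte events field):

#include <poll.h>
#include <string.h>

void set_pfd2(struct pollfd *pfd)
{
    /* events = POLLOUT (4 on Linux), revents = 0, written as one
       32-bit value; this compiles to a single "movl $4, ..."-style store. */
    unsigned v = (unsigned short)POLLOUT;  /* low half: events; high half: 0 */
    memcpy(&pfd[2].events, &v, sizeof v);
}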


[Bug middle-end/66240] RFE: extend -falign-xyz syntax

2018-05-21 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #7 from Denis Vlasenko  ---
Patch v8

https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00792.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00793.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00794.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00795.html

[Bug target/45996] -falign-functions=X does not work

2018-05-21 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45996

Denis Vlasenko  changed:

   What|Removed |Added

 CC||vda.linux at googlemail dot com

--- Comment #8 from Denis Vlasenko  ---
See bug 66240

[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2013-01-17 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182



--- Comment #6 from Denis Vlasenko  2013-01-18 
00:48:23 UTC ---

Created attachment 29200

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29200

Updated testcase, build helper, and results of testing with different gcc
versions



Tarball contains:



serpent.c:
the original testcase, only with "#ifdef NAIL_REGS" instead of "#if 0", which
allows test builds without editing it. Basically, "gcc -DNAIL_REGS serpent.c"
tries to force gcc to use only registers instead of the stack.



gencode.sh:
builds serpent.c with -O2 and -O3, with and without -DNAIL_REGS. The object
file names contain the gcc version and the options used. The objects are then
objdump'ed and the output saved. Tweakable by setting $PREFIX and/or $CC.
No -fomit-frame-pointer is used: the testcase can be compiled so that the
stack is not used even without that option.



Disassembly:
serpent-O2-3.4.3.asm
serpent-O2-4.2.1.asm
serpent-O2-4.6.3.asm
serpent-O2-DNAIL_REGS-3.4.3.asm
serpent-O2-DNAIL_REGS-4.2.1.asm
serpent-O2-DNAIL_REGS-4.6.3.asm
serpent-O3-3.4.3.asm
serpent-O3-4.2.1.asm
serpent-O3-4.6.3.asm
serpent-O3-DNAIL_REGS-3.4.3.asm
serpent-O3-DNAIL_REGS-4.2.1.asm
serpent-O3-DNAIL_REGS-4.6.3.asm



Object files:
   text    data     bss     dec     hex filename
   3260       0       0    3260     cbc serpent-O2-DNAIL_REGS-3.4.3.o
   3260       0       0    3260     cbc serpent-O3-DNAIL_REGS-3.4.3.o
   3292       0       0    3292     cdc serpent-O3-3.4.3.o
   3536       0       0    3536     dd0 serpent-O2-4.6.3.o
   3536       0       0    3536     dd0 serpent-O3-4.6.3.o
   3845       0       0    3845     f05 serpent-O2-DNAIL_REGS-4.6.3.o
   3845       0       0    3845     f05 serpent-O3-DNAIL_REGS-4.6.3.o
   3877       0       0    3877     f25 serpent-O2-4.2.1.o
   3877       0       0    3877     f25 serpent-O3-4.2.1.o
   4302       0       0    4302    10ce serpent-O2-3.4.3.o
   4641       0       0    4641    1221 serpent-O2-DNAIL_REGS-4.2.1.o
   4641       0       0    4641    1221 serpent-O3-DNAIL_REGS-4.2.1.o



Take a look inside the serpent-O2-DNAIL_REGS-3.4.3.asm file.
This is what I want to get without asm hacks: the smallest code, using no stack.



gcc-3.4.3 -O3 comes close: it does spill a few words to the stack (search for
(%ebp)), but is generally good code (close to ideal?).

All other attempts fare worse:

gcc-3.4.3 -O2: code is significantly worse than -O3.
gcc-4.2.1 -O2/-O3: code is better than gcc-3.4.3 -O2, worse than gcc-4.6.3.
gcc-4.6.3 -O2/-O3: six instances of spills to stack. Code is still not as good
as gcc-3.4.3 -O3. (-DNAIL_REGS only confuses it more, unlike with 3.4.3.)



Stack usage summary:



$ grep 'sub.*,%esp' *.asm | grep -v DNAIL_REGS
serpent-O2-3.4.3.asm:   6:  81 ec 00 01 00 00   sub    $0x100,%esp
serpent-O2-4.2.1.asm:   6:  83 ec 78            sub    $0x78,%esp
serpent-O2-4.6.3.asm:   4:  83 ec 04            sub    $0x4,%esp
serpent-O3-4.2.1.asm:   6:  83 ec 78            sub    $0x78,%esp
serpent-O3-4.6.3.asm:   4:  83 ec 04            sub    $0x4,%esp



(serpent-O3-3.4.3.asm is not listed, but it allocates and uses one word on the
stack via a push insn).





Modules with best (= minimal) stack usage:



$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O2-DNAIL_REGS-3.4.3.asm
   6:   8b 75 08    mov    0x8(%ebp),%esi
   9:   8b 7d 10    mov    0x10(%ebp),%edi
 ca9:   8b 75 0c    mov    0xc(%ebp),%esi

$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O3-3.4.3.asm
   7:   8b 7d 08    mov    0x8(%ebp),%edi
   a:   8b 4d 10    mov    0x10(%ebp),%ecx
 18c:   89 7d f0    mov    %edi,-0x10(%ebp)
 1dd:   8b 45 f0    mov    -0x10(%ebp),%eax
 23b:   8b 75 f0    mov    -0x10(%ebp),%esi
 299:   8b 7d f0    mov    -0x10(%ebp),%edi
 432:   8b 55 f0    mov    -0x10(%ebp),%edx
 4a0:   8b 4d f0    mov    -0x10(%ebp),%ecx
 50e:   8b 7d f0    mov    -0x10(%ebp),%edi
 84f:   8b 45 f0    mov    -0x10(%ebp),%eax
 8b9:   8b 75 f0    mov    -0x10(%ebp),%esi
 923:   8b 7d f0    mov    -0x10(%ebp),%edi
 cb6:   8b 55 0c    mov    0xc(%ebp),%edx



$ grep -F -e '(%esp)' -e '(%ebp)' serpent-O3-4.6.3.asm
   7:   8b 4c 24 20     mov    0x20(%esp),%ecx
   b:   8b 44 24 18     mov    0x18(%esp),%eax
 22e:   89 0c 24        mov    %ecx,(%esp)
 239:   23 3c 24        and    (%esp),%edi
 588:   89 0c 24        mov    %ecx,(%esp)
 58f:   23 3c 24        and    (%esp),%edi
 8f4:   89 0c 24        mov    %ecx,(%esp)
 8fd:   23 3c 24        and    (%esp),%edi
 c60:   89 0c 24        mov    %ecx,(%esp)
 c6b:   23 3c 24        and    (%esp),%edi
 d37:   89 14 24        mov    %edx,(%esp)
 d5a:   8b 44 24 1c     mov    0x1c(%esp),%eax
 d5e:   33 14 24        xor    (%esp),%edx

[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2013-01-17 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182



Denis Vlasenko  changed:



   What|Removed |Added



 CC||vda.linux at googlemail dot

   ||com



--- Comment #7 from Denis Vlasenko  2013-01-18 
00:51:01 UTC ---

"gcc-4.6.3 got better a bit, still not as good as gcc-4.6.3 -O3."



I meant:



gcc-4.6.3 got better a bit, still not as good as gcc-3.4.3 -O3 used to be.


[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2013-01-17 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182



--- Comment #8 from Denis Vlasenko  2013-01-18 
00:55:37 UTC ---

Grrr, another mistake. Correcting again:



Conclusion:

gcc-3.4.3 -O3 was close to ideal.

^

gcc-4.2.1 is worse.

gcc-4.6.3 got better a bit, still not as good as gcc-3.4.3 -O3 used to be.

 ^


[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.

2013-01-18 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354



--- Comment #16 from Denis Vlasenko  
2013-01-18 10:29:12 UTC ---

(In reply to comment #15)

> Honza, did you find time to have a look?

> 

> I think this regressed alot in 4.6



Not really - it's just the .eh_frame section.
I re-ran the tests with the two gccs I have here, and the sizes look like this:



   text    data     bss     dec     hex filename
 257731       0       0  257731   3eec3 divmod-4.2.1-Os.o
 242787       0       0  242787   3b463 divmod-4.6.3-Os.o



Stock (unpatched) gcc improved; it juggles registers better. For example:

int ib_100_x(int x) { return (100 / x) ^ (100 % x); }

    0:  b8 64 00 00 00  mov    $0x64,%eax
    5:  99              cltd
    6:  f7 7c 24 04     idivl  0x4(%esp)
-   a:  31 c2           xor    %eax,%edx
-   c:  89 d0           mov    %edx,%eax
-   e:  c3              ret
+   a:  31 d0           xor    %edx,%eax
+   c:  c3              ret



I believe my patch would still improve things - it is orthogonal to register
allocation.

BTW, just so that we are all on the same page wrt compiler options: here's the
script I use to compile, disassemble, and extract function sizes from the test
program in comment 3. Tweakable by setting $PREFIX and/or $CC:



gencode.sh
==
#!/bin/sh

#PREFIX="i686-"

test "$PREFIX"  || PREFIX=""
test "$CC"      || CC="${PREFIX}gcc"
test "$OBJDUMP" || OBJDUMP="${PREFIX}objdump"
test "$NM"      || NM="${PREFIX}nm"

CC_VER=`$CC --version | sed -n 's/[^ ]* [^ ]* \([3-9]\.[1-9][^ ]*\).*/\1/p'`
test "$CC_VER" || exit 1

build()
{
    opt=$1
    bname=divmod-$CC_VER${opt}${nail}

    # -ffunction-sections makes disasm easier to understand
    # (insn offsets start from 0 within every function).
    # -fno-exceptions -fno-asynchronous-unwind-tables: die, .eh_frame, die!
    $CC \
        -m32 \
        -fomit-frame-pointer \
        -ffunction-sections \
        -fno-exceptions \
        -fno-asynchronous-unwind-tables \
        ${opt} t.c -c -o $bname.o \
    && $OBJDUMP -dr $bname.o >$bname.asm \
    && $NM --size-sort $bname.o | sort -k3 >$bname.nm
}

build -Os
#build -O2  #not interesting
#build -O3  #not interesting
size *.o | tee SIZES
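
(Typical invocations - the compiler names below are hypothetical examples, not
from the original message:

$ ./gencode.sh                   # plain "gcc"
$ CC=gcc-4.6.3 ./gencode.sh      # a specific compiler
$ PREFIX=i686- ./gencode.sh      # cross tools: i686-gcc, i686-objdump, ...

Each run leaves divmod-<version>-Os.o, .asm and .nm files plus a SIZES
summary.)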


[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits

2013-01-18 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150



Denis Vlasenko  changed:



   What|Removed |Added



 CC||vda.linux at googlemail dot

   ||com



--- Comment #6 from Denis Vlasenko  2013-01-18 
11:12:18 UTC ---

Guess this can be closed now. All four cases look good:



$ cat helper-4.6.3-O2.asm
helper-4.6.3-O2.o: file format elf32-i386
...
00000000 <a>:
   0:   0f b6 05 2d 00 00 00    movzbl 0x2d,%eax
   7:   32 05 24 00 00 00       xor    0x24,%al
   d:   32 05 00 00 00 00       xor    0x0,%al
  13:   32 05 36 00 00 00       xor    0x36,%al
  19:   32 05 3f 00 00 00       xor    0x3f,%al
  1f:   32 05 09 00 00 00       xor    0x9,%al
  25:   32 05 12 00 00 00       xor    0x12,%al
  2b:   32 05 1b 00 00 00       xor    0x1b,%al
  31:   c3                      ret

Disassembly of section .text.b:
00000000 <b>:
   0:   0f b6 05 12 00 00 00    movzbl 0x12,%eax
   7:   32 05 09 00 00 00       xor    0x9,%al
   d:   32 05 00 00 00 00       xor    0x0,%al
  13:   32 05 1b 00 00 00       xor    0x1b,%al
  19:   32 05 24 00 00 00       xor    0x24,%al
  1f:   32 05 2d 00 00 00       xor    0x2d,%al
  25:   32 05 36 00 00 00       xor    0x36,%al
  2b:   32 05 3f 00 00 00       xor    0x3f,%al
  31:   c3                      ret

Disassembly of section .text.c:
00000000 <c>:
   0:   0f b6 05 09 00 00 00    movzbl 0x9,%eax
   7:   32 05 00 00 00 00       xor    0x0,%al
   d:   32 05 12 00 00 00       xor    0x12,%al
  13:   32 05 1b 00 00 00       xor    0x1b,%al
  19:   32 05 24 00 00 00       xor    0x24,%al
  1f:   32 05 2d 00 00 00       xor    0x2d,%al
  25:   32 05 36 00 00 00       xor    0x36,%al
  2b:   32 05 3f 00 00 00       xor    0x3f,%al
  31:   c3                      ret

Disassembly of section .text.d:
00000000 <d>:
   0:   0f b6 05 12 00 00 00    movzbl 0x12,%eax
   7:   32 05 09 00 00 00       xor    0x9,%al
   d:   32 05 00 00 00 00       xor    0x0,%al
  13:   32 05 1b 00 00 00       xor    0x1b,%al
  19:   32 05 24 00 00 00       xor    0x24,%al
  1f:   32 05 2d 00 00 00       xor    0x2d,%al
  25:   32 05 36 00 00 00       xor    0x36,%al
  2b:   32 05 3f 00 00 00       xor    0x3f,%al
  31:   c3                      ret



Curiously, -Os manages to squeeze two more bytes out of it.



helper-4.6.3-Os.o: file format elf32-i386
00000000 <a>:
   0:   a0 2d 00 00 00          mov    0x2d,%al
        ^^                      ^^^ better than movzbl
   5:   33 05 24 00 00 00       xor    0x24,%eax  << why %eax? oh well...
   b:   33 05 00 00 00 00       xor    0x0,%eax
  11:   32 05 36 00 00 00       xor    0x36,%al
  17:   32 05 3f 00 00 00       xor    0x3f,%al
  1d:   32 05 09 00 00 00       xor    0x9,%al
  23:   32 05 12 00 00 00       xor    0x12,%al
  29:   32 05 1b 00 00 00       xor    0x1b,%al
  2f:   c3                      ret


[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage

2013-01-18 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141



Denis Vlasenko  changed:



   What|Removed |Added



 CC||vda.linux at googlemail dot

   ||com



--- Comment #9 from Denis Vlasenko  2013-01-18 
16:01:52 UTC ---

Current gcc seems to be doing fine:



$ grep 'sub.*,%esp' *.asm; size *.o
whirlpool-4.2.1-O2.asm:   81 ec 84 01 00 00    sub    $0x184,%esp
whirlpool-4.2.1-O3.asm:   81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.2.1-Os.asm:   81 ec 84 01 00 00    sub    $0x184,%esp
whirlpool-4.6.3-O2.asm:   81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.6.3-O3.asm:   81 ec 4c 01 00 00    sub    $0x14c,%esp
whirlpool-4.6.3-Os.asm:   81 ec 4c 01 00 00    sub    $0x14c,%esp
   text    data     bss     dec     hex filename
   6223       0       0    6223    184f whirlpool-4.2.1-O2.o
   5663       0       0    5663    161f whirlpool-4.2.1-O3.o
   6194       0       0    6194    1832 whirlpool-4.2.1-Os.o
   5655       0       0    5655    1617 whirlpool-4.6.3-O2.o
   5703       0       0    5703    1647 whirlpool-4.6.3-O3.o
   5570       0       0    5570    15c2 whirlpool-4.6.3-Os.o


[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage

2013-01-18 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141



--- Comment #10 from Denis Vlasenko  
2013-01-18 16:03:37 UTC ---

BTW, the testcase needs a small fix:

-static const u64 C0[256];
+u64 C0[256];

or else gcc will optimize it almost to nothing :) (a static const array with no
initializer is all zeros, so everything computed from it constant-folds).


[Bug rtl-optimization/21182] [4.6/4.7/4.8 Regression] gcc can use registers but uses stack instead

2013-01-20 Thread vda.linux at googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182



--- Comment #11 from Denis Vlasenko  
2013-01-20 14:39:42 UTC ---

(In reply to comment #10)
> 4.4.7 and 4.5.4 generate the same code (no stack use) for -D/-UNAIL_REGS.
> With 4.6.3, the -DNAIL_REGS code regresses very much (IRA ...), the
> -UNAIL_REGS code is nearly perfect but less good than 4.4/4.5 (if you
> only consider grep esp serpent.s | wc -l).  Same behavior with 4.7.2.
>
> Trunk got somewhat worse with -UNAIL_REGS but better with -DNAIL_REGS (at
> -O2):
>
>        -UNAIL_REGS  -DNAIL_REGS
> 4.5.4        3            3
> 4.6.3       15          101



This matches what I see with 4.6.3 - 15 insns with %esp (and no %ebp):



$ grep '%esp' serpent-4.6.3-O2.asm
   4:   83 ec 04        sub    $0x4,%esp
   7:   8b 4c 24 20     mov    0x20(%esp),%ecx
   b:   8b 44 24 18     mov    0x18(%esp),%eax
 22e:   89 0c 24        mov    %ecx,(%esp)
 239:   23 3c 24        and    (%esp),%edi
 588:   89 0c 24        mov    %ecx,(%esp)
 58f:   23 3c 24        and    (%esp),%edi
 8f4:   89 0c 24        mov    %ecx,(%esp)
 8fd:   23 3c 24        and    (%esp),%edi
 c60:   89 0c 24        mov    %ecx,(%esp)
 c6b:   23 3c 24        and    (%esp),%edi
 d37:   89 14 24        mov    %edx,(%esp)
 d5a:   8b 44 24 1c     mov    0x1c(%esp),%eax
 d5e:   33 14 24        xor    (%esp),%edx
 d70:   83 c4 04        add    $0x4,%esp



> The most important thing to fix is the -UNAIL_REGS case of course.

Sure. NAIL_REGS is only a hack meant to demonstrate that regs *can* be
allocated optimally.


[Bug c/70646] Corrupt truncated function

2016-04-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

Denis Vlasenko  changed:

   What|Removed |Added

 CC||vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko  ---
I can reproduce it with:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/5.3.1/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-linker-hash-style=gnu --enable-plugin --enable-initfini-array
--disable-libgcj --with-isl --enable-libmpx --enable-gnu-indirect-function
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) 

No fancy compiler flags are necessary to trigger it.

Without "-fno-omit-frame-pointer", the function loses its two remaining insns
and I see an empty body:

.type   qla2x00_get_host_fabric_name, @function
qla2x00_get_host_fabric_name:
.LFB4504:
.cfi_startproc
.cfi_endproc
.LFE4504:
.size   qla2x00_get_host_fabric_name, .-qla2x00_get_host_fabric_name

A simple "gcc -Os qla_attr.i.c -S" would do.

gcc -O2 produces a normal-looking function.

[Bug c/70646] Corrupt truncated function

2016-04-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

--- Comment #4 from Denis Vlasenko  ---
Shorter reproducer:

typedef __signed__ char __s8;
typedef unsigned char __u8;
typedef __signed__ short __s16;
typedef unsigned short __u16;
typedef __signed__ int __s32;
typedef unsigned int __u32;
__extension__ typedef __signed__ long long __s64;
__extension__ typedef unsigned long long __u64;
typedef signed char s8;
typedef unsigned char u8;
typedef signed short s16;
typedef unsigned short u16;
typedef signed int s32;
typedef unsigned int u32;
typedef signed long long s64;
typedef unsigned long long u64;
typedef __u64 __be64;
static inline __attribute__((no_instrument_function))
__attribute__((__const__)) __u64 __fswab64(__u64 val)
{
 return __builtin_bswap64(val);
}
static inline __attribute__((no_instrument_function))
__attribute__((always_inline)) __u64 __swab64p(const __u64 *p)
{
 return (__builtin_constant_p((__u64)(*p)) ? ((__u64)( (((__u64)(*p) &
(__u64)0x00000000000000ffULL) << 56) | (((__u64)(*p) &
(__u64)0x000000000000ff00ULL) << 40) | (((__u64)(*p) &
(__u64)0x0000000000ff0000ULL) << 24) | (((__u64)(*p) &
(__u64)0x00000000ff000000ULL) << 8) | (((__u64)(*p) &
(__u64)0x000000ff00000000ULL) >> 8) | (((__u64)(*p) &
(__u64)0x0000ff0000000000ULL) >> 24) | (((__u64)(*p) &
(__u64)0x00ff000000000000ULL) >> 40) | (((__u64)(*p) &
(__u64)0xff00000000000000ULL) >> 56))) : __fswab64(*p));
}
static inline __attribute__((no_instrument_function))
__attribute__((always_inline)) __u64 __be64_to_cpup(const __be64 *p)
{
 return __swab64p((__u64 *)p);
}
static inline __attribute__((no_instrument_function))
__attribute__((always_inline)) u64 get_unaligned_be64(const void *p)
{
 return __be64_to_cpup((__be64 *)p);
}
static inline __attribute__((no_instrument_function)) u64 wwn_to_u64(u8 *wwn)
{
 return get_unaligned_be64(wwn);
}

struct Scsi_Host {
 unsigned long base;
 unsigned long io_port;
 unsigned char n_io_port;
 unsigned char dma_channel;
 unsigned int irq;
 void *shost_data;
 unsigned long hostdata[0]
  __attribute__ ((aligned (sizeof(unsigned long))));
};
static inline __attribute__((no_instrument_function)) void *shost_priv(struct
Scsi_Host *shost)
{
 return (void *)shost->hostdata;
}
typedef struct scsi_qla_host {
 u8 fabric_node_name[8];
 u32 device_flags;
} scsi_qla_host_t;
struct fc_host_attrs {
 u64 node_name;
 u64 port_name;
 u64 permanent_port_name;
 u32 supported_classes;
 u8 supported_fc4s[32];
 u32 supported_speeds;
 u32 maxframe_size;
 u16 max_npiv_vports;
 char serial_number[80];
 char manufacturer[80];
 char model[256];
 char model_description[256];
 char hardware_version[64];
 char driver_version[64];
 char firmware_version[64];
 char optionrom_version[64];
 u32 port_id;
 u8 active_fc4s[32];
 u32 speed;
 u64 fabric_name;
};

static void
qla2x00_get_host_fabric_name(struct Scsi_Host *shost)
{
 scsi_qla_host_t *vha = shost_priv(shost);
 u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF,
  0xFF, 0xFF, 0xFF, 0xFF};
 u64 fabric_name = wwn_to_u64(node_name);

 if (vha->device_flags & 0x1)
  fabric_name = wwn_to_u64(vha->fabric_node_name);

 (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name;
}

void *get_host_fabric_name = qla2x00_get_host_fabric_name;

[Bug c/70646] Corrupt truncated function

2016-04-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

--- Comment #5 from Denis Vlasenko  ---
Even smaller reproducer.

Bug disappears if "__attribute__((always_inline))" is removed everywhere.


typedef unsigned char u8;
typedef unsigned int u32;
typedef unsigned long long u64;
static inline __attribute__((__const__)) u64 __fswab64(u64 val)
{
 return __builtin_bswap64(val);
}
static inline __attribute__((always_inline)) u64 __swab64p(const u64 *p)
{
 return (__builtin_constant_p((u64)(*p)) ? ((u64)( (((u64)(*p) &
(u64)0x00000000000000ffULL) << 56) | (((u64)(*p) & (u64)0x000000000000ff00ULL)
<< 40) | (((u64)(*p) & (u64)0x0000000000ff0000ULL) << 24) | (((u64)(*p) &
(u64)0x00000000ff000000ULL) << 8) | (((u64)(*p) & (u64)0x000000ff00000000ULL)
>> 8) | (((u64)(*p) & (u64)0x0000ff0000000000ULL) >> 24) | (((u64)(*p) &
(u64)0x00ff000000000000ULL) >> 40) | (((u64)(*p) & (u64)0xff00000000000000ULL)
>> 56))) : __fswab64(*p));
}
static inline __attribute__((always_inline)) u64 __be64_to_cpup(const u64 *p)
{
 return __swab64p((u64 *)p);
}
static inline __attribute__((always_inline)) u64 get_unaligned_be64(const void
*p)
{
 return __be64_to_cpup((u64 *)p);
}
static inline u64 wwn_to_u64(u8 *wwn)
{
 return get_unaligned_be64(wwn);
}

struct Scsi_Host {
 void *shost_data;
 unsigned long hostdata[0];
};
static inline void *shost_priv(struct Scsi_Host *shost)
{
 return (void *)shost->hostdata;
}
typedef struct scsi_qla_host {
 u8 fabric_node_name[8];
 u32 device_flags;
} scsi_qla_host_t;
struct fc_host_attrs {
 u64 fabric_name;
};

static void
qla2x00_get_host_fabric_name(struct Scsi_Host *shost)
{
 scsi_qla_host_t *vha = shost_priv(shost);
 u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
 u64 fabric_name = wwn_to_u64(node_name);
 if (vha->device_flags & 0x1)
  fabric_name = wwn_to_u64(vha->fabric_node_name);
 (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name;
}

void *get_host_fabric_name = qla2x00_get_host_fabric_name;

[Bug c/70646] Corrupt truncated function

2016-04-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646

--- Comment #6 from Denis Vlasenko  ---
I can collapse the chain of inlines down to this and still see the bug.
Removing "__attribute__((always_inline))", or merging __swab64p() and
wwn_to_u64(), makes bug disappear.


typedef unsigned char u8;
typedef unsigned int u32;
typedef unsigned long long u64;
static inline __attribute__((always_inline)) u64 __swab64p(const u64 *p)
{
 return (__builtin_constant_p((u64)(*p)) ? ((u64)( (((u64)(*p) &
(u64)0x00000000000000ffULL) << 56) | (((u64)(*p) & (u64)0x000000000000ff00ULL)
<< 40) | (((u64)(*p) & (u64)0x0000000000ff0000ULL) << 24) | (((u64)(*p) &
(u64)0x00000000ff000000ULL) << 8) | (((u64)(*p) & (u64)0x000000ff00000000ULL)
>> 8) | (((u64)(*p) & (u64)0x0000ff0000000000ULL) >> 24) | (((u64)(*p) &
(u64)0x00ff000000000000ULL) >> 40) | (((u64)(*p) & (u64)0xff00000000000000ULL)
>> 56))) : __builtin_bswap64(*p));
}
static inline u64 wwn_to_u64(void *wwn)
{
 return __swab64p(wwn);
}

struct Scsi_Host {
 void *shost_data;
 unsigned long hostdata[0];
};
static inline void *shost_priv(struct Scsi_Host *shost)
{
 return (void *)shost->hostdata;
}
typedef struct scsi_qla_host {
 u8 fabric_node_name[8];
 u32 device_flags;
} scsi_qla_host_t;
struct fc_host_attrs {
 u64 fabric_name;
};

static void
qla2x00_get_host_fabric_name(struct Scsi_Host *shost)
{
 scsi_qla_host_t *vha = shost_priv(shost);
 u8 node_name[8] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
 u64 fabric_name = wwn_to_u64(node_name);
 if (vha->device_flags & 0x1)
  fabric_name = wwn_to_u64(vha->fabric_node_name);
 (((struct fc_host_attrs *)(shost)->shost_data)->fabric_name) = fabric_name;
}

void *get_host_fabric_name = qla2x00_get_host_fabric_name;

[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits

2016-04-15 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150

--- Comment #7 from Denis Vlasenko  ---
Fixed at least in 4.7.2, maybe earlier. With -m32 -fomit-frame-pointer -O2:

a:  movzbl  v+45, %eax
    xorb    v+36, %al
    xorb    v, %al
    xorb    v+54, %al
    xorb    v+63, %al
    xorb    v+9, %al
    xorb    v+18, %al
    xorb    v+27, %al
    ret
b:  movzbl  v+18, %eax
    xorb    v+9, %al
    xorb    v, %al
    xorb    v+27, %al
    xorb    v+36, %al
    xorb    v+45, %al
    xorb    v+54, %al
    xorb    v+63, %al
    ret
c:  movzbl  v+9, %eax
    xorb    v, %al
    xorb    v+18, %al
    xorb    v+27, %al
    xorb    v+36, %al
    xorb    v+45, %al
    xorb    v+54, %al
    xorb    v+63, %al
    ret
d:  movzbl  v+18, %eax
    xorb    v+9, %al
    xorb    v, %al
    xorb    v+27, %al
    xorb    v+36, %al
    xorb    v+45, %al
    xorb    v+54, %al
    xorb    v+63, %al
    ret

With the same options but -Os, my only complaint is that the word-sized XORs
needlessly add partial-register-update stalls:

d:  movb    v+18, %al
    xorb    v+9, %al
    xorl    v, %eax
    xorb    v+27, %al
    xorl    v+36, %eax
    xorb    v+45, %al
    xorb    v+54, %al
    xorb    v+63, %al
    ret

but overall it looks much better. Feel free to close this BZ.

[Bug middle-end/66240] RFE: extend -falign-xyz syntax

2016-04-16 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #4 from Denis Vlasenko  ---
Created attachment 38293
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38293&action=edit
Proposed patch

This patch implements -falign-functions=N[,M] for now, with the eye for easy
extension to other -falign options.

I tested that with -falign-functions=N (tried 8, 15, 16, 17...) the alignment
directives are the same before and after the patch:

-falign-functions=8  generates ".p2align 3,,7" before and after.
-falign-functions=17 generates ".p2align 5,,16" before and after.

I tested that -falign-functions=N,N (two equal parameters) works exactly like
-falign-functions=N.

The patch drops the currently performed forced alignment to 8 when the
requested alignment is higher than 8: before the patch, -falign-functions=9
was generating

.p2align 4,,8
.p2align 3

which means "Align to 16 if the skip is 8 bytes or less; else align to 8".
After the patch, ".p2align 3" is not emitted.

I drop that because I ultimately want to do something like
-falign-functions=64,8 - IOW, I want to align functions to 64-byte boundaries,
but only if that entails a skip of less than 8 bytes - otherwise I want **no
alignment at all**. The forced ".p2align 3" interferes with that intention.

This is an RFC-patch, IOW: I don't insist on removal of ".p2align 3"
generation. I imagine that it should be retained for compat, and yet another
option should be added to suppress it if desired (how about
"-mno-8byte-code-subalign"? Argh...)

[Bug middle-end/70703] New: Regression in register usage on x86

2016-04-17 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70703

Bug ID: 70703
   Summary: Regression in register usage on x86
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

$ cat bad.c
unsigned ud_x_641_mul(unsigned x) {
/* optimized version of x / 641 */
return ((unsigned long long)x * 0x663d81) >> 32;
}
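
A quick arithmetic check of the constant (my note, not part of the original
report): 6700417 = 0x663d81, and 6700417 * 641 = 4294967297 = 2^32 + 1, so

    (x * 6700417) >> 32 = floor(x * (2^32 + 1) / (641 * 2^32))
                        = floor(x/641 + x/(641 * 2^32))
                        = x / 641   for every 32-bit x,

i.e. the multiply-and-shift is an exact replacement for the division.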

With gcc from current svn:
$ gcc -m32 -fomit-frame-pointer -O2 bad.c -S && cat bad.s
...
ud_x_641_mul:
    .cfi_startproc
    movl    $6700417, %ecx
    movl    %ecx, %eax
    mull    4(%esp)
    movl    %edx, %ecx
    movl    %ecx, %eax
    ret

Same result with -Os. Note two pointless mov insns.

gcc 5.3.1 is "better"; it adds only one unnecessary insn:

ud_x_641_mul:
    .cfi_startproc
    movl    $6700417, %ecx
    movl    %ecx, %eax
    mull    4(%esp)
    movl    %edx, %eax
    ret

gcc 4.4.x and 4.7.2 were generating this code, which looks optimal:
ud_x_641_mul:
    .cfi_startproc
    movl    $6700417, %eax
    mull    4(%esp)
    movl    %edx, %eax
    ret

I did not test other versions of gcc yet.

[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.

2016-04-17 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354

--- Comment #17 from Denis Vlasenko  ---
Any chance of this being finally done?

I proposed a simple, working patch in 2007; it's 2016 now, and all these years
users of -Os suffer from slow divisions in important cases such as "signed_int
/ 16" and "unsigned_int / 10".

I understand your desire to do it "better" - to make gcc count the size of
div/idiv more accurately, without having to lie to it in the insn size table.
But with you guys constantly distracted by other, more important issues, what
happened here is that _nothing_ was done...

I retested the patch with current svn (the future 7.0.0), using the test
program with 15000 divisions from comment 3:

Bumping the division cost up to 8 is no longer enough; it only makes gcc
handle some (not all) 2^N divisors better. Bumping the div cost to 9..12 helps
with most of the remaining 2^N divisor cases, and with the two exceptional
cases of x / 641 and x / 6700417. Only bumping the div cost to 13, namely,
changing the div costs as follows:

const
struct processor_costs ix86_size_cost = {/* costs for tuning for size */
...
  {COSTS_N_BYTES (13),  /* cost of a divide/mod for QI */
   COSTS_N_BYTES (13),  /*  HI */
   COSTS_N_BYTES (13),  /*  SI */
   COSTS_N_BYTES (13),  /*  DI */
   COSTS_N_BYTES (15)}, /*  other */

makes it work as it used to in the 4.4.x days: out of the 15000 cases in t.c,
975 cases are optimized so that they don't use "div" anymore.
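
(For scale: i386.c stores byte costs doubled -

#define COSTS_N_BYTES(N)  ((N) * 2)

so COSTS_N_BYTES (13) models a div as 13 bytes' worth of cost. That is my
reading of the source; worth double-checking against the tree being patched.)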

This should have made it smaller too... but it did not, because gcc has
meanwhile regressed in another area: it now inserts superfluous register
moves. See bug 70703, which I just filed. Essentially, instead of
    movl    $6700417, %eax
    mull    4(%esp)
    movl    %edx, %eax
    ret
gcc generates:
    movl    $6700417, %ecx
    movl    %ecx, %eax
    mull    4(%esp)
    movl    %edx, %ecx
    movl    %ecx, %eax
    ret

Sizes of the compiled testcases (objN denotes the cost of "div"; A...D
correspond to costs of 10..13):

   text    data     bss     dec     hex filename
 242787       0       0  242787   3b463 gcc.obj3/divmod-7.0.0-Os.o
 242813       0       0  242813   3b47d gcc.obj8/divmod-7.0.0-Os.o
 242838       0       0  242838   3b496 gcc.obj9/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objA/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objB/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objC/divmod-7.0.0-Os.o
 247573       0       0  247573   3c715 gcc.objD/divmod-7.0.0-Os.o

So.
Any chance of this patch being accepted sometime before 2100? ;)

[Bug target/30354] -Os doesn't optimize a/CONST even if it saves size.

2016-04-17 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30354

--- Comment #18 from Denis Vlasenko  ---
Created attachment 38297
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38297&action=edit
Comparison of generated code with 7.0.0.svn on i386

With a div cost of 3:

-   0:  8b 44 24 04     mov    0x4(%esp),%eax
-   4:  b9 64 00 00 00  mov    $0x64,%ecx
-   9:  31 d2           xor    %edx,%edx
-   b:  f7 f1           div    %ecx
-   d:  c3              ret

With a div cost of 13:

+   0:  b9 1f 85 eb 51  mov    $0x51eb851f,%ecx
+   5:  89 c8           mov    %ecx,%eax
+   7:  f7 64 24 04     mull   0x4(%esp)
+   b:  89 d1           mov    %edx,%ecx
+   d:  89 c8           mov    %ecx,%eax
+   f:  c1 e8 05        shr    $0x5,%eax
+  12:  c3              ret

[Bug middle-end/66240] RFE: extend -falign-xyz syntax

2017-04-17 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #6 from Denis Vlasenko  ---
Patches v7 are posted:

https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00720.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00721.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00722.html
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00723.html

[Bug c/77966] Corrupt function with -fsanitize-coverage=trace-pc

2016-10-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966

Denis Vlasenko  changed:

   What|Removed |Added

 CC||vda.linux at googlemail dot com

--- Comment #1 from Denis Vlasenko  ---
Simplified a bit:
- spinlock_t is not essential
- mempool_t is not essential
- the snic_log_q_error_err_status variable is not necessary
- __attribute__ ((__aligned__)) can be dropped too
- struct vnic_wq can be folded
OTOH:
- the struct vnic_wq_ctrl wrapping of the int variable is necessary
- the unused wq_lock[1] member is necessary (it makes gcc "know for sure" that
wq[1] is a 1-element array)
- each of -O2, -fno-reorder-blocks, and -fsanitize-coverage=trace-pc is
necessary


extern unsigned int ioread32(void *);
struct vnic_wq_ctrl {
unsigned int error_status;
};
struct snic {
unsigned int wq_count;
struct vnic_wq_ctrl *wq[1];
int wq_lock[1];
};
void snic_log_q_error(struct snic *snic)
{
unsigned int i;
for (i = 0; i < snic->wq_count; i++)
ioread32(&snic->wq[i]->error_status);
}


<snic_log_q_error>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   e8 00 00 00 00          callq  __sanitizer_cov_trace_pc
   9:   8b 03                   mov    (%rbx),%eax
   b:   85 c0                   test   %eax,%eax  # snic->wq_count==0?
   d:   75 09                   jne    18
   f:   5b                      pop    %rbx  # yes, 0
  10:   e9 00 00 00 00          jmpq   __sanitizer_cov_trace_pc  # tail call
  15:   0f 1f 00                nopl   (%rax)

  18:   e8 00 00 00 00          callq  __sanitizer_cov_trace_pc
  1d:   48 8b 7b 08             mov    0x8(%rbx),%rdi
  21:   e8 00 00 00 00          callq  ioread32
  26:   83 3b 01                cmpl   $0x1,(%rbx)  # snic->wq_count<=1?
  29:   76 e4                   jbe    f
  2b:   e8 00 00 00 00          callq  __sanitizer_cov_trace_pc


Looks like gcc thinks that the loop can execute only zero or one times
(or else we run off the end of wq[]). So when it has iterated once:

  21:   e8 00 00 00 00  callq  ioread32

it checks that snic->wq_count <= 1

  26:   83 3b 01cmpl   $0x1,(%rbx)
  29:   76 e4   jbef

and if not, we are in "impossible" land and codegen just stops. The
-fsanitize-coverage=trace-pc instrumentation twitches one last time:

  2b:   e8 00 00 00 00  callq  __sanitizer_cov_trace_pc


[Bug c/77966] Corrupt function with -fsanitize-coverage=trace-pc

2016-10-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966

--- Comment #2 from Denis Vlasenko  ---
Without -fsanitize-coverage=trace-pc, the second, redundant check
"snic->wq_count<=1?" is not generated. This eliminates the hanging "impossible"
code path:

<snic_log_q_error>:
   0:   8b 07                   mov    (%rdi),%eax
   2:   85 c0                   test   %eax,%eax
   4:   74 09                   je     f
   6:   48 8b 7f 08             mov    0x8(%rdi),%rdi
   a:   e9 00 00 00 00          jmpq   ioread32
   f:   c3                      retq

[Bug target/77966] Corrupt function with -fsanitize-coverage=trace-pc

2016-10-14 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77966

--- Comment #4 from Denis Vlasenko  ---
This confuses object-code sanity-analysis tools which check that every
function ends "properly", i.e. with a return or a jump (possibly padded with
nops).

Can gcc get an option like -finsert-stop-insn-when-unreachable[=insn], making
bad programs crash if they do reach "impossible" code, rather than happily
running off and executing random stuff?

For x86, one-byte INT3, INT1, HLT or two-byte UD2 insn would be a good choice.
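
A sketch of what such an option might emit for the function above (purely
illustrative - no such gcc flag exists as of this writing):

  2b:   e8 00 00 00 00          callq  __sanitizer_cov_trace_pc
  30:   cc                      int3    # trap instead of running off the end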

[Bug c/65410] New: "Short local string array" optimization doesn't happen if string has NULs

2015-03-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65410

Bug ID: 65410
   Summary: "Short local string array" optimization doesn't happen
if string has NULs
   Product: gcc
   Version: 4.7.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com

void f(char *);
void g() {
char buf[12] = "1234567890";
f(buf);
}

In the above example, "gcc -O2" creates buf[12] with immediate stores:

    subq    $24, %rsp
    movabsq $4050765991979987505, %rax
    movq    %rsp, %rdi
    movq    %rax, (%rsp)
    movl    $12345, 8(%rsp)
    call    f
    addq    $24, %rsp
    ret

But if the buf[] definition has a \0 anywhere (for example, at the end, where
it does not even change the semantics of the code), the optimization does not
happen: gcc allocates a constant string and copies it into buf[]:

void f(char *);
void g() {
char buf[12] = "1234567890\0";
f(buf);
}

    .section    .rodata
.LC0:
    .string "1234567890"
    .string ""

    .text
g:
    subq    $24, %rsp
    movq    .LC0(%rip), %rax
    movq    %rsp, %rdi
    movq    %rax, (%rsp)
    movl    .LC0+8(%rip), %eax
    movl    %eax, 8(%rsp)
    call    f
    addq    $24, %rsp
    ret
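
For what it's worth, the two initializers are required to produce identical
objects (C99 6.7.8: the remaining array bytes are zero-filled either way), so
this is purely a missed optimization. A minimal check, assuming nothing beyond
standard C:

#include <assert.h>
#include <string.h>

int main(void)
{
    char a[12] = "1234567890";    /* bytes 10 and 11 implicitly zero */
    char b[12] = "1234567890\0";  /* the explicit NUL changes nothing */
    assert(memcmp(a, b, sizeof a) == 0);
    return 0;
}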


[Bug c/66122] New: Bad uninlining decisions

2015-05-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

Bug ID: 66122
   Summary: Bad uninlining decisions
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

On linux kernel build, I found thousands of cases where functions which are
expected (by programmer) to be inlined, aren't actually inlined.

The following script is used to find them:

nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn

It actually finds functions which have the same name and size and occur more
than once. There are a few false positives, but the vast majority of them are
functions which were supposed to be inlined, but weren't:

(Count) (size) (name)
473 000b t spin_unlock_irqrestore
449 005f t rcu_read_unlock
355 0009 t atomic_inc
353 006e t rcu_read_lock
350 0075 t rcu_read_lock_sched_held
291 000b t spin_unlock
266 0019 t arch_local_irq_restore
215 000b t spin_lock
180 0011 t kzalloc
165 0012 t list_add_tail
161 0019 t arch_local_save_flags
153 0016 t test_and_set_bit
134 000b t spin_unlock_irq
134 0009 t atomic_dec
130 000b t spin_unlock_bh
122 0010 t brelse
120 0016 t test_and_clear_bit
120 000b t spin_lock_irq
119 001e t get_dma_ops
117 0053 t cpumask_next
116 0036 t kref_get
114 001a t schedule_work
106 000b t spin_lock_bh
103 0019 t arch_local_irq_disable
 98 0014 t atomic_dec_and_test
 83 0020 t sg_page
 81 0037 t cpumask_check
 79 0036 t pskb_may_pull
 72 0044 t perf_fetch_caller_regs
 70 002f t cpumask_next
 68 0036 t clk_prepare_enable
 65 0018 t pci_write_config_byte
 65 0013 t tasklet_schedule
 61 0023 t init_completion
 60 002b t trace_handle_return
 59 0043 t nlmsg_trim
 59 0019 t pci_read_config_dword
 59 000c t slow_down_io
...
...

Note tiny sizes of some functions. Let's take a look at atomic_inc:

static inline void atomic_inc(atomic_t *v)
{
asm volatile(LOCK_PREFIX "incl %0"
 : "+m" (v->counter));
}

You would imagine that this won't ever be deinlined, right? It's one assembly
instruction. Well, it isn't always inlined. Here's the disassembly of vmlinux:

81003000 <atomic_inc>:
81003000:   55              push   %rbp
81003001:   48 89 e5        mov    %rsp,%rbp
81003004:   f0 ff 07        lock incl (%rdi)
81003007:   5d              pop    %rbp
81003008:   c3              retq

This can be fixed using __always_inline, but kernel developers hesitate to
slap thousands of __always_inlines everywhere; the mood is that this is a
compiler's fault and should not be accommodated, but fixed.

This happens quite easily with -Os (IOW: with a CONFIG_CC_OPTIMIZE_FOR_SIZE=y
kernel build), but -O2 is not immune either.

I found a file which exhibits an example of bad deinlining with both -O2 and
-Os, and I'm going to attach it.
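
For completeness, the source-level workaround looks like this - a sketch of
the general kernel idiom, not the attached file (atomic_t is stubbed here):

typedef struct { int counter; } atomic_t;

/* Forcing the inliner's hand. Kernel developers resisted doing this
   tree-wide and want the heuristic fixed instead (see above). */
static inline __attribute__((always_inline)) void atomic_inc(atomic_t *v)
{
    asm volatile("lock; incl %0" : "+m" (v->counter));
}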


[Bug c/66122] Bad uninlining decisions

2015-05-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #1 from Denis Vlasenko  ---
Created attachment 35528
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35528&action=edit
Preprocessed example exhibiting a bug

This is a preprocessed kernel/locking/mutex.c file from kernel source.
When built with either -O2 or -Os, it wrongly deinlines spin_lock() and
spin_unlock():

$ gcc -O2 -c mutex.preprocessed.c -o mutex.preprocessed.o
$ objdump -dr mutex.preprocessed.o
mutex.preprocessed.o: file format elf64-x86-64
Disassembly of section .text:
00000000 <spin_unlock>:
   0:   80 07 01                addb   $0x1,(%rdi)
   3:   c3                      retq
   4:   66 66 66 2e 0f 1f 84    data32 data32 nopw %cs:0x0(%rax,%rax,1)
   b:   00 00 00 00 00
0010 <__mutex_init>:
...
0040 <spin_lock>:
  40:   e9 00 00 00 00          jmpq   45 <spin_lock+0x5>
            41: R_X86_64_PC32   _raw_spin_lock-0x4
  45:   66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
  4c:   00 00 00 00

These functions are defined as:

static inline __attribute__((no_instrument_function)) void
spin_unlock(spinlock_t *lock)
{
 __raw_spin_unlock(&lock->rlock);
}

static inline __attribute__((no_instrument_function)) void spin_lock(spinlock_t
*lock)
{
 _raw_spin_lock(&lock->rlock);
}

and the programmer's intent was that they would always be inlined.

This is with gcc-4.7.2


[Bug c/66122] Bad uninlining decisions

2015-05-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #2 from Denis Vlasenko  ---
Tested with gcc-4.9.2. The attached testcase doesn't exhibit the bug, but
compiling the same kernel tree, with the same .config, and then running

nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn

reveals that now other functions get wrongly deinlined:

  8 0028 t acpi_os_allocate_zeroed
  7 0011 t dst_output_sk
  7 000b t hweight_long
  5 0023 t umask_show
  5 000f t init_once
  4 0047 t uni2char
  4 0028 t cmask_show
  4 0025 t inv_show
  4 0025 t edge_show
  4 0020 t char2uni
  4 001f t event_show
  4 001d t acpi_node
  4 0012 t t_stop
  4 0012 t dst_discard
  4 0011 t kzalloc
  4 000b t udp_lib_close
  4 0006 t udp_lib_hash
  3 0059 t get_expiry
  3 0025 t __uncore_inv_show
  3 0025 t __uncore_edge_show
  3 0023 t __uncore_umask_show
  3 0023 t name_show
  3 0022 t acpi_os_allocate
  3 001f t __uncore_event_show
  3 000d t cpumask_set_cpu
  3 000a t nofill
...
...

For example, hweight_long:

static inline unsigned long hweight_long(unsigned long w)
{
return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}

wasn't expected by programmer to be deinlined. But it was:

81009c40 <hweight_long>:
81009c40:   55              push   %rbp
81009c41:   e8 da eb 31 00  callq  81328820 <__sw_hweight64>
81009c46:   48 89 e5        mov    %rsp,%rbp
81009c49:   5d              pop    %rbp
81009c4a:   c3              retq
81009c4b:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)

I'm going to find and attach a file which deinlines hweight_long.


[Bug c/66122] Bad uninlining decisions

2015-05-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #3 from Denis Vlasenko  ---
Created attachment 35530
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35530&action=edit
Preprocessed example exhibiting a bug on gcc -4.9.2

This is a preprocessed kernel/pid.c file from kernel source.
When built with -O2, it wrongly deinlines hweight_long.

$ gcc -O2 -c pid.preprocessed.c -o kernel.pid.o
$ objdump -dr kernel.pid.o | grep -A3 hweight_long
00000000 <hweight_long>:
   0:   e8 00 00 00 00  callq  5 <hweight_long+0x5>
            1: R_X86_64_PC32    __sw_hweight64-0x4
   5:   c3              retq
$ gcc -v 2>&1 | tail -1
gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)


[Bug ipa/65740] spectacularly bad inlinining decisions with -Os

2015-05-12 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65740

Denis Vlasenko  changed:

   What|Removed |Added

 CC||vda.linux at googlemail dot com

--- Comment #3 from Denis Vlasenko  ---
Bug 66122 contains more information, and a recipe how to find many examples
using linux kernel build.

For one, this is not limited to -Os (it does happen with -Os way more easily).


[Bug c/66122] Bad uninlining decisions

2015-05-13 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #6 from Denis Vlasenko  ---
Got a hold on a machine with gcc version 5.1.1 20150422 (Red Hat 5.1.1-1)

Pulled current Linus kernel tree and built it with this config:
http://busybox.net/~vda/kernel_config2
Note that "CONFIG_CC_OPTIMIZE_FOR_SIZE is not set", i.e. it's a -O2 build.

Selecting duplicate functions still shows a number of tiny uninlined functions:

$ nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn
 83 008a t rcu_read_lock_sched_held
 48 001b t sd_driver_init
 48 0012 t sd_driver_exit
 48 0008 t __initcall_sd_driver_init6
 47 0020 t usb_serial_module_init
 47 0012 t usb_serial_module_exit
 47 0008 t __initcall_usb_serial_module_init6
 45 0057 t uni2char
 45 0025 t char2uni
 43 001f t sd_probe
 40 006a t rcu_read_unlock
 29 005a t cpumask_next
 27 007a t rcu_read_lock
 27 0011 t kzalloc
 24 0022 t arch_local_save_flags
 23 0041 t cpumask_check
 19 0017 t phy_module_init
 19 0017 t phy_module_exit
 19 0008 t __initcall_phy_module_init6
 18 006c t spi_write
 18 003f t show_alarm
 18 000b t bitmap_weight
 15 0037 t show_alarms
 15 0014 t init_once
 14 0603 t init_engine
 14 0354 t pcm_trigger
 14 033b t pcm_open
 14 00f8 t stop_transport
 14 00db t pcm_close
 14 00c8 t set_meters_on
 14 00b5 t write_dsp
 14 00b5 t pcm_hw_free
 14 0091 t pcm_pointer
 14 0090 t hw_rule_playback_channels_by_format
 14 008d t send_vector
 14 004f t snd_echo_vumeters_info
 14 0042 t hw_rule_sample_rate
 14 003e t snd_echo_vumeters_switch_put
 14 0034 t audiopipe_free
 14 002b t snd_echo_channels_info_info
 14 0024 t snd_echo_remove
 14 001b t echo_driver_init
 14 0019 t pcm_analog_out_hw_params
 14 0019 t arch_local_irq_restore
 14 0014 t snd_echo_dev_free
 14 0012 t echo_driver_exit
 14 0008 t __initcall_echo_driver_init6
 13 0127 t pcm_analog_out_open
 13 0127 t pcm_analog_in_open
 13 0039 t qdisc_peek_dequeued
 13 0037 t cpumask_check
 13 0022 t arch_local_irq_restore
 13 001c t pcm_analog_in_hw_params
 13 0006 t bcma_host_soc_unregister_driver
 12 0053 t nlmsg_trim
...

Such as:
811a42e0 <kzalloc>:
811a42e0:   55                  push   %rbp
811a42e1:   81 ce 00 80 00 00   or     $0x8000,%esi
811a42e7:   48 89 e5            mov    %rsp,%rbp
811a42ea:   e8 f1 92 1a 00      callq  <__kmalloc>
811a42ef:   5d                  pop    %rbp
811a42f0:   c3                  retq

810792d0 <bitmap_weight>:
810792d0:   55                  push   %rbp
810792d1:   48 89 e5            mov    %rsp,%rbp
810792d4:   e8 37 a8 b7 00      callq  <__bitmap_weight>
810792d9:   5d                  pop    %rbp
810792da:   c3                  retq

and even

88566c9b <bcma_host_soc_unregister_driver>:
88566c9b:   55                  push   %rbp
88566c9c:   48 89 e5            mov    %rsp,%rbp
88566c9f:   5d                  pop    %rbp
88566ca0:   c3                  retq

This is an *empty function*, from drivers/bcma/bcma_private.h:103, uninlined:
static inline void __exit bcma_host_soc_unregister_driver(void)
{
}

BTW it doesn't even have any callers in vmlinux. It should have been optimized
out.


[Bug ipa/66122] Bad uninlining decisions

2015-05-18 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #8 from Denis Vlasenko  ---
If you try to reproduce this with kernel build, be sure to not select
CONFIG_OPTIMIZE_INLINING (it forces inlining by making all iniline functions
__always_inline).

I didn't mention it before, but the recent (as of this writing) gcc 5.1.1
20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior (more than a
thousand *.o modules with spurious deinlines during kernel build).


[Bug ipa/66122] Bad uninlining decisions

2015-05-18 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

--- Comment #10 from Denis Vlasenko  ---
(In reply to Jakub Jelinek from comment #9)
> If you expect that all functions with inline keyword must be always inlined,
> then you really should use __always_inline__ attribute.  Otherwise, inline
> keyword is primarily an optimization hint to the compiler that it might be
> desirable to inline it. So, talking about uninlining or deinlining makes
> absolutely no sense,

Jakub, are you saying that compiling

static inline void spin_unlock(spinlock_t *lock)
{
 __raw_spin_unlock(&lock->rlock);
}

, where __raw_spin_unlock is a function (not macro), to a deinlined function

spin_unlock:
call __raw_spin_unlock
ret


and then callers doing

 call spin_unlock

*can ever* make sense? That's ridiculous.


How about this?

static inline void atomic_inc(atomic_t *v)
{
asm volatile(LOCK_PREFIX "incl %0"
 : "+m" (v->counter));
}

You think it's okay to not inline one insn?


Kernel people did not take my patch which tries to fix this by
__always_inlining the locking ops. Basically, they think that the compiler
should not do stupid things.


[Bug c/66240] New: RFE: extend -falign-xyz syntax

2015-05-21 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

Bug ID: 66240
   Summary: RFE: extend -falign-xyz syntax
   Product: gcc
   Version: 5.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

Experimentally, compilation with
-O2 -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17
results in the following:
- functions are aligned using ".p2align 5,,16" asm directive
- loops/jumps/labels are aligned using ".p2align 5"

-Os -falign-functions=17 -falign-loops=17 -falign-jumps=17 -falign-labels=17
results in the following:
- functions are not aligned
- loops/jumps/labels are aligned using ".p2align 5"

Can this be improved so that in all cases, ".p2align 5,,16" is used? Shouldn't
be that hard...


Next step (what this RFE is all about): -falign-functions=N is too simplistic.
Ingo Molnar ran some tests, and it looks like on the latest x86 CPUs, 64-byte
alignment runs fastest (he tried many other possibilities).

However, developers are less than thrilled by the idea of slam-dunk
64-byte-aligning everything. Too much waste:
On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
> 
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.

I propose the following: align function to 64-byte boundaries *IF* this does
not introduce huge amount of padding.

GNU as already has support for this:

.align N1,FILL,N3

"The third expression is also absolute, and is also optional.
If it is present, it is the maximum number of bytes that should
be skipped by this alignment directive."

So, what we want is to put something like ".align 64,,7" before every
function. 98% of the functions in a typical Linux kernel have a first
instruction 7 or fewer bytes long. Thus, with ".align 64,,7", calling any
function will at a minimum be able to fetch one insn in one L1 read, not two.
And this would be achieved with only ~3.5 bytes per function wasted on padding
on average, whereas ".align 64" would waste 31 bytes on average.

Please extend the -falign-foo=N syntax to, say, -falign-foo=N,M, which
generates ".align M,,N-1" or equivalent.


[Bug c/66240] RFE: extend -falign-xyz syntax

2015-05-22 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #2 from Denis Vlasenko  ---
(In reply to Josh Triplett from comment #1)
> Another alternative discussed in that thread, which seems near-ideal: align
> functions to a given size (for instance, 64 bytes), pack them into that size
> if they fit, but avoid splitting a function across that boundary unless it's
> larger than that boundary.

Josh, I would be more than happy to see gcc/ld becoming clever enough to pack
functions intelligently (say, align big ones to cacheline boundaries, and fit
tiny ones into the resulting padding "holes"). I'm afraid in the current state
of gcc code, that'll be a very tall order to fulfil.

In this BZ, I'm asking for something easy-ish to be done.


[Bug rtl-optimization/64907] New: Suboptimal code (saving rbx on stack in order to save another reg in rbx)

2015-02-02 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64907

Bug ID: 64907
   Summary: Suboptimal code (saving rbx on stack in order to save
another reg in rbx)
   Product: gcc
   Version: 4.7.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com

void put_16bit(unsigned short v);
void put_32bit(unsigned v)
{
put_16bit(v);
put_16bit(v >> 16);
}

With gcc 4.7.2 the above compiles to the following assembly:

put_32bit:
pushq   %rbx
movl%edi, %ebx
andl$65535, %edi
callput_16bit
movl%ebx, %edi
popq%rbx
shrl$16, %edi
jmp put_16bit

Code saves %rbx on stack only in order to save %edi to %ebx.
A simpler alternative is to just save %rdi on stack:

put_32bit:
pushq   %rdi
andl$65535, %edi
callput_16bit
popq%rdi
shrl$16, %edi
jmp put_16bit


[Bug middle-end/66240] RFE: extend -falign-xyz syntax

2016-08-30 Thread vda.linux at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

--- Comment #5 from Denis Vlasenko  ---
Patches v3 posted to the mailing list:

https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02073.html
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg02074.html

[Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal

2021-04-28 Thread vda.linux at googlemail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Bug ID: 100320
   Summary: regression: 32-bit x86 memcpy is suboptimal
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

Bug 21329 has returned.

32-bit x86 memory block moves are using "movl $LEN,%ecx; rep movsl" insns.

However, for fixed short blocks it is more efficient to just repeat a few
"movsl" insns - this allows dropping the "mov $LEN,%ecx" insn.

It's shorter and, more importantly, "rep movsl" is a slow-start microcoded
insn (such moves are faster than moves using general-purpose registers only on
blocks larger than 100-200 bytes) - OTOH, a bare "movsl" is not microcoded and
takes ~4 cycles to execute.

Bug 21329 was closed with this fix:

CVSROOT:/cvs/gcc
Module name:gcc
Branch: gcc-4_0-rhl-branch
Changes by: ja...@gcc.gnu.org   2005-05-18 19:08:44
Modified files:
gcc: ChangeLog 
gcc/config/i386: i386.c 
Log message:
2005-05-06  Denis Vlasenko  
Jakub Jelinek 
PR target/21329
* config/i386/i386.c (ix86_expand_movmem): Don't use rep; movsb
for -Os if (movsl;)*(movsw;)?(movsb;)? sequence is shorter.
Don't use rep; movs{l,q} if the repetition count is really small,
instead use a sequence of movs{l,q} instructions.

(the above is commit 95935e2db5c45bef5631f51538d1e10d8b5b7524 in
gcc.gnu.org/git/gcc.git,
seems that code was largely replaced by:
commit 8c996513856f2769aee1730cb211050fef055fb5
Author: Jan Hubicka 
Date:   Mon Nov 27 17:00:26 2006 +0100
expr.c (emit_block_move_via_libcall): Export.
)


With gcc version 11.0.0 20210210 (Red Hat 11.0.0-0) (GCC) I see "rep movsl"s
again:

void *f(void *d, const void *s)
{ return memcpy(d, s, 16); }

$ gcc -Os -m32 -fomit-frame-pointer -c -o z.o z.c && objdump -drw z.o
z.o: file format elf32-i386
Disassembly of section .text:
00000000 <f>:
   0:   57                      push   %edi
   1:   b9 04 00 00 00          mov    $0x4,%ecx
   6:   56                      push   %esi
   7:   8b 44 24 0c             mov    0xc(%esp),%eax
   b:   8b 74 24 10             mov    0x10(%esp),%esi
   f:   89 c7                   mov    %eax,%edi
  11:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  13:   5e                      pop    %esi
  14:   5f                      pop    %edi
  15:   c3                      ret

The expected code would not have the "mov $0x4,%ecx" and would have the
"rep movsl" replaced by "movsl; movsl; movsl; movsl".
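
For the 16-byte copy above, that expected -Os output would look roughly like
this (my sketch, keeping the same register assignment as the current output):

   0:   57                      push   %edi
   1:   56                      push   %esi
   2:   8b 44 24 0c             mov    0xc(%esp),%eax
   6:   8b 74 24 10             mov    0x10(%esp),%esi
   a:   89 c7                   mov    %eax,%edi
   c:   a5                      movsl
   d:   a5                      movsl
   e:   a5                      movsl
   f:   a5                      movsl
  10:   5e                      pop    %esi
  11:   5f                      pop    %edi
  12:   c3                      ret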

The testcase from 21329, with implicit block moves via struct copies, from
https://gcc.gnu.org/bugzilla/attachment.cgi?id=8790
also demonstrates it:

$ gcc -Os -m32 -fomit-frame-pointer -c -o z1.o z1.c && objdump -drw z1.o
z1.o: file format elf32-i386
Disassembly of section .text:
 :
   0:   a1 00 00 00 00          mov    0x0,%eax     1: R_386_32 w10
   5:   a3 00 00 00 00          mov    %eax,0x0     6: R_386_32 t10
   a:   c3                      ret
000b :
   b:   a1 00 00 00 00          mov    0x0,%eax     c: R_386_32 w20
  10:   8b 15 04 00 00 00       mov    0x4,%edx    12: R_386_32 w20
  16:   a3 00 00 00 00          mov    %eax,0x0    17: R_386_32 t20
  1b:   89 15 04 00 00 00       mov    %edx,0x4    1d: R_386_32 t20
  21:   c3                      ret
0022 :
  22:   57                      push   %edi
  23:   b9 09 00 00 00          mov    $0x9,%ecx
  28:   bf 00 00 00 00          mov    $0x0,%edi   29: R_386_32 t21
  2d:   56                      push   %esi
  2e:   be 00 00 00 00          mov    $0x0,%esi   2f: R_386_32 w21
  33:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  35:   5e                      pop    %esi
  36:   5f                      pop    %edi
  37:   c3                      ret
0038 :
  38:   57                      push   %edi
  39:   b9 0a 00 00 00          mov    $0xa,%ecx
  3e:   bf 00 00 00 00          mov    $0x0,%edi   3f: R_386_32 t22
  43:   56                      push   %esi
  44:   be 00 00 00 00          mov    $0x0,%esi   45: R_386_32 w22
  49:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  4b:   5e                      pop    %esi
  4c:   5f                      pop    %edi
  4d:   c3                      ret
004e :
  4e:   57                      push   %edi
  4f:   b9 0b 00 00 00          mov    $0xb,%ecx
  54:   bf 00 00 00 00          mov    $0x0,%edi   55: R_386_32 t23
  59:   56                      push   %esi
  5a:   be 00 00 00 00          mov    $0x0,%esi   5b: R_386_32 w23
  5f:   f3 a4                   rep movsb

[Bug target/100320] [8/9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal

2021-04-28 Thread vda.linux at googlemail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

--- Comment #2 from Denis Vlasenko  ---
The relevant code in current git seems to be:

static void
expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem,
   rtx destptr, rtx srcptr, rtx value, rtx orig_value,
   rtx count,
   machine_mode mode, bool issetmem)
{
  rtx destexp;
  rtx srcexp;
  rtx countreg;
  HOST_WIDE_INT rounded_count;

  /* If possible, it is shorter to use rep movs.
 TODO: Maybe it is better to move this logic to decide_alg.  */
  if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3)
  && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB
  && (!issetmem || orig_value == const0_rtx))
mode = SImode;

  if (destptr != XEXP (destmem, 0) || GET_MODE (destmem) != BLKmode)
destmem = adjust_automodify_address_nv (destmem, BLKmode, destptr, 0);

  countreg = ix86_zero_extend_to_Pmode (scale_counter (count,
   GET_MODE_SIZE (mode)));
  if (mode != QImode)
{
      destexp = gen_rtx_ASHIFT (Pmode, countreg,
                                GEN_INT (exact_log2 (GET_MODE_SIZE (mode))));
      destexp = gen_rtx_PLUS (Pmode, destexp, destptr);
}
  else
destexp = gen_rtx_PLUS (Pmode, destptr, countreg);
  if ((!issetmem || orig_value == const0_rtx) && CONST_INT_P (count))
{
  rounded_count
= ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE (mode));
  destmem = shallow_copy_rtx (destmem);
  set_mem_size (destmem, rounded_count);
}
  else if (MEM_SIZE_KNOWN_P (destmem))
clear_mem_size (destmem);

  if (issetmem)
{
  value = force_reg (mode, gen_lowpart (mode, value));
  emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp));
}
  else
{
  if (srcptr != XEXP (srcmem, 0) || GET_MODE (srcmem) != BLKmode)
srcmem = adjust_automodify_address_nv (srcmem, BLKmode, srcptr, 0);
  if (mode != QImode)
{
          srcexp = gen_rtx_ASHIFT (Pmode, countreg,
                                   GEN_INT (exact_log2 (GET_MODE_SIZE (mode))));
          srcexp = gen_rtx_PLUS (Pmode, srcexp, srcptr);
}
  else
srcexp = gen_rtx_PLUS (Pmode, srcptr, countreg);
  if (CONST_INT_P (count))
{
  rounded_count
= ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE
(mode));
  srcmem = shallow_copy_rtx (srcmem);
  set_mem_size (srcmem, rounded_count);
}
  else
{
  if (MEM_SIZE_KNOWN_P (srcmem))
clear_mem_size (srcmem);
}
  emit_insn (gen_rep_mov (destptr, destmem, srcptr, srcmem, countreg,
  destexp, srcexp));
}
}

[Bug c/115875] New: -Oz optimization of "push IMM; pop REG" is used incorrectly for 64-bit constants with 31th bit set

2024-07-11 Thread vda.linux at googlemail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115875

Bug ID: 115875
   Summary: -Oz optimization of "push IMM; pop REG" is used
incorrectly for 64-bit constants with 31th bit set
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

void sp_256_sub_8_p256_mod(unsigned long *r)
{
    unsigned long reg, ooff;
    asm volatile (
    "\n     subq    $0xffffffffffffffff, (%0)"
    "\n     sbbq    %1, 1*8(%0)"
    "\n     sbbq    $0, 2*8(%0)"
    "\n     movq    3*8(%0), %2"
    "\n     sbbq    $0, %2"
    "\n     addq    %1, %2"
    "\n     movq    %2, 3*8(%0)"
    : "=r" (r), "=r" (ooff), "=r" (reg)
    : "0" (r), "1" (0x00000000ffffffff)
    : "memory");
}

"gcc -fomit-frame-pointer -Oz -S tls_sp_c32.c" generates this:

    pushq   $-1
    popq    %rax    # BUG!!! gcc thinks %rax = 0x00000000ffffffff
                    # but, of course, it loads 0xffffffffffffffff instead!
    subq    $0xffffffffffffffff, (%rdi)
    sbbq    %rax, 1*8(%rdi)
    sbbq    $0, 2*8(%rdi)
    movq    3*8(%rdi), %rdx
    sbbq    $0, %rdx
    addq    %rax, %rdx
    movq    %rdx, 3*8(%rdi)
    ret

Looks like either gcc thinks "pushq $-1" truncates the value to 32 bits (in
reality, it is sign-extended), or it thinks it uses a "pop %eax" insn (no such
insn exists in 64-bit mode; only 64-bit register pops are possible).
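
A minimal demonstration of the ISA semantics (my illustration; easy to verify
in any 64-bit assembler or debugger):

    pushq   $-1     # the imm8 is sign-extended: the stack slot now holds
                    # 0xffffffffffffffff
    popq    %rax    # %rax = 0xffffffffffffffff
                    # so push+pop can only materialize sign-extended values;
                    # 0x00000000ffffffff is not one of them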

Code generated with -Os is correct:

    orl     $-1, %eax   # zero-extended to 64 bits, correct result in %rax
    subq    $0xffffffffffffffff, (%rdi)
    sbbq    %rax, 1*8(%rdi)
    sbbq    $0, 2*8(%rdi)
    movq    3*8(%rdi), %rdx
    sbbq    $0, %rdx
    addq    %rax, %rdx
    movq    %rdx, 3*8(%rdi)
    ret

In fact, in this case "push IMM; pop REG" is 3 bytes and "orl $-1, %eax" is 3
bytes too (the 8-bit-immediate form), so the -Oz optimization is not a win
here (same size, slower code).

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/14/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,m2,lto --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --enable-libstdcxx-backtrace
--with-libstdcxx-zoneinfo=/usr/share/zoneinfo --with-linker-hash-style=gnu
--enable-plugin --enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-14.0.1-20240328/obj-x86_64-redhat-linux/isl-install
--enable-offload-targets=nvptx-none,amdgcn-amdhsa --enable-offload-defaulted
--without-cuda-driver --enable-gnu-indirect-function --enable-cet
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
--with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.0.1 20240328 (Red Hat 14.0.1-0) (GCC)

[Bug target/115875] -Oz optimization of "push IMM; pop REG" is used incorrectly for 64-bit constants with 31th bit set

2024-07-11 Thread vda.linux at googlemail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115875

--- Comment #2 from Denis Vlasenko  ---


0xffffffffUL works, although it uses the 5-byte

b8 ff ff ff ff          mov    $0xffffffff,%eax

instead of the 3-byte

83 c8 ff                or     $0xffffffff,%eax