[Bug c++/91043] New: GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Bug ID: 91043
   Summary: GCC produces unaligned vmovdqa vector data access
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hhaim at cisco dot com
  Target Milestone: ---

**The project**: 
https://github.com/cisco-system-traffic-generator/trex-core

**how to compile**: 
https://github.com/cisco-system-traffic-generator/trex-core/wiki#how-to-build-trex

The commit with a workaround:

https://github.com/cisco-system-traffic-generator/trex-core/commit/39e7f535f96f0f5b4406db667be7bc775ce3e515

**The issue**: 
gcc 7 and gcc 8 generate aligned vector instructions for a variable that gcc
itself allocated, and the variable does not appear to be aligned.


The object is defined like this:

static CGlobalTRex g_trex;

It includes 

CLatencyManager m_mg; 

which includes 

CLatencyManagerPerPort  m_ports[TREX_MAX_PORTS];


class CLatencyManagerPerPort {
public:
 CCPortLatency  m_port;   << crash is in the reset function of this object
 CPortLatencyHWBase  *  m_io;
 uint32_t   m_flag;
};


**Workaround**: 

Adding the no-sse2 target attribute to this function works around the issue:

__attribute__((noinline,target("no-sse2"))) 
void CCPortLatency::reset(){





void CCPortLatency::reset(){


warning: bad breakpoint number at or near '0x585763'
(gdb) disassemble 0x585763 
Dump of assembler code for function CCPortLatency::Create(unsigned char,
unsigned short, unsigned short, unsigned short, CCPortLatency*,
CLatencyPktMode*, CNatRxManager*):
   0x005856a0 <+0>:     push   %rbp
   0x005856a1 <+1>:     mov    %rsp,%rbp
   0x005856a4 <+4>:     push   %r12
   0x005856a6 <+6>:     push   %r10
   0x005856a8 <+8>:     lea    0x10(%rbp),%r10
   0x005856ac <+12>:    push   %rbx
   0x005856ad <+13>:    mov    %rdi,%rbx
   0x005856b0 <+16>:    sub    $0x8,%rsp
   0x005856b4 <+20>:    mov    (%r10),%rax
   0x005856b7 <+23>:    movb   $0x0,0x3f(%rbx)
   0x005856bb <+27>:    mov    0x8(%r10),%rdi
   0x005856bf <+31>:    mov    %rax,(%rbx)
   0x005856c2 <+34>:    test   %rax,%rax
   0x005856c5 <+37>:    je     0x585795
   0x005856cb <+43>:    mov    %esi,%eax
   0x005856cd <+45>:    mov    %sil,0x31(%rbx)
   0x005856d1 <+49>:    movzbl %sil,%esi
   0x005856d5 <+53>:    not    %eax
   0x005856d7 <+55>:    mov    %rdi,0x8(%rbx)
   0x005856db <+59>:    and    $0x1,%eax
   0x005856de <+62>:    movb   $0x1,0x3e(%rbx)
   0x005856e2 <+66>:    movl   $0x12345678,0x28(%rbx)
   0x005856e9 <+73>:    movl   $0x1,0x38(%rbx)
   0x005856f0 <+80>:    mov    %cx,0x34(%rbx)
   0x005856f4 <+84>:    mov    %dx,0x32(%rbx)
   0x005856f8 <+88>:    mov    %r8w,0x36(%rbx)
   0x005856fd <+93>:    mov    %r9,0x10(%rbx)
   0x00585701 <+97>:    mov    %al,0x19(%rbx)
   0x00585704 <+100>:   mov    %al,0x18(%rbx)
   0x00585707 <+103>:   movq   $0x0,0x1c(%rbx)
   0x0058570f <+111>:   cmpb   $0x0,0xc2e938(%rsi)
   0x00585716 <+118>:   je     0x585721
   0x00585718 <+120>:   movb   $0x1,0x24(%rbx)
   0x0058571c <+124>:   movb   $0x1,0x24(%r9)
   0x00585721 <+129>:   lea    0x100(%rbx),%r12
   0x00585728 <+136>:   mov    %r12,%rdi
   0x0058572b <+139>:   callq  0x590320
   0x00585730 <+144>:   mov    0x6a8449(%rip),%rdi        # 0xc2db80
   0x00585737 <+151>:   callq  0x4c5be0
   0x0058573c <+156>:   mov    0x28(%rbx),%eax
   0x0058573f <+159>:   mov    %r12,%rdi
   0x00585742 <+162>:   vpxor  %xmm0,%xmm0,%xmm0
   0x00585746 <+166>:   movb   $0x0,0x30(%rbx)
   0x0058574a <+170>:   movq   $0x0,0xc0(%rbx)
   0x00585755 <+181>:   movq   $0x0,0xc8(%rbx)
   0x00585760 <+192>:   mov    %eax,0x2c(%rbx)
=> 0x00585763 <+195>:   vmovdqa %ymm0,0x40(%rbx) << crash here
   0x00585768 <+200>:   vmovdqa %ymm0,0x60(%rbx)
   0x0058576d <+205>:   vmovdqa %ymm0,0x80(%rbx)
   0x00585775 <+213>:   vmovdqa %ymm0,0xa0(%rbx)
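
For context on the faulting instruction: `vmovdqa` with a `%ymm` operand is an aligned 32-byte store, and it faults when the effective address is not a multiple of 32. The following is a minimal sketch, not taken from the TRex sources, that checks the four store addresses from the listing for a given `this` pointer (held in `%rbx`). The helper names `is_aligned`, `stores_are_aligned` and `check_listing_examples`, and the sample base addresses, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `addr` is a multiple of `alignment` (which must be a power of two).
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// The four stores in the listing target this+0x40, +0x60, +0x80 and +0xa0.
// Every displacement is a multiple of 32, so the stores are aligned
// exactly when the `this` pointer itself is 32-byte aligned.
constexpr bool stores_are_aligned(std::uintptr_t this_ptr) {
    return is_aligned(this_ptr + 0x40, 32) && is_aligned(this_ptr + 0x60, 32) &&
           is_aligned(this_ptr + 0x80, 32) && is_aligned(this_ptr + 0xa0, 32);
}

// Hypothetical base addresses, chosen only to show both outcomes.
inline void check_listing_examples() {
    assert(stores_are_aligned(0x1000));   // 32-byte aligned base: stores are legal
    assert(!stores_are_aligned(0x1010));  // 16-byte aligned base: vmovdqa would fault
}
```

In other words, the crash at `<+195>` says that the object passed in `%rdi` was not on a 32-byte boundary; it says nothing yet about who placed it there.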

[Bug c++/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Hanoch Haim  changed:

   What|Removed |Added

 Target||x86
   Host||x86

--- Comment #1 from Hanoch Haim  ---
/usr/local/gcc-7.4/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc-7.4/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.4/libexec/gcc/x86_64-pc-linux-gnu/7.4.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --disable-multilib --enable-languages=c,c++
--prefix=/usr/local/gcc-7.4
Thread model: posix
gcc version 7.4.0 (GCC) 

[Bug c++/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #2 from Hanoch Haim  ---
/usr/local/gcc-8.3/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/usr/local/gcc-8.3/bin/gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc-8.3/libexec/gcc/x86_64-pc-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --disable-multilib --enable-languages=c,c++
--prefix=/usr/local/gcc-8.3
Thread model: posix
gcc version 8.3.0 (GCC) 

[Bug c++/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #3 from Hanoch Haim  ---
With the Ubuntu gcc 7.4 package, there is no bug.
The gcc that I built from source shows the issue. The two builds use different
configuration values.

[Bug c++/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #5 from Hanoch Haim  ---
That was fast.

The build instructions are here:

https://github.com/cisco-system-traffic-generator/trex-core/wiki#how-to-build-trex

```
$ git clone g...@github.com:cisco-system-traffic-generator/trex-core.git
$ cd linux_dpdk
$ ./b configure
$ ./b build
```


With gcc 7.x/8.x, only this function appears to be optimized incorrectly.


If anything else is needed, I will provide it.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #7 from Hanoch Haim  ---
Created attachment 46541
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46541&action=edit
stateful_rx_core.ii

Compressed .ii file

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #8 from Hanoch Haim  ---
Created attachment 46542
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46542&action=edit
stateful_rx_core.ss

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #9 from Hanoch Haim  ---
Attached. I hope this is what you are looking for.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #11 from Hanoch Haim  ---
Thanks for the quick answer.
The parent object is static (bss) and was not dynamically allocated using
new/malloc.
gcc sets the address of the parent object and of its children.

Is there a way to solve it without removing the alignment?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Hanoch Haim  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|INVALID |---

--- Comment #12 from Hanoch Haim  ---
Removing __rte_cache_aligned does not solve the issue


diff --git a/src/time_histogram.h b/src/time_histogram.h
index 07e66b49..26a37248 100755
--- a/src/time_histogram.h
+++ b/src/time_histogram.h
@@ -133,10 +133,10 @@ private:
 uint32_t m_win_cnt;
 uint32_t m_hot_max;
 dsec_t   m_max_ar[HISTOGRAM_QUEUE_SIZE]; // Array of maximum latencies for previous periods
-uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE] __rte_cache_aligned ;
+uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE]  ;
 // Hdr histogram instance
 hdr_histogram *m_hdrh;
-};
+} __rte_cache_aligned;

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #13 from Hanoch Haim  ---
One more thing: the parent object is defined with 64-byte alignment.

class CGlobalTRex  {
..

} __rte_cache_aligned;

static CGlobalTRex  trex;

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #16 from Hanoch Haim  ---
The global/parent object CGlobalTRex is aligned (64B) as expected:

(gdb) p &g_trex
$1 = (CGlobalTRex *) 0xc365c0 

Could you explain why it is a problem to define the internal objects with the
same alignment as the parent (64B)?
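
As a quick check of the address printed by gdb above, a 64-byte aligned address has its low six bits clear. The sketch below only redoes that arithmetic; the names `kGTrexAddr`, `is_multiple_of` and `check_g_trex_address` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Address of g_trex as printed by gdb in this comment.
constexpr std::uintptr_t kGTrexAddr = 0xc365c0;

// True when `addr` is a multiple of `alignment` (which must be a power of two).
constexpr bool is_multiple_of(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

inline void check_g_trex_address() {
    assert(is_multiple_of(kGTrexAddr, 64));  // low byte is 0xc0, so the low six bits are zero
    assert(is_multiple_of(kGTrexAddr, 32));  // 64-byte alignment implies 32-byte alignment
}
```

So the static object itself sits where the compiler was told to put it.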

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #19 from Hanoch Haim  ---
After some investigation, I think this is not a gcc issue. Please verify.
One of the internal objects is not declared with 64B alignment.

#define __rte_cache_aligned __attribute__((__aligned__(64)))

class CTimeHistogram {

} __rte_cache_aligned;


class CCPortLatency {
public:
 CTimeHistogram  m_hist;  
} __rte_cache_aligned;  <<= without this, it is not aligned while the code generation assumed it is aligned !

class Root {

CCPortLatency port;

} __rte_cache_aligned;


Is this valid? Why does the code generation assume that CCPortLatency is aligned
just because one of its internal objects is aligned?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #20 from Hanoch Haim  ---
One more thing. I would expect the issue to show up in the CTimeHistogram
functions (the class defined as aligned), but the code generation issue was in
the parent object (CCPortLatency).
Why does the compiler assume that if one of the internal objects is defined as
aligned, the parent is aligned too?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #22 from Hanoch Haim  ---

"Of course it does, because without aligning the container you cannot have
aligned members.  Maximum alignment always propagates outwards."

Sorry, your answer is still not clear to me, so let me give a short example.
In this case there is a discrepancy between two gcc modules:

1. The module that generates the code thinks that it (CCPortLatency) is aligned.
2. However, the linker puts it at a non-aligned location.


"
class CTimeHistogram {

} __rte_cache_aligned;

class CCPortLatency {
public:
 CTimeHistogram  m_hist;  
}; 
class Root {

CCPortLatency port;

} __rte_cache_aligned;

static Root root; 
"

In this case, can I expect root.port to be aligned because its child (m_hist)
was defined as aligned and the alignment propagates? Or should I explicitly ask
for both to be aligned?
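
The question can be checked at compile time. The following is a minimal sketch of the same three-level layout, using standard `alignas(64)` as a stand-in for `__rte_cache_aligned`; the struct names, the member lists and `check_root_port` are hypothetical and much smaller than the real TRex classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for: class CTimeHistogram { ... } __rte_cache_aligned;
struct alignas(64) TimeHistogram {
    std::uint64_t counters[8];
};

// No alignment attribute of its own, like CCPortLatency in the example above.
struct PortLatency {
    std::uint32_t flag;
    TimeHistogram hist;
};

struct alignas(64) Root {
    std::uint32_t other;
    PortLatency port;
};

// The strictest member alignment becomes the alignment of the containing type.
static_assert(alignof(TimeHistogram) == 64, "explicitly requested");
static_assert(alignof(PortLatency) == 64, "inherited from the hist member");
static_assert(sizeof(PortLatency) % 64 == 0, "size is padded to a multiple of the alignment");

static Root root;

// Checks that the statically allocated root.port sits on a 64-byte boundary.
inline void check_root_port() {
    assert(reinterpret_cast<std::uintptr_t>(&root.port) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&root.port.hist) % 64 == 0);
}
```

Under these assumptions the explicit attribute on the middle class is redundant: its alignment is already 64 because of the member, which is why the code generator may use aligned stores on it.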

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Hanoch Haim  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |INVALID

--- Comment #25 from Hanoch Haim  ---
Hi Richard,

You were right all along. I was looking in the wrong place!
I understand it now, and it is not a gcc issue. gcc 7/8 are simply better than
gcc 6 at code generation.

1. Alignment is contagious: gcc marks every parent object of such an
object as aligned.

2. With statically allocated objects there is no issue.

3. The issue in my case was a dynamic allocation of a different object that
includes the aligned object. That parent object is assumed to be aligned, but
it was allocated dynamically (not aligned).


Thank you for the explanation.
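
Point 3 can be illustrated with a small sketch. The thread does not show how TRex changed its allocation, so this is only one possible approach, assuming a POSIX system (it uses `posix_memalign`). The types `Histogram` and `PortLatency` and the helpers `create_aligned` and `destroy_aligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX)

struct alignas(64) Histogram {
    std::uint64_t counters[8];
};

// alignof(PortLatency) is 64 because of the hist member.
struct PortLatency {
    std::uint32_t flag;
    Histogram hist;
};

// Allocate storage that honours alignof(T), then construct T in place.
template <typename T>
T* create_aligned() {
    // posix_memalign needs a power-of-two alignment of at least sizeof(void*).
    constexpr std::size_t align =
        alignof(T) < sizeof(void*) ? sizeof(void*) : alignof(T);
    void* raw = nullptr;
    if (posix_memalign(&raw, align, sizeof(T)) != 0) {
        throw std::bad_alloc();
    }
    try {
        return new (raw) T();  // placement new into the aligned storage
    } catch (...) {
        free(raw);
        throw;
    }
}

// Destroy an object created by create_aligned and release its storage.
template <typename T>
void destroy_aligned(T* obj) {
    if (obj != nullptr) {
        obj->~T();
        free(obj);
    }
}

// Usage: the returned pointer is safe for 32-byte and 64-byte aligned stores.
inline void check_dynamic_allocation() {
    PortLatency* p = create_aligned<PortLatency>();
    assert(reinterpret_cast<std::uintptr_t>(p) % alignof(PortLatency) == 0);
    destroy_aligned(p);
}
```

When building as C++17 or later, a plain `new` expression already requests aligned storage for over-aligned types, so a helper like this mainly matters for older language modes or for code that allocates with `malloc`.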