[Bug tree-optimization/57642] New: vectorizer not working with function templates

2013-06-18 Thread yzhang1985 at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57642

Bug ID: 57642
   Summary: vectorizer not working with function templates
   Product: gcc
   Version: 4.8.1
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com

Hi, the following simple loop doesn't vectorize in GCC 4.8.1, but does with
4.3.2. It does vectorize if I make DoIt a regular function instead of a
templated function.


#include 
#include 
#include 
#include 
#include 
#include 


class SqrtFunc
{
public:
  float operator()(float x)
  {
    return (((3.02f * x) + 1.5f) * x - 2.1f) * x + 1.5f;
  }
};

template <typename Functor>
void DoIt(float *data, int size, Functor functor)
{
  for (int i = 0; i < size; ++i)
  {
    data[i] = functor(data[i]);
  }
}


int main()
{
  float data[2048];
  SqrtFunc functor;
  DoIt(data, sizeof(data), functor);
  return 0;
}


[Bug tree-optimization/57642] vectorizer not working with function templates

2013-06-18 Thread yzhang1985 at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57642

--- Comment #1 from Yale Zhang  ---
I would like to know if there's an easy work around for this.


[Bug tree-optimization/57642] vectorizer not working with function templates

2013-06-18 Thread yzhang1985 at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57642

Yale Zhang  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from Yale Zhang  ---
Sorry, please close this. My loop was eliminated as dead code, so there was
nothing left to vectorize. I saw the message "not enough data-refs for
auto-vectorization", which made me think the loop wasn't being vectorized, but
that message probably came from somewhere else.
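
For anyone who runs into the same symptom, here is a minimal sketch (not the
original test case) of keeping the loop alive so the vectorizer has something
to work on. It assumes the cause is the one described above: nothing ever read
the results, so the loop was dropped. The names PolyFunc and RunAndChecksum are
made up for illustration, and the element count is passed instead of
sizeof(data), which is a byte count.

```cpp
#include <cassert>
#include <cstddef>

// Same polynomial as in the report; the struct name is illustrative.
struct PolyFunc
{
  float operator()(float x) const
  {
    return (((3.02f * x) + 1.5f) * x - 2.1f) * x + 1.5f;
  }
};

template <typename Functor>
void DoIt(float *data, std::size_t count, Functor functor)
{
  assert(data != 0 || count == 0);
  for (std::size_t i = 0; i < count; ++i)
  {
    data[i] = functor(data[i]);
  }
}

// Hypothetical helper: fills the array from a runtime value, runs the loop,
// and returns a checksum so the stores cannot be discarded as dead code.
float RunAndChecksum(float seed)
{
  float data[2048];
  const std::size_t count = sizeof(data) / sizeof(data[0]);  // elements, not bytes
  for (std::size_t i = 0; i < count; ++i)
  {
    data[i] = seed;
  }
  DoIt(data, count, PolyFunc());
  float sum = 0.0f;
  for (std::size_t i = 0; i < count; ++i)
  {
    sum += data[i];
  }
  return sum;
}
```

Returning or printing the checksum from main() should be enough to make the
result observable. Compiling with -O3 and, on GCC versions that support it,
-fopt-info-vec should then show whether the loop in DoIt was vectorized.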


[Bug java/83647] add x86_64 Windows support to GCJ

2018-01-01 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83647

--- Comment #2 from Yale Zhang  ---
(In reply to Andrew Pinski from comment #1)
> GCC 6 is in regression only fixes due to it being a release branch.
> 
> Won't fix as Java was removed from GCC 7.  There are other open source Java
> implementations including but not limited to OpenJDK.

I was afraid of that, but I want to compile to native code and AFAIK, GCJ is
the only one or if not, the only robust one. I think this change can be useful
to others and shouldn't be lost just because GCC 6 is limited to regression
fixes only. Is there a non-release branch that this can be checked into?

[Bug java/83647] New: add x86_64 Windows support to GCJ

2018-01-01 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83647

Bug ID: 83647
   Summary: add x86_64 Windows support to GCJ
   Product: gcc
   Version: 6.4.0
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: java
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com
  Target Milestone: ---

Created attachment 43002
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43002&action=edit
The changes to natVMConsole.cc are only for MingW. Probably these changes don't
need to be in GCC and can be downstream

GCJ currently doesn't support x86_64 Windows, specifically x86_64-w64-mingw32. 

I have made a patch that supports it. I know GCJ has been removed from GCC 7,
but I'm hoping this can still make it into GCC 6.


The changes were mostly to change 32-bit ints to pointer-sized ints.
Specifically,

unsigned long -> uintptr_t
jint -> jlong

Changing jint -> jlong is probably not right because that would also change
32-bit builds. I wanted to change those jint to jsize and change jsize from int
to intptr_t, but wasn't sure of the effects.
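
To illustrate why unsigned long is the problem: x86_64-w64-mingw32 uses the
LLP64 data model, where long stays 32 bits while pointers are 64 bits, so only
uintptr_t/intptr_t are sure to hold a pointer there. A small sketch follows.
The helper names are hypothetical and are not taken from libgcj or the patch.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical helpers: convert between a pointer and an integer type that is
// wide enough on both LP64 (Linux) and LLP64 (64-bit Windows) targets.
uintptr_t PointerToInteger(const void *p)
{
  return reinterpret_cast<uintptr_t>(p);
}

const void *IntegerToPointer(uintptr_t value)
{
  return reinterpret_cast<const void *>(value);
}

// Round-trips a pointer through uintptr_t and checks that nothing was lost.
const void *RoundTrip(const void *p)
{
  const void *back = IntegerToPointer(PointerToInteger(p));
  assert(back == p);
  return back;
}

// True when unsigned long could also hold a pointer on this target.
// Expected to be false on x86_64-w64-mingw32, where long is 32 bits.
bool UnsignedLongHoldsPointer()
{
  return sizeof(unsigned long) >= sizeof(void *);
}
```

On a 64-bit Linux build UnsignedLongHoldsPointer() should return true, which is
presumably why the unsigned long casts went unnoticed there.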

The other big change was that I had to replace boehm-gc with a newer version
(7.2e; 7.2g crashes). GCC seems to bundle an out-of-date, custom version that
doesn't support x86_64 Windows and that only builds a static lib. I didn't
include that in the patch, but you can simply drop the newer version into the
source tree.

[Bug tree-optimization/80647] New: vectorized loop crashes from wrongly assuming 16 byte alignment

2017-05-05 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80647

Bug ID: 80647
   Summary: vectorized loop crashes from wrongly assuming 16 byte alignment
   Product: gcc
   Version: 6.1.0
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com
  Target Milestone: ---

Created attachment 41328
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41328&action=edit
compiling with -O3 will reproduce the crash

I'm getting a crash for a function that extracts a sub region of an image
in-place. I compile with gcc -O3, which vectorizes the innermost loop:

while (twd--)
{
  *pintdest++ = *pintsrc++;
}


---assembly-
movdqa (%r10,%rax,1),%xmm0
add    $0x1,%ecx
movups %xmm0,(%rdx,%rax,1)


It crashes on movdqa because the address isn't aligned. It should be using
unaligned vector loads like movdqu or lddqu instead.

I tested it with GCC 4.8 which did vectorize the loop correctly.


Starting with Nehalem, there is no penalty for using unaligned loads/stores if
the vector doesn't span 2 cache lines, so why not always generate unaligned
loads/stores? 

It used to be that the other advantage to exploit for aligned data was to fuse
the vector load/store with another instruction, reducing machine code size. But
even that alignment restriction for memory operands was relaxed starting with
SandyBridge's VEX instructions.

[Bug tree-optimization/80647] vectorized loop crashes from wrongly assuming 16 byte alignment

2017-05-08 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80647

--- Comment #2 from Yale Zhang  ---
Very interesting case. First, I didn't know unaligned loads were undefined
behavior on x86.

ICC 17 doesn't vectorize the loop probably because the destination and source
of the memmove() alias.

But apparently GCC knows how to vectorize memmove(). In this function, the
destination always comes before the source, so it's trivial to vectorize.
Vectorizing the case where destination > source is harder, and I wonder if GCC
can do that.


This is some legacy code from > 10 years ago. Manually vectorizing the
memmove() was too smart for modern compilers.

But the solution is simple. I'll just use the other simple, fallback
implementation used on unknown platforms. It's still vectorizable though.

Thanks, Andrew.
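
A rough sketch of what such a fallback could look like. This is not the code
from the attachment; the function name, the parameters, and the byte-based
layout are all assumptions. It copies each row with memmove(), which makes no
alignment assumptions and tolerates the overlap between source and destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical in-place crop: moves `rows` rows of `row_bytes` bytes, starting
// at byte offset `first_byte` in a buffer whose rows are `src_stride` bytes
// apart, to the front of the same buffer, packed back to back.
void CropInPlace(unsigned char *image, std::size_t src_stride,
                 std::size_t first_byte, std::size_t row_bytes,
                 std::size_t rows)
{
  assert(row_bytes <= src_stride);
  unsigned char *dest = image;
  const unsigned char *src = image + first_byte;
  for (std::size_t r = 0; r < rows; ++r)
  {
    // dest never runs ahead of src here, but memmove is safe either way.
    std::memmove(dest, src, row_bytes);
    dest += row_bytes;
    src += src_stride;
  }
}
```

Because the copy goes through unsigned char pointers, the compiler is not
entitled to assume any int or vector alignment for the addresses involved.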

[Bug inline-asm/77756] New: cpuid

2016-09-27 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77756

Bug ID: 77756
   Summary: cpuid
   Product: gcc
   Version: 6.2.0
    Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: inline-asm
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com
  Target Milestone: ---

Created attachment 39696
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39696&action=edit
should print 99 when run on an AVX2 capable processor

I've found a bug in __get_cpuid() in the compiler internal header, cpuid.h.

I wish to detect if the CPU supports AVX2, but when I call __get_cpuid(7, ...),
EBX is all zeros. The problem is for level 7, ECX must be set to 0 before
calling cpuid.

As a work around, I've added "xor %%ecx, %%ecx" to __cpuid() and that took care
of the problem:

#define __cpuid(level, a, b, c, d) \
  __asm__("xor %%ecx, %%ecx\n" \
  "cpuid\n" \
  : "=a"(a), "=b"(b), "=c"(c), "=d"(d) \
  : "0"(level))

It looks like Intel only started requiring this for level 7. Who knows if
they'll require setting ECX to other values for future levels, but for now, it
seems always setting ECX to 0 is OK.
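
An alternative that avoids patching the header is to pass the subleaf
explicitly. The sketch below assumes that cpuid.h provides the __cpuid_count()
macro and __get_cpuid_max(), which should be checked against the installed
header. The helper names are made up, and the check only looks at the CPU's
feature bit, not at whether the OS saves the YMM registers.

```cpp
#include <cassert>
#include <cpuid.h>

// Hypothetical helper: runs CPUID for `leaf` with ECX set to `subleaf` and
// stores EAX, EBX, ECX, EDX in regs[0..3]. Returns false if the CPU does not
// implement that basic leaf.
bool QueryCpuid(unsigned int leaf, unsigned int subleaf, unsigned int regs[4])
{
  assert(regs != 0);
  if (__get_cpuid_max(0, 0) < leaf)
  {
    return false;
  }
  __cpuid_count(leaf, subleaf, regs[0], regs[1], regs[2], regs[3]);
  return true;
}

// AVX2 is reported in leaf 7, subleaf 0, EBX bit 5.
bool CpuReportsAvx2()
{
  unsigned int regs[4] = {0, 0, 0, 0};
  return QueryCpuid(7, 0, regs) && (regs[1] & (1u << 5)) != 0;
}
```

Keeping the subleaf as an explicit argument means the same helper would still
work if a later leaf needs a nonzero ECX.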

One mystery is why GCC's builtin AVX2 auto detection for function
multiversioning, which uses __get_cpuid(), works (see multiversioning.cpp).
However, I can't use multiversioning because ifunc hasn't been ported to
Windows, so I have to do manual detection.

I don't think attaching a preprocessed C file is necessary to reproduce this.
Here's the output of gcc -v:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.2.0-4'
--with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc --enable-multiarch --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20160914 (Debian 6.2.0-4)

[Bug other/77769] New: function generated for OpenMP region uses wrong instruction set

2016-09-27 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77769

Bug ID: 77769
   Summary: function generated for OpenMP region uses wrong instruction set
   Product: gcc
   Version: 6.2.0
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com
  Target Milestone: ---

Created attachment 39707
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39707&action=edit
compilation will fail with "target specific option mismatch"

Greetings, I'm trying to write AVX2 SIMD intrinsics code that will be dynamically
dispatched at runtime through a function pointer. The code has to work for
vanilla x86_64 processors, so I can't use -mavx2.

Instead, I use #pragma GCC target("avx2") to target AVX2 for selected
functions. The bug is that whenever I call AVX or AVX2 intrinsics inside an
OpenMP region, I get the error, "target specific option mismatch"

If I move the intrinsics code to another function, it can compile, but if I
mark that function with __attribute__((always_inline)), the compilation fails
with the same error.

So, my conclusion is that the OpenMP code generator is still targeting vanilla
x86_64, instead of AVX2. I would appreciate it if someone could work on fixing
this.
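
For comparison, the dispatch pattern described above can be sketched without
OpenMP as follows. This is not the attached test case. The function names are
made up, and the sketch relies on two GCC extensions: the per-function
__attribute__((target("avx2"))) and __builtin_cpu_supports().

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

typedef void (*AddFn)(const int *, const int *, int *, std::size_t);

// Baseline version, built with the default (vanilla x86_64) target.
void AddScalar(const int *a, const int *b, int *out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
  {
    out[i] = a[i] + b[i];
  }
}

// AVX2 version: only this function is compiled with AVX2 enabled.
__attribute__((target("avx2")))
void AddAvx2(const int *a, const int *b, int *out, std::size_t n)
{
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8)
  {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + i));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(out + i),
                        _mm256_add_epi32(va, vb));
  }
  for (; i < n; ++i)  // scalar tail
  {
    out[i] = a[i] + b[i];
  }
}

// Picks the implementation based on what the running CPU supports.
AddFn SelectAdd()
{
  return __builtin_cpu_supports("avx2") ? AddAvx2 : AddScalar;
}

// Entry point: resolves the function pointer once, then calls through it.
void Add(const int *a, const int *b, int *out, std::size_t n)
{
  assert((a != 0 && b != 0 && out != 0) || n == 0);
  static const AddFn impl = SelectAdd();
  impl(a, b, out, n);
}
```

Callers only ever see Add(), so the AVX2 code is reached through the pointer
and never has to be inlined into a function built for the baseline target.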

command line:
g++ -O3 -fopenmp openmp_wrong_target_isa.cpp


Sorry, I couldn't include the preprocessed file as requested because it exceeds
the upload limit.

gcc -v output:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.2.0-4'
--with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc --enable-multiarch --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 6.2.0 20160914 (Debian 6.2.0-4)

[Bug middle-end/77769] function generated for OpenMP region uses wrong instruction set

2016-09-28 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77769

--- Comment #3 from Yale Zhang  ---
(In reply to Richard Biener from comment #2)
> The testcase you attached can't work because we can't inline an avx2
> function into a function not having avx2 enabled.

Right, but main() and the OpenMP function should have AVX2 enabled because they
come after #pragma GCC target("avx2") which is still in effect.

If the target("avx2") pragma were surrounded by #pragma GCC
push_options/pop_options, then main() would not have AVX2 enabled.

[Bug target/77756] __get_cpuid() returns wrong values for level 7 (extended features)

2016-09-28 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77756

--- Comment #2 from Yale Zhang  ---
(In reply to Uroš Bizjak from comment #1)
> Created attachment 39711 [details]
> Patch that fixes __get_cpuid
> 
> Can you please check if the attached patch fixes your problem?

Great, your patch works. Thanks for taking care of it so quickly.

I see you made it flexible by setting ECX to 0 only for certain levels, without
increasing machine code size since __get_cpuid() is inlined and most of the
unused cases will get thrown away as dead code.

But does level 13 really exist? I don't see any documentation for it.

Also, any idea why the AVX2 auto detection used for function multiversioning
was working earlier, which used __get_cpuid()? Was it just by chance?

[Bug target/77756] __get_cpuid() returns wrong values for level 7 (extended features)

2016-09-30 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77756

--- Comment #12 from Yale Zhang  ---
What's the purpose of subleaf? Is it to distinguish the capabilities of
different cores in a heterogeneous chip (e.g. ARM big.LITTLE)?

Then I would be fine with making this an extra parameter to __get_cpuid(). 

Microsoft has a __cpuidex() function that also takes subleaf, but for the
regular __cpuid(), the subleaf defaults to 0. Should __get_cpuid() default to 0
as well?


(In reply to uros from comment #11)
> Author: uros
> Date: Thu Sep 29 18:44:32 2016
> New Revision: 240629
> 
> URL: https://gcc.gnu.org/viewcvs?rev=240629&root=gcc&view=rev
> Log:
>   PR target/77756
>   * config/i386/cpuid.h (__get_cpuid_count): New.
>   (__get_cpuid): Rename __level to __leaf.
> 
> testsuite/ChangeLog:
> 
>   PR target/77756
>   * gcc.target/i386/pr77756.c: New test.
> 
> 
> Modified:
> trunk/gcc/ChangeLog
> trunk/gcc/config/i386/cpuid.h
> trunk/gcc/testsuite/ChangeLog
> trunk/gcc/testsuite/gcc.target/i386/pr77756.c

[Bug other/61417] New: can't use intrinsic function as argument to function template

2014-06-04 Thread yzhang1985 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61417

Bug ID: 61417
   Summary: can't use intrinsic function as argument to function template
   Product: gcc
   Version: 4.9.0
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yzhang1985 at gmail dot com

This program isn't compiling. It works in GCC 4.3, 4.8, and in the Intel
compiler. The problem is that GCC fails to inline the _mm_cmpgt_epi8 function
(which is not compiled into its own symbol), thinking it's defined externally.

#include 
#include 
#define FORCE_INLINE __attribute__ ((always_inline))

__m128i g_results;

typedef __m128i TwoOperandVectorFunction(__m128i, __m128i);
FORCE_INLINE void IntrinsicBench(TwoOperandVectorFunction f)
{
  __m128i r0, r1, r2;
  for (int i = 0; i < 20; i += 16)
  {
    r0 = f(r1, r2);
  }
  g_results = r0;
}
int main(int argc, char **argv)
{
  IntrinsicBench(_mm_cmpgt_epi8);
  return 0;
}
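
One possible workaround, untested against 4.9.0 and not confirmed anywhere in
this thread, is to wrap the intrinsic in a small functor. The call to
_mm_cmpgt_epi8 then appears directly in a function body and no pointer to the
intrinsic is needed. CmpGtEpi8 is a made-up name, IntrinsicBench is turned into
a template for this sketch, and the operands are passed in explicitly so that
the inputs are defined.

```cpp
#include <cassert>
#include <emmintrin.h>

// Thin wrapper with an ordinary definition, so passing it around does not
// require taking the address of the intrinsic itself.
struct CmpGtEpi8
{
  __m128i operator()(__m128i a, __m128i b) const
  {
    return _mm_cmpgt_epi8(a, b);
  }
};

// Applies f to (r1, r2) `iterations` times and returns the last result.
template <typename Functor>
__m128i IntrinsicBench(Functor f, __m128i r1, __m128i r2, int iterations)
{
  assert(iterations > 0);
  __m128i r0 = _mm_setzero_si128();
  for (int i = 0; i < iterations; ++i)
  {
    r0 = f(r1, r2);
  }
  return r0;
}
```

A call would look like IntrinsicBench(CmpGtEpi8(), a, b, 2), with the result
stored wherever the benchmark needs it.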

I'm using GCC 4.9 (x86_64) configured with ./configure --prefix=/opt/gcc4.9
--with-gmp-include=/home/yale/gmp-5.1.2
--with-gmp-lib=/home/yale/gmp-5.1.2/.libs
--with-mpfr-include=/home/yale/mpfr-3.1.2/src
--with-mpfr-lib=/home/yale/mpfr-3.1.2/src/.libs
--with-mpc-include=/home/yale/mpc-1.0.1/src
--with-mpc-lib=/home/yale/mpc-1.0.1/src/.libs --enable-languages=c,c++,java
--with-multilib-list=m32,m64 --enable-libgcj --enable-libgcj-multifile
--enable-static-libjava --disable-java-awt --disable-libgcj-debug
--disable-jvmpi --disable-bootstrap --disable-nls --disable-multilib