I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now g and h do use muladd/fma in the native code, but a new function k, which is f with `@fastmath` applied inside it, does not.
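For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes Julia 1.x (where the reflection helpers live in `InteractiveUtils`), and the definitions of `k` and `k2` are guesses for illustration, not copied from the gist.

```julia
# Assumes Julia 1.x; on 0.5/0.6 code_llvm/code_native were available from Base.
using InteractiveUtils

f(x) = 2.0x + 3.0
g(x) = muladd(x, 2.0, 3.0)
h(x) = fma(x, 2.0, 3.0)

# Assumed definition of k: the body of f wrapped in @fastmath.
k(x) = @fastmath 2.0x + 3.0

# Hypothetical variant with a constant other than 2.0, so the multiply
# cannot be rewritten as x + x before any fusion is attempted.
k2(x) = @fastmath 2.5x + 3.0

# Return the optimized LLVM IR for fn called with one Float64 argument.
llvm_ir(fn) = sprint(io -> code_llvm(io, fn, (Float64,)))

# Return the native assembly for the same call signature.
native_asm(fn) = sprint(io -> code_native(io, fn, (Float64,)))

for (name, fn) in (("f", f), ("g", g), ("h", h), ("k", k), ("k2", k2))
    println("==== ", name, " ====")
    println(llvm_ir(fn))
    println(native_asm(fn))
end
```

In the IR for `k`, the thing to look for is whether `fmul`/`fadd` carry fast-math flags; in the native code, whether a `vfmadd`-style instruction shows up. Note that the native listings quoted below show `addsd %xmm0, %xmm0` for f and g, which suggests the multiplication by 2.0 was turned into x + x, so a constant such as the 2.5 in `k2` may be a more telling test.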
https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910

Should I open an issue? Note that this is on v0.6 on Windows. On Linux the sysimg isn't rebuilding for some reason, so I may need to just build from source.

On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter wrote:
>
> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <[email protected]> wrote:
>
>> Hi,
>> First of all, does LLVM essentially fma or muladd expressions like
>> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use
>> `muladd` and `fma` on these types of instructions (is there a macro for
>> making this easier)?
>>
>
> Yes, LLVM will use fma machine instructions -- but only if they lead to
> the same round-off error as using separate multiply and add instructions.
> If you do not care about the details of conforming to the IEEE standard,
> then you can use the `@fastmath` macro that enables several optimizations,
> including this one. This is described in the manual
> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>
>> Secondly, I am wondering if my setup is not applying these operations
>> correctly. Here's my test code:
>>
>> f(x) = 2.0x + 3.0
>> g(x) = muladd(x,2.0, 3.0)
>> h(x) = fma(x,2.0, 3.0)
>>
>> @code_llvm f(4.0)
>> @code_llvm g(4.0)
>> @code_llvm h(4.0)
>>
>> @code_native f(4.0)
>> @code_native g(4.0)
>> @code_native h(4.0)
>>
>> *Computer 1*
>>
>> Julia Version 0.5.0-rc4+0
>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>> Platform Info:
>>   System: Linux (x86_64-redhat-linux)
>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>   LAPACK: libopenblasp.so.0
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>
>
> This looks good, the "broadwell" architecture that LLVM uses should imply
> the respective optimizations. Try with `@fastmath`.
>
> -erik
>
>> (the COPR nightly on CentOS7) with
>>
>> [crackauc@crackauc2 ~]$ lscpu
>> Architecture: x86_64
>> CPU op-mode(s): 32-bit, 64-bit
>> Byte Order: Little Endian
>> CPU(s): 16
>> On-line CPU(s) list: 0-15
>> Thread(s) per core: 1
>> Core(s) per socket: 8
>> Socket(s): 2
>> NUMA node(s): 2
>> Vendor ID: GenuineIntel
>> CPU family: 6
>> Model: 79
>> Model name: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> Stepping: 1
>> CPU MHz: 1200.000
>> BogoMIPS: 6392.58
>> Virtualization: VT-x
>> L1d cache: 32K
>> L1i cache: 32K
>> L2 cache: 256K
>> L3 cache: 25600K
>> NUMA node0 CPU(s): 0-7
>> NUMA node1 CPU(s): 8-15
>>
>> I get the output
>>
>> define double @julia_f_72025(double) #0 {
>> top:
>>   %1 = fmul double %0, 2.000000e+00
>>   %2 = fadd double %1, 3.000000e+00
>>   ret double %2
>> }
>>
>> define double @julia_g_72027(double) #0 {
>> top:
>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>   ret double %1
>> }
>>
>> define double @julia_h_72029(double) #0 {
>> top:
>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>   ret double %1
>> }
>>
>>   .text
>> Filename: fmatest.jl
>>   pushq %rbp
>>   movq %rsp, %rbp
>> Source line: 1
>>   addsd %xmm0, %xmm0
>>   movabsq $139916162906520, %rax # imm = 0x7F40C5303998
>>   addsd (%rax), %xmm0
>>   popq %rbp
>>   retq
>>   nopl (%rax,%rax)
>>
>>   .text
>> Filename: fmatest.jl
>>   pushq %rbp
>>   movq %rsp, %rbp
>> Source line: 2
>>   addsd %xmm0, %xmm0
>>   movabsq $139916162906648, %rax # imm = 0x7F40C5303A18
>>   addsd (%rax), %xmm0
>>   popq %rbp
>>   retq
>>   nopl (%rax,%rax)
>>
>>   .text
>> Filename: fmatest.jl
>>   pushq %rbp
>>   movq %rsp, %rbp
>>   movabsq $139916162906776, %rax # imm = 0x7F40C5303A98
>> Source line: 3
>>   movsd (%rax), %xmm1 # xmm1 = mem[0],zero
>>   movabsq $139916162906784, %rax # imm = 0x7F40C5303AA0
>>   movsd (%rax), %xmm2 # xmm2 = mem[0],zero
>>   movabsq $139925776008800, %rax # imm = 0x7F43022C8660
>>   popq %rbp
>>   jmpq *%rax
>>   nopl (%rax)
>>
>> It looks like explicit muladd or not ends up at the same native code, but
>> is that native code actually doing an fma? The fma native is different, but
>> from a discussion on the Gitter it seems that might be a software FMA? This
>> computer is set up with the BIOS setting as LAPACK optimized or something
>> like that, so is that messing with something?
>>
>> *Computer 2*
>>
>> Julia Version 0.6.0-dev.557
>> Commit c7a4897 (2016-09-08 17:50 UTC)
>> Platform Info:
>>   System: NT (x86_64-w64-mingw32)
>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>   LAPACK: libopenblas64_
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>
>> on a 4770k i7, Windows 10, I get the output
>>
>> ; Function Attrs: uwtable
>> define double @julia_f_66153(double) #0 {
>> top:
>>   %1 = fmul double %0, 2.000000e+00
>>   %2 = fadd double %1, 3.000000e+00
>>   ret double %2
>> }
>>
>> ; Function Attrs: uwtable
>> define double @julia_g_66157(double) #0 {
>> top:
>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>   ret double %1
>> }
>>
>> ; Function Attrs: uwtable
>> define double @julia_h_66158(double) #0 {
>> top:
>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>   ret double %1
>> }
>>
>>   .text
>> Filename: console
>>   pushq %rbp
>>   movq %rsp, %rbp
>> Source line: 1
>>   addsd %xmm0, %xmm0
>>   movabsq $534749456, %rax # imm = 0x1FDFA110
>>   addsd (%rax), %xmm0
>>   popq %rbp
>>   retq
>>   nopl (%rax,%rax)
>>
>>   .text
>> Filename: console
>>   pushq %rbp
>>   movq %rsp, %rbp
>> Source line: 2
>>   addsd %xmm0, %xmm0
>>   movabsq $534749584, %rax # imm = 0x1FDFA190
>>   addsd (%rax), %xmm0
>>   popq %rbp
>>   retq
>>   nopl (%rax,%rax)
>>
>>   .text
>> Filename: console
>>   pushq %rbp
>>   movq %rsp, %rbp
>>   movabsq $534749712, %rax # imm = 0x1FDFA210
>> Source line: 3
>>   movsd dcabs164_(%rax), %xmm1 # xmm1 = mem[0],zero
>>   movabsq $534749720, %rax # imm = 0x1FDFA218
>>   movsd (%rax), %xmm2 # xmm2 = mem[0],zero
>>   movabsq $fma, %rax
>>   popq %rbp
>>   jmpq *%rax
>>   nop
>>
>> This seems to be similar to the first result.
>>
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
