Re: [julia-users] Is FMA/Muladd Working Here?

Yichao Yu Wed, 21 Sep 2016 19:11:55 -0700

On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter <[email protected]> wrote:
> I confirm that I can't get Julia to synthesize a `vfmadd` instruction
> either... Sorry for sending you on a wild goose chase.


-march=haswell does the trick for C (both clang and gcc).
The fusion happens in a machine IR optimization (this is not an LLVM
IR optimization pass). The necessary bits for it are the llc option
-mcpu=haswell and the function attribute unsafe-fp-math=true.
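
For anyone who wants to check this on their own machine, here is a minimal
sketch rather than a definitive test. It assumes a recent Julia 1.x (the thread
itself concerns 0.5/0.6), and `uses_fma` is a hypothetical helper written for
this example. It searches the generated assembly for a fused multiply-add
mnemonic, then shows the rounding difference that keeps the compiler from
fusing a plain multiply and add on its own. The multiplier is 2.5 rather than
2.0, following the suggestion quoted below to use a less round number.

```julia
# Sketch only: assumes Julia 1.x; `uses_fma` is a hypothetical helper.
using InteractiveUtils  # provides code_native outside the REPL

f(x) = 2.5x + 3.0             # plain multiply and add
g(x) = muladd(x, 2.5, 3.0)    # fusion permitted, not required
h(x) = fma(x, 2.5, 3.0)       # fused result required (software fallback without hardware FMA)

# True if the generated assembly mentions a fused multiply-add mnemonic
# (vfmadd* on x86, fmadd on AArch64). The answer depends on the CPU and the Julia build.
uses_fma(fn) = occursin("fmadd", sprint(code_native, fn, (Float64,)))

for (name, fn) in (("f", f), ("g", g), ("h", h))
    println(name, ": fused instruction in native code? ", uses_fma(fn))
end

# Why fusion changes results: a*a is exactly 1 + 2^-29 + 2^-60, which rounds
# to 1 + 2^-29 when the product is stored before the subtraction.
a = 1.0 + 2.0^-30
c = 1.0 + 2.0^-29
println(a * a - c)        # two roundings: 0.0
println(fma(a, a, -c))    # one rounding: 2.0^-60
```

The last two lines are why the fused form is not a drop-in replacement under
strict IEEE semantics: the two expressions legitimately differ, so the compiler
needs explicit permission (`muladd`, or the unsafe-fp-math attribute mentioned
above) before it may contract them.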

>
> -erik
>
> On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu <[email protected]> wrote:
>>
>> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter <[email protected]>
>> wrote:
>> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <[email protected]>
>> > wrote:
>> >>
>> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
>> >> I get results where g and h apply muladd/fma in the native code, but a new
>> >> function k which is `@fastmath` inside of f does not apply muladd/fma.
>> >>
>> >>
>> >> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>> >>
>> >> Should I open an issue?
>> >
>> >
>> > In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
>> > than `2x+3`. If you use a less round number than `2` multiplying `x`, you
>> > might see a different behaviour.
>>
>> I've personally never seen llvm create fma from mul and add. We might
>> not have the llvm passes enabled if LLVM is capable of doing this at
>> all.
>>
>> >
>> > -erik
>> >
>> >
>> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>> >> for some reason, so I may need to just build from source.
>> >>
>> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
>> >> wrote:
>> >>>
>> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>   First of all, does LLVM essentially apply fma or muladd to expressions
>> >>>> like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly
>> >>>> use `muladd` and `fma` on these types of expressions (is there a macro for
>> >>>> making this easier)?
>> >>>
>> >>>
>> >>> Yes, LLVM will use fma machine instructions -- but only if they lead to
>> >>> the same round-off error as using separate multiply and add instructions. If
>> >>> you do not care about the details of conforming to the IEEE standard, then
>> >>> you can use the `@fastmath` macro that enables several optimizations,
>> >>> including this one. This is described in the manual
>> >>>
>> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>> >>>
>> >>>
>> >>>>   Secondly, I am wondering if my setup is not applying these operations
>> >>>> correctly. Here's my test code:
>> >>>>
>> >>>> f(x) = 2.0x + 3.0
>> >>>> g(x) = muladd(x,2.0, 3.0)
>> >>>> h(x) = fma(x,2.0, 3.0)
>> >>>>
>> >>>> @code_llvm f(4.0)
>> >>>> @code_llvm g(4.0)
>> >>>> @code_llvm h(4.0)
>> >>>>
>> >>>> @code_native f(4.0)
>> >>>> @code_native g(4.0)
>> >>>> @code_native h(4.0)
>> >>>>
>> >>>> Computer 1
>> >>>>
>> >>>> Julia Version 0.5.0-rc4+0
>> >>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>> >>>> Platform Info:
>> >>>>   System: Linux (x86_64-redhat-linux)
>> >>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> >>>>   WORD_SIZE: 64
>> >>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>> >>>>   LAPACK: libopenblasp.so.0
>> >>>>   LIBM: libopenlibm
>> >>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>> >>>
>> >>>
>> >>> This looks good, the "broadwell" architecture that LLVM uses should imply
>> >>> the respective optimizations. Try with `@fastmath`.
>> >>>
>> >>> -erik
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>> (the COPR nightly on CentOS7) with
>> >>>>
>> >>>> [crackauc@crackauc2 ~]$ lscpu
>> >>>> Architecture:          x86_64
>> >>>> CPU op-mode(s):        32-bit, 64-bit
>> >>>> Byte Order:            Little Endian
>> >>>> CPU(s):                16
>> >>>> On-line CPU(s) list:   0-15
>> >>>> Thread(s) per core:    1
>> >>>> Core(s) per socket:    8
>> >>>> Socket(s):             2
>> >>>> NUMA node(s):          2
>> >>>> Vendor ID:             GenuineIntel
>> >>>> CPU family:            6
>> >>>> Model:                 79
>> >>>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> >>>> Stepping:              1
>> >>>> CPU MHz:               1200.000
>> >>>> BogoMIPS:              6392.58
>> >>>> Virtualization:        VT-x
>> >>>> L1d cache:             32K
>> >>>> L1i cache:             32K
>> >>>> L2 cache:              256K
>> >>>> L3 cache:              25600K
>> >>>> NUMA node0 CPU(s):     0-7
>> >>>> NUMA node1 CPU(s):     8-15
>> >>>>
>> >>>>
>> >>>>
>> >>>> I get the output
>> >>>>
>> >>>> define double @julia_f_72025(double) #0 {
>> >>>> top:
>> >>>>   %1 = fmul double %0, 2.000000e+00
>> >>>>   %2 = fadd double %1, 3.000000e+00
>> >>>>   ret double %2
>> >>>> }
>> >>>>
>> >>>> define double @julia_g_72027(double) #0 {
>> >>>> top:
>> >>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >>>>   ret double %1
>> >>>> }
>> >>>>
>> >>>> define double @julia_h_72029(double) #0 {
>> >>>> top:
>> >>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >>>>   ret double %1
>> >>>> }
>> >>>> .text
>> >>>> Filename: fmatest.jl
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> Source line: 1
>> >>>> addsd %xmm0, %xmm0
>> >>>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>> >>>> addsd (%rax), %xmm0
>> >>>> popq %rbp
>> >>>> retq
>> >>>> nopl (%rax,%rax)
>> >>>> .text
>> >>>> Filename: fmatest.jl
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> Source line: 2
>> >>>> addsd %xmm0, %xmm0
>> >>>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>> >>>> addsd (%rax), %xmm0
>> >>>> popq %rbp
>> >>>> retq
>> >>>> nopl (%rax,%rax)
>> >>>> .text
>> >>>> Filename: fmatest.jl
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>> >>>> Source line: 3
>> >>>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>> >>>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>> >>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> >>>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>> >>>> popq %rbp
>> >>>> jmpq *%rax
>> >>>> nopl (%rax)
>> >>>>
>> >>>> It looks like explicit muladd or not ends up at the same native code,
>> >>>> but is that native code actually doing an fma? The fma native is different,
>> >>>> but from a discussion on the Gitter it seems that might be a software FMA?
>> >>>> This computer is set up with the BIOS setting as LAPACK optimized or
>> >>>> something like that, so is that messing with something?
>> >>>>
>> >>>> Computer 2
>> >>>>
>> >>>> Julia Version 0.6.0-dev.557
>> >>>> Commit c7a4897 (2016-09-08 17:50 UTC)
>> >>>> Platform Info:
>> >>>>   System: NT (x86_64-w64-mingw32)
>> >>>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>> >>>>   WORD_SIZE: 64
>> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>> >>>>   LAPACK: libopenblas64_
>> >>>>   LIBM: libopenlibm
>> >>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>> >>>>
>> >>>>
>> >>>> on a 4770k i7, Windows 10, I get the output
>> >>>>
>> >>>> ; Function Attrs: uwtable
>> >>>> define double @julia_f_66153(double) #0 {
>> >>>> top:
>> >>>>   %1 = fmul double %0, 2.000000e+00
>> >>>>   %2 = fadd double %1, 3.000000e+00
>> >>>>   ret double %2
>> >>>> }
>> >>>>
>> >>>> ; Function Attrs: uwtable
>> >>>> define double @julia_g_66157(double) #0 {
>> >>>> top:
>> >>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >>>>   ret double %1
>> >>>> }
>> >>>>
>> >>>> ; Function Attrs: uwtable
>> >>>> define double @julia_h_66158(double) #0 {
>> >>>> top:
>> >>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >>>>   ret double %1
>> >>>> }
>> >>>> .text
>> >>>> Filename: console
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> Source line: 1
>> >>>> addsd %xmm0, %xmm0
>> >>>> movabsq $534749456, %rax        # imm = 0x1FDFA110
>> >>>> addsd (%rax), %xmm0
>> >>>> popq %rbp
>> >>>> retq
>> >>>> nopl (%rax,%rax)
>> >>>> .text
>> >>>> Filename: console
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> Source line: 2
>> >>>> addsd %xmm0, %xmm0
>> >>>> movabsq $534749584, %rax        # imm = 0x1FDFA190
>> >>>> addsd (%rax), %xmm0
>> >>>> popq %rbp
>> >>>> retq
>> >>>> nopl (%rax,%rax)
>> >>>> .text
>> >>>> Filename: console
>> >>>> pushq %rbp
>> >>>> movq %rsp, %rbp
>> >>>> movabsq $534749712, %rax        # imm = 0x1FDFA210
>> >>>> Source line: 3
>> >>>> movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>> >>>> movabsq $534749720, %rax        # imm = 0x1FDFA218
>> >>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> >>>> movabsq $fma, %rax
>> >>>> popq %rbp
>> >>>> jmpq *%rax
>> >>>> nop
>> >>>>
>> >>>> This seems to be similar to the first result.
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Erik Schnetter <[email protected]>
>> >>> http://www.perimeterinstitute.ca/personal/eschnetter/
>> >
>> >
>> >
>> >
>> > --
>> > Erik Schnetter <[email protected]>
>> > http://www.perimeterinstitute.ca/personal/eschnetter/
>
>
>
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
