Re: [julia-users] Is FMA/Muladd Working Here?

Chris Rackauckas Wed, 21 Sep 2016 18:23:10 -0700

I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the system image (sysimg), 
and now the native code for g and h does use muladd/fma, but a new function k, 
which puts `@fastmath` inside f, does not apply muladd/fma.


https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910

Should I open an issue?

Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding for 
some reason, so I may need to just build from source.
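
One thing worth ruling out, sketched here for Julia 1.x rather than the 0.5/0.6 builds in this thread: `@fastmath` is a macro that rewrites only the operators written literally inside the expression it wraps. It has to surround the arithmetic itself; wrapping a call such as `f(x)` leaves the body of `f` compiled exactly as before. The names `k1` and `k2` are hypothetical stand-ins, since the gist's `k` is not reproduced here.

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

# Plain definition: IEEE semantics, so LLVM keeps a separate multiply and add.
f(x) = 2.0x + 3.0

# @fastmath around the arithmetic: `*` and `+` become fast-math variants,
# which allows LLVM to contract them into a fused multiply-add.
k1(x) = @fastmath 2.0x + 3.0

# @fastmath around a call: only operators written inside the macro's argument
# are rewritten, so the body of `f` is unaffected.
k2(x) = @fastmath f(x)

# Show what the macro produces for each expression (nothing is evaluated here).
println(@macroexpand @fastmath 2.0x + 3.0)  # calls to Base.FastMath.*_fast functions
println(@macroexpand @fastmath f(x))        # still just f(x)

# The IR for k1 should carry fast-math flags on its fmul/fadd instructions.
@code_llvm k1(4.0)
```

Whether LLVM then emits a fused instruction for `k1` still depends on the CPU target the code is compiled for.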

On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter wrote:
>
> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas wrote:
>
>> Hi,
>>   First of all, does LLVM automatically turn expressions like 
>> `a1*x1 + a2*x2 + a3*x3 + a4*x4` into fma or muladd? Or do I have to use 
>> `muladd` and `fma` explicitly for these kinds of operations (and is there a 
>> macro that makes this easier)?
>>
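One way to get fused operations without relying on the optimizer is to chain `muladd` explicitly. The helper `muladd_dot` and the function `manual` below are hypothetical (not part of Base) and only sketch the idea for Julia 1.x. Both accumulate in a different order than the left-to-right `+` chain, so for general inputs the result may differ in the last bits.

```julia
# Hypothetical helper (not part of Base): accumulate sum(a[i] * x[i]) through
# explicit muladd calls, so each step is eligible for a fused multiply-add.
function muladd_dot(a::AbstractVector, x::AbstractVector)
    s = zero(promote_type(eltype(a), eltype(x)))
    for i in eachindex(a, x)   # throws if a and x have different indices
        s = muladd(a[i], x[i], s)
    end
    return s
end

# The four-term expression from the question, written out by hand:
# a1*x1 + a2*x2 + a3*x3 + a4*x4 as nested muladd calls.
manual(a1, x1, a2, x2, a3, x3, a4, x4) =
    muladd(a1, x1, muladd(a2, x2, muladd(a3, x3, a4 * x4)))

println(muladd_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
println(manual(1.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0, 8.0))          # 70.0
```
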
>
> Yes, LLVM will use fma machine instructions -- but only if they lead to 
> the same round-off error as using separate multiply and add instructions. 
> If you do not care about the details of conforming to the IEEE standard, 
> then you can use the `@fastmath` macro that enables several optimizations, 
> including this one. This is described in the manual 
> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>
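To illustrate the round-off point: a fused multiply-add rounds once, while a separate multiply and add round twice, so the two can give different answers. That is why LLVM will not fuse on its own under strict IEEE semantics, whereas `muladd` explicitly permits either result. A minimal sketch for Julia 1.x; the inputs are just a convenient example, not taken from the thread.

```julia
# 0.1 is not exactly representable: as a Float64 it is slightly larger than 1/10.
a, b, c = 0.1, 10.0, -1.0

# Separate multiply and add: the product a*b is rounded to 1.0 first,
# so the tiny excess is lost before the subtraction.
separate = a * b + c

# Fused multiply-add: a*b + c is computed exactly and rounded once,
# so the excess (2^-54) survives.
fused = fma(a, b, c)

# muladd lets the compiler pick either behaviour, depending on the hardware.
either = muladd(a, b, c)

println(separate)  # 0.0
println(fused)     # 5.551115123125783e-17
println(either)
```
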
>
>>   Secondly, I am wondering whether my setup is not applying these operations 
>> correctly. Here's my test code:
>>
>> f(x) = 2.0x + 3.0
>> g(x) = muladd(x,2.0, 3.0)
>> h(x) = fma(x,2.0, 3.0)
>>
>> @code_llvm f(4.0)
>> @code_llvm g(4.0)
>> @code_llvm h(4.0)
>>
>> @code_native f(4.0)
>> @code_native g(4.0)
>> @code_native h(4.0)
>>
>> *Computer 1*
>>
>> Julia Version 0.5.0-rc4+0
>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>> Platform Info:
>>   System: Linux (x86_64-redhat-linux)
>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>   LAPACK: libopenblasp.so.0
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>
>
> This looks good, the "broadwell" architecture that LLVM uses should imply 
> the respective optimizations. Try with `@fastmath`.
>
> -erik
>
>
>
>  
>
>> (the COPR nightly on CentOS7) with 
>>
>> [crackauc@crackauc2 ~]$ lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                16
>> On-line CPU(s) list:   0-15
>> Thread(s) per core:    1
>> Core(s) per socket:    8
>> Socket(s):             2
>> NUMA node(s):          2
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 79
>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> Stepping:              1
>> CPU MHz:               1200.000
>> BogoMIPS:              6392.58
>> Virtualization:        VT-x
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              25600K
>> NUMA node0 CPU(s):     0-7
>> NUMA node1 CPU(s):     8-15
>>
>>
>>
>> I get the output
>>
>> define double @julia_f_72025(double) #0 {
>> top:
>>   %1 = fmul double %0, 2.000000e+00
>>   %2 = fadd double %1, 3.000000e+00
>>   ret double %2
>> }
>>
>> define double @julia_g_72027(double) #0 {
>> top:
>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, 
>> double 3.000000e+00)
>>   ret double %1
>> }
>>
>> define double @julia_h_72029(double) #0 {
>> top:
>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 
>> 3.000000e+00)
>>   ret double %1
>> }
>> .text
>> Filename: fmatest.jl
>> pushq %rbp
>> movq %rsp, %rbp
>> Source line: 1
>> addsd %xmm0, %xmm0
>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>> addsd (%rax), %xmm0
>> popq %rbp
>> retq
>> nopl (%rax,%rax)
>> .text
>> Filename: fmatest.jl
>> pushq %rbp
>> movq %rsp, %rbp
>> Source line: 2
>> addsd %xmm0, %xmm0
>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>> addsd (%rax), %xmm0
>> popq %rbp
>> retq
>> nopl (%rax,%rax)
>> .text
>> Filename: fmatest.jl
>> pushq %rbp
>> movq %rsp, %rbp
>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>> Source line: 3
>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>> popq %rbp
>> jmpq *%rax
>> nopl (%rax)
>>
>> It looks like f (no explicit muladd) and g (explicit muladd) end up as the 
>> same native code, but is that native code actually doing an fma? The native 
>> code for fma is different, but from a discussion on Gitter it seems that 
>> might be a software FMA. This computer is set up with a BIOS setting like 
>> "LAPACK optimized" or something similar, so could that be interfering?
>>
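One way to answer this mechanically is to search the generated assembly for the x86 FMA3 mnemonics (`vfmadd*`). In both listings above, neither f nor g contains such an instruction, and h jumps to an external function (named `fma` in the Windows listing) instead of executing one inline. A minimal sketch for Julia 1.x on x86-64 only (other architectures use different mnemonics); `native_asm` and `uses_hw_fma` are hypothetical helper names.

```julia
using InteractiveUtils  # provides code_native outside the REPL

g(x) = muladd(x, 2.0, 3.0)
h(x) = fma(x, 2.0, 3.0)

# Hypothetical helper: capture the native assembly of a method as a string.
native_asm(f, types) = sprint(io -> code_native(io, f, types))

# Hypothetical helper: hardware FMA on x86-64 appears as vfmadd* instructions.
uses_hw_fma(f, types) = occursin("vfmadd", native_asm(f, types))

println("muladd -> hardware fma: ", uses_hw_fma(g, (Float64,)))
println("fma    -> hardware fma: ", uses_hw_fma(h, (Float64,)))
```

If this reports `false` on a CPU that supports FMA, one plausible cause is that the code was compiled for a generic x86-64 target rather than the host CPU, which relates to the sysimg rebuild mentioned at the top of this thread.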
>> *Computer 2*
>>
>> Julia Version 0.6.0-dev.557
>> Commit c7a4897 (2016-09-08 17:50 UTC)
>> Platform Info:
>>   System: NT (x86_64-w64-mingw32)
>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>   LAPACK: libopenblas64_
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>
>>
>> On an i7-4770K running Windows 10, I get the output
>>
>> ; Function Attrs: uwtable
>> define double @julia_f_66153(double) #0 {
>> top:
>>   %1 = fmul double %0, 2.000000e+00
>>   %2 = fadd double %1, 3.000000e+00
>>   ret double %2
>> }
>>
>> ; Function Attrs: uwtable
>> define double @julia_g_66157(double) #0 {
>> top:
>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, 
>> double 3.000000e+00)
>>   ret double %1
>> }
>>
>> ; Function Attrs: uwtable
>> define double @julia_h_66158(double) #0 {
>> top:
>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 
>> 3.000000e+00)
>>   ret double %1
>> }
>> .text
>> Filename: console
>> pushq %rbp
>> movq %rsp, %rbp
>> Source line: 1
>> addsd %xmm0, %xmm0
>> movabsq $534749456, %rax        # imm = 0x1FDFA110
>> addsd (%rax), %xmm0
>> popq %rbp
>> retq
>> nopl (%rax,%rax)
>> .text
>> Filename: console
>> pushq %rbp
>> movq %rsp, %rbp
>> Source line: 2
>> addsd %xmm0, %xmm0
>> movabsq $534749584, %rax        # imm = 0x1FDFA190
>> addsd (%rax), %xmm0
>> popq %rbp
>> retq
>> nopl (%rax,%rax)
>> .text
>> Filename: console
>> pushq %rbp
>> movq %rsp, %rbp
>> movabsq $534749712, %rax        # imm = 0x1FDFA210
>> Source line: 3
>> movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>> movabsq $534749720, %rax        # imm = 0x1FDFA218
>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> movabsq $fma, %rax
>> popq %rbp
>> jmpq *%rax
>> nop
>>
>> This seems to be similar to the first result.
>>
>>
>
>
> -- 
> Erik Schnetter
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
