Hmm. On the other hand I'm seeing this:

```
julia> a = rand(5000);
julia> b = rand(5000);
julia> c = rand(5000) - 0.5;
julia> d = rand(5000) - 1;
```
```
julia> @time essai(200,a,b);
 14.561813 seconds (5 allocations: 1.922 KB)
julia> @time essai(200,a,c);
 12.003167 seconds (5 allocations: 1.922 KB)
julia> @time essai(200,a,d);
  9.199016 seconds (5 allocations: 1.922 KB)
```

So I concede it's not such a big part after all.

On Friday, September 9, 2016 at 1:04:32 AM UTC+2, DNF wrote:
>
> But if branch prediction doesn't factor in, what is the explanation of
> this:
>
> ```
> julia> a = rand(5000);
> julia> b = rand(5000);
> julia> c = rand(5000) + 0.5;
> julia> d = rand(5000) + 1;
>
> julia> @time essai(200,a,b);
>  14.607105 seconds (5 allocations: 1.922 KB)
> julia> @time essai(200,a,c);
>   8.357925 seconds (5 allocations: 1.922 KB)
> julia> @time essai(200,a,d);
>   3.159876 seconds (5 allocations: 1.922 KB)
> ```
>
> On Friday, September 9, 2016 at 12:53:46 AM UTC+2, Yichao Yu wrote:
>>
>> Shape is irrelevant since it doesn't affect the order in the loop at all.
>>
>> Branch prediction is not the issue here.
>>
>> The issue is optimizing memory access and SIMD.
>>
>> It is illegal to optimize the original code into `a[k] += ss1 > ss2`. It
>> is legal to optimize the `if ss1 > ss2; ak += 1; end` version to
>> `ak += ss1 > ss2`, and this is the optimization LLVM should do but
>> doesn't in this case.
>>
>> Also, the thing to look for when checking whether there's vectorization
>> in the LLVM IR is a vector type in the loop body, like:
>>
>> ```
>> %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
>> %offset.idx = or i64 %index, 1
>> %20 = add i64 %offset.idx, -1
>> %21 = getelementptr i64, i64* %19, i64 %20
>> %22 = bitcast i64* %21 to <4 x i64>*
>> store <4 x i64> zeroinitializer, <4 x i64>* %22, align 8
>> %23 = getelementptr i64, i64* %21, i64 4
>> %24 = bitcast i64* %23 to <4 x i64>*
>> store <4 x i64> zeroinitializer, <4 x i64>* %24, align 8
>> %25 = getelementptr i64, i64* %21, i64 8
>> %26 = bitcast i64* %25 to <4 x i64>*
>> store <4 x i64> zeroinitializer, <4 x i64>* %26, align 8
>> %27 = getelementptr i64, i64* %21, i64 12
>> %28 = bitcast i64* %27 to <4 x i64>*
>> store <4 x i64> zeroinitializer, <4 x i64>* %28, align 8
>> %index.next = add i64 %index, 16
>> %29 = icmp eq i64 %index.next, %n.vec
>> ```
>>
>> Having a BB named `vector.body` doesn't mean the loop is vectorized.
>>
>> On Thu, Sep 8, 2016 at 6:40 PM, 'Greg Plowman' via julia-users
>> <[email protected]> wrote:
>>
>>> The difference is probably SIMD.
>>>
>>> The branchy code will not use SIMD.
>>>
>>> Either of these should eliminate the branch and allow SIMD:
>>>
>>> ```
>>> ak += ss1 > ss2
>>> ak += ifelse(ss1 > ss2, 1, 0)
>>> ```
>>>
>>> Check with `@code_llvm`; look for the section `vector.body`.
>>>
>>> at 5:45:30 AM UTC+10, Dupont wrote:
>>>
>>>> What is strange to me is that this is much slower:
>>>>
>>>> ```
>>>> function essai(n, s1, s2)
>>>>     a = Vector{Int64}(n)
>>>>     @inbounds for k = 1:n
>>>>         ak = 0
>>>>         for ss1 in s1, ss2 in s2
>>>>             if ss1 > ss2
>>>>                 ak += 1
>>>>             end
>>>>         end
>>>>         a[k] = ak
>>>>     end
>>>> end
>>>> ```
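For reference, the branchless rewrite Greg and Yichao describe can be sketched as below. The name `essai_branchless` and the `return a` are additions of mine (the thread's `essai` returns nothing), and `Vector{Int64}(undef, n)` is the Julia 1.x spelling of the 0.5-era `Vector{Int64}(n)`:

```julia
# Original branchy version, updated to Julia 1.x syntax and returning `a`
# so the two variants can be compared.
function essai(n, s1, s2)
    a = Vector{Int64}(undef, n)
    @inbounds for k = 1:n
        ak = 0
        for ss1 in s1, ss2 in s2
            if ss1 > ss2
                ak += 1
            end
        end
        a[k] = ak
    end
    return a
end

# Branchless variant: `ss1 > ss2` is a Bool, which promotes to Int when
# added, so the inner loop has no conditional and can be vectorized.
function essai_branchless(n, s1, s2)
    a = Vector{Int64}(undef, n)
    @inbounds for k = 1:n
        ak = 0
        for ss1 in s1, ss2 in s2
            ak += ss1 > ss2
        end
        a[k] = ak
    end
    return a
end
```

Both compute the same counts; only the branchless one gives LLVM a straight-line loop body to work with.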

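And a minimal sketch of Yichao's IR check, assuming Julia 1.x (where `code_llvm` lives in `InteractiveUtils`). The helper name `looks_vectorized` is mine, and whether it actually reports `true` depends on your CPU and LLVM version:

```julia
using InteractiveUtils  # provides code_llvm on Julia 1.x

# Dump the optimized LLVM IR for `f` at the given argument types and scan
# for actual SIMD vector types such as `<4 x i64>` -- per the point above,
# a basic block labelled `vector.body` alone does not prove vectorization.
function looks_vectorized(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes)
    occursin(r"<\d+ x ", String(take!(io)))
end

# Branchless pairwise comparison count, as suggested upthread.
count_gt(s1, s2) = sum(ss1 > ss2 for ss1 in s1, ss2 in s2)

looks_vectorized(count_gt, Tuple{Vector{Float64},Vector{Float64}})
```

Scanning for vector types rather than label names avoids the false positive Yichao warns about.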