On Wed, 2018-11-21 at 08:44 +0800, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -65.2% regression of will-it-scale.per_thread_ops due to 
> commit:
> 
> 
> commit: 3c19f2312f48a3d36a4e13f5072a6a95e755b3d5 ("fs/locks: always delete_block after waiting.")
> https://git.kernel.org/cgit/linux/kernel/git/jlayton/linux.git locks-4.21
> 
> in testcase: will-it-scale
> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> with following parameters:
> 
>       nr_task: 100%
>       mode: thread
>       test: lock1
>       ucode: 0xb00002e
>       cpufreq_governor: performance
> 
> test-description: Will It Scale takes a testcase and runs it from 1 through
> to n parallel copies to see if the testcase will scale. It builds both a
> process-based and a thread-based test in order to see any differences
> between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
> 
> 

Looking at the testcase, it just does whole-file exclusive fcntl
lock/unlock cycles. My gut feeling is that the reduction in wakeups is
probably worth some extra processing time here, but maybe it's possible
to do better.
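
For reference, the guts of that testcase boil down to the per-thread
loop below. This is paraphrased from memory rather than copied from the
will-it-scale source, so details like the lock range and whether it
uses F_SETLK or F_SETLKW may differ:

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		struct flock fl = {
			.l_type   = F_WRLCK,	/* exclusive lock */
			.l_whence = SEEK_SET,
			.l_start  = 0,
			.l_len    = 0,		/* 0 == whole file */
		};
		char tmpl[] = "/tmp/lock1-XXXXXX";
		int fd = mkstemp(tmpl);

		if (fd < 0)
			return 1;
		unlink(tmpl);

		/* loops forever; the real harness counts iterations */
		for (;;) {
			fl.l_type = F_WRLCK;
			if (fcntl(fd, F_SETLKW, &fl) < 0)	/* acquire */
				return 1;
			fl.l_type = F_UNLCK;
			if (fcntl(fd, F_SETLK, &fl) < 0)	/* release */
				return 1;
		}
	}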

If I'm looking at the profile below correctly, it looks like we're
spending quite a bit more time spinning on spinlocks. That makes some
sense since we're now taking the blocked_lock_lock in more cases.

I wonder if we could optimize that function away in some cases without
taking the lock? Maybe when fl_blocked_requests is empty and fl_blocker
is NULL? Testing both of those in a race-free way without taking the
spinlock may not be possible, though.
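
Something like this completely untested sketch, say (I'm hand-waving
the helper names and the function signature; the smp_mb() would need
to pair with a barrier on the wakeup side, and whether that pairing
can be made truly race-free is exactly the part I'm unsure about):

	int locks_delete_block(struct file_lock *waiter)
	{
		int status = -ENOENT;

		/*
		 * Hypothetical lockless fast path: if we were never
		 * woken as a blocked waiter (fl_blocker is still NULL)
		 * and nothing is blocked on us, there is nothing to
		 * unlink and we can skip blocked_lock_lock entirely.
		 *
		 * The barrier is meant to keep us from seeing a NULL
		 * fl_blocker while a waker is still splicing entries
		 * off of our fl_blocked_requests list.
		 */
		if (waiter->fl_blocker == NULL) {
			smp_mb();
			if (list_empty(&waiter->fl_blocked_requests))
				return status;
		}

		/* slow path: unchanged */
		spin_lock(&blocked_lock_lock);
		if (waiter->fl_blocker)
			status = 0;
		__locks_wake_up_blocks(waiter);
		__locks_delete_block(waiter);
		spin_unlock(&blocked_lock_lock);
		return status;
	}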

Thoughts?


> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>   gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3b/lock1/will-it-scale/0xb00002e
> 
> commit: 
>   816f2fb5a2 ("fs/locks: allow a lock request to block other requests.")
>   3c19f2312f ("fs/locks: always delete_block after waiting.")
> 
> 816f2fb5a2fc678c 3c19f2312f48a3d36a4e13f507 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>      71447           -65.2%      24854        will-it-scale.per_thread_ops
>     138940            -2.9%     134886        will-it-scale.time.involuntary_context_switches
>     279.85           -64.2%     100.29        will-it-scale.time.user_time
>    6287454           -65.2%    2187242        will-it-scale.workload
>       1.09            -0.7        0.42        mpstat.cpu.usr%
>     371230 ±  4%      +9.5%     406403        softirqs.SCHED
>       1803 ± 16%     +48.9%       2685 ±  8%  numa-meminfo.node0.PageTables
>       2784 ± 10%     -30.7%       1928 ± 12%  numa-meminfo.node1.PageTables
>     224.55            -1.8%     220.57        turbostat.PkgWatt
>       7.70            -3.0%       7.47        turbostat.RAMWatt
>     450.50 ± 17%     +49.0%     671.25 ±  8%  numa-vmstat.node0.nr_page_table_pages
>     644147 ± 10%     -19.8%     516646 ± 11%  numa-vmstat.node0.numa_hit
>     639812 ± 10%     -20.6%     508027 ± 12%  numa-vmstat.node0.numa_local
>     696.25 ± 10%     -30.7%     482.50 ± 12%  numa-vmstat.node1.nr_page_table_pages
>       4617            +2.1%       4715        proc-vmstat.nr_inactive_anon
>       7097            +2.0%       7241        proc-vmstat.nr_mapped
>      20507            +7.0%      21934 ±  3%  proc-vmstat.nr_shmem
>       4617            +2.1%       4715        proc-vmstat.nr_zone_inactive_anon
>     690109            +1.0%     696863        proc-vmstat.numa_hit
>     672911            +1.0%     679694        proc-vmstat.numa_local
>      23133 ±  2%      +8.9%      25196 ±  4%  proc-vmstat.pgactivate
>     607.03 ±  6%     -16.0%     509.80 ± 12%  sched_debug.cfs_rq:/.util_est_enqueued.avg
>      24.42 ± 28%     +38.2%      33.75 ± 22%  sched_debug.cpu.cpu_load[2].max
>       2.20 ± 28%     +39.9%       3.08 ± 23%  sched_debug.cpu.cpu_load[2].stddev
>      25.33 ± 12%     +23.2%      31.21 ±  9%  sched_debug.cpu.cpu_load[3].max
>       2.28 ± 21%     +29.6%       2.95 ± 12%  sched_debug.cpu.cpu_load[3].stddev
>      52140 ± 23%     +37.1%      71510 ±  3%  sched_debug.cpu.nr_switches.max
>      53379 ± 24%     +46.4%      78158 ± 11%  sched_debug.cpu.sched_count.max
>       7132 ± 12%     +32.3%       9436 ± 15%  sched_debug.cpu.sched_count.stddev
>  4.587e+12            -7.5%  4.245e+12        perf-stat.branch-instructions
>       0.24            -0.1        0.15        perf-stat.branch-miss-rate%
>  1.107e+10           -43.0%  6.312e+09        perf-stat.branch-misses
>      40.04            -2.0       38.01        perf-stat.cache-miss-rate%
>  8.415e+09 ±  2%     -19.4%  6.782e+09 ±  6%  perf-stat.cache-misses
>  2.101e+10           -15.1%  1.783e+10 ±  5%  perf-stat.cache-references
>       3.85           +10.7%       4.26        perf-stat.cpi
>       0.00 ±  2%      +0.0        0.00 ±  4%  perf-stat.dTLB-load-miss-rate%
>   90399109 ±  2%      +6.6%   96381582 ±  4%  perf-stat.dTLB-load-misses
>  4.956e+12           -11.6%   4.38e+12        perf-stat.dTLB-loads
>       0.00 ±  8%      +0.0        0.01 ± 24%  perf-stat.dTLB-store-miss-rate%
>  8.789e+11           -61.0%  3.427e+11        perf-stat.dTLB-stores
>      80.76           -10.8       69.98        perf-stat.iTLB-load-miss-rate%
>  3.901e+09           -63.3%   1.43e+09        perf-stat.iTLB-load-misses
>    9.3e+08 ±  6%     -34.0%  6.135e+08 ±  2%  perf-stat.iTLB-loads
>  1.912e+13            -9.6%  1.728e+13        perf-stat.instructions
>       4901          +146.5%      12081        perf-stat.instructions-per-iTLB-miss
>       0.26            -9.6%       0.23        perf-stat.ipc
>      82.36            -3.7       78.70        perf-stat.node-store-miss-rate%
>  2.319e+09           -20.6%  1.842e+09        perf-stat.node-store-misses
>    3041599          +159.7%    7898884        perf-stat.path-length
>      61.02 ± 10%     -61.0        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
>      60.64 ± 10%     -60.6        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl
>      98.79            +0.7       99.50        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>      98.76            +0.7       99.49        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      98.64            +0.8       99.44        perf-profile.calltrace.cycles-pp.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      97.70            +1.5       99.16        perf-profile.calltrace.cycles-pp.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      97.41            +1.6       99.05        perf-profile.calltrace.cycles-pp.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      35.73 ± 18%     +62.7       98.45        perf-profile.calltrace.cycles-pp.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
>       0.00           +65.1       65.07        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk
>       0.00           +65.3       65.28        perf-profile.calltrace.cycles-pp._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl
>       0.00           +65.3       65.31        perf-profile.calltrace.cycles-pp.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl
>       1.13 ±  2%      -0.7        0.43        perf-profile.children.cycles-pp.locks_alloc_lock
>       0.98 ±  2%      -0.6        0.38        perf-profile.children.cycles-pp.kmem_cache_alloc
>       0.59            -0.4        0.23 ±  2%  perf-profile.children.cycles-pp.syscall_return_via_sysret
>       0.53            -0.3        0.20 ±  2%  perf-profile.children.cycles-pp.entry_SYSCALL_64
>       0.35 ± 11%      -0.3        0.06 ±  9%  perf-profile.children.cycles-pp.fput
>       0.41 ±  2%      -0.3        0.15 ±  3%  perf-profile.children.cycles-pp.file_has_perm
>       0.33 ±  2%      -0.2        0.12 ±  6%  perf-profile.children.cycles-pp.memset_erms
>       0.30 ±  3%      -0.2        0.11 ±  3%  perf-profile.children.cycles-pp.security_file_lock
>       0.25 ±  3%      -0.2        0.10 ±  5%  perf-profile.children.cycles-pp.security_file_fcntl
>       0.24 ±  2%      -0.1        0.10 ±  4%  perf-profile.children.cycles-pp._copy_from_user
>       0.22 ± 12%      -0.1        0.07 ±  5%  perf-profile.children.cycles-pp.__fget_light
>       0.21 ±  3%      -0.1        0.08 ±  6%  perf-profile.children.cycles-pp.avc_has_perm
>       0.20 ±  5%      -0.1        0.08        perf-profile.children.cycles-pp.___might_sleep
>       0.16 ±  5%      -0.1        0.06        perf-profile.children.cycles-pp.__fget
>       0.24 ±  3%      -0.1        0.17 ±  2%  perf-profile.children.cycles-pp.kmem_cache_free
>       0.12 ±  5%      -0.1        0.05        perf-profile.children.cycles-pp.__might_sleep
>       0.24 ± 15%      -0.1        0.18        perf-profile.children.cycles-pp.locks_insert_lock_ctx
>       0.11 ±  3%      -0.0        0.10 ±  4%  perf-profile.children.cycles-pp.locks_free_lock
>      98.83            +0.7       99.54        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>      98.79            +0.7       99.52        perf-profile.children.cycles-pp.do_syscall_64
>      98.65            +0.8       99.44        perf-profile.children.cycles-pp.__x64_sys_fcntl
>      97.71            +1.5       99.17        perf-profile.children.cycles-pp.do_fcntl
>      97.42            +1.6       99.05        perf-profile.children.cycles-pp.fcntl_setlk
>      94.97            +3.0       97.98        perf-profile.children.cycles-pp._raw_spin_lock
>      93.97            +3.3       97.24        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>      35.74 ± 18%     +62.7       98.46        perf-profile.children.cycles-pp.do_lock_file_wait
>       0.00           +65.3       65.31        perf-profile.children.cycles-pp.locks_delete_block
>       0.59            -0.4        0.23 ±  2%  perf-profile.self.cycles-pp.syscall_return_via_sysret
>       0.53            -0.3        0.20 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64
>       0.35 ± 10%      -0.3        0.06 ±  9%  perf-profile.self.cycles-pp.fput
>       1.00 ±  2%      -0.3        0.74        perf-profile.self.cycles-pp._raw_spin_lock
>       0.38 ±  3%      -0.2        0.14        perf-profile.self.cycles-pp.kmem_cache_alloc
>       0.32 ±  2%      -0.2        0.12 ±  3%  perf-profile.self.cycles-pp.memset_erms
>       0.20 ±  3%      -0.1        0.08 ±  6%  perf-profile.self.cycles-pp.avc_has_perm
>       0.20 ±  2%      -0.1        0.08 ±  6%  perf-profile.self.cycles-pp.posix_lock_inode
>       0.20 ±  6%      -0.1        0.08        perf-profile.self.cycles-pp.___might_sleep
>       0.16 ±  4%      -0.1        0.05 ±  9%  perf-profile.self.cycles-pp.__fget
>       0.15 ±  8%      -0.1        0.06        perf-profile.self.cycles-pp.fcntl_setlk
>       0.24            -0.1        0.15 ±  2%  perf-profile.self.cycles-pp.kmem_cache_free
>       0.11            -0.1        0.03 ±100%  perf-profile.self.cycles-pp.__x64_sys_fcntl
>       0.11 ±  4%      -0.1        0.03 ±100%  perf-profile.self.cycles-pp.__might_sleep
>       0.13            -0.1        0.05        perf-profile.self.cycles-pp.locks_alloc_lock
>       0.13 ±  5%      -0.1        0.05        perf-profile.self.cycles-pp.file_has_perm
>       0.07 ±  7%      -0.0        0.05        perf-profile.self.cycles-pp.locks_free_lock
>      93.64            +3.3       96.89        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> 
> 
>                             will-it-scale.per_thread_ops
> 
>   75000 +-+-----------------------------------------------------------------+
>   70000 +-+.. .+.+.+..+.+.+.+..        .+. .+.. .+.+.  .+. .+.+..+.+.+.+..+.|
>         |    +                 +. .+.+.   +    +     +.   +                 |
>   65000 +-+                      +                                          |
>   60000 +-+                                                                 |
>   55000 +-+                                                                 |
>   50000 +-+                                                                 |
>         |                                                                   |
>   45000 +-+                                                                 |
>   40000 +-+                                                                 |
>   35000 +-+                                                                 |
>   30000 +-+                                                                 |
>         |                                                                   |
>   25000 O-O  O O O O  O O O O  O O O O  O O O  O O O O  O O                 |
>   20000 +-+-----------------------------------------------------------------+
> 
> 
>                                 will-it-scale.workload
> 
>   6.5e+06 +-+---------------------------------------------------------------+
>           |.+.  .+.+.+.+    +   +.      .+. .+.  .+.+. .+.  .+.+   +.+..+.+ |
>     6e+06 +-+ +.                  +.+..+   +   +.     +   +.                |
>   5.5e+06 +-+                                                               |
>           |                                                                 |
>     5e+06 +-+                                                               |
>   4.5e+06 +-+                                                               |
>           |                                                                 |
>     4e+06 +-+                                                               |
>   3.5e+06 +-+                                                               |
>           |                                                                 |
>     3e+06 +-+                                                               |
>   2.5e+06 +-+                                                               |
>           O O O  O O     O  O O O O O  O O O O O  O O O O O                 |
>     2e+06 +-+--------O-O----------------------------------------------------+
> 
> 
>                            will-it-scale.time.user_time
> 
>   300 +-+-------------------------------------------------------------------+
>   280 +-+            .+.. .+.                                  .+.. .+.  .+.|
>       |.+..+.+.+..+.+    +   +..      .+.+.+..+.+.+..+.+.+..+.+    +   +.   |
>   260 +-+                       +.+.+.                                      |
>   240 +-+                                                                   |
>   220 +-+                                                                   |
>   200 +-+                                                                   |
>       |                                                                     |
>   180 +-+                                                                   |
>   160 +-+                                                                   |
>   140 +-+  O                                                                |
>   120 +-+                                                                   |
>       O O    O O                  O                                         |
>   100 +-+         O O O  O O O  O   O  O O O  O O O  O O O                  |
>    80 +-+-------------------------------------------------------------------+
> 
> [*] bisect-good sample
> [O] bisect-bad  sample
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 
> Thanks,
> Rong Chen

-- 
Jeff Layton <[email protected]>
