Thank you for posting this detailed analysis. This is all I was asking for. I knew you had to have some hard data to want that change.
I thought at one point we had populated the new stack with the checksum code from the old stack for the highly optimized cases. Was PowerPC one of the cases where we just trusted the new stack's generic code even though our old code was more optimized? Am I not remembering this correctly? The other hot spot is memcpy(), but that should be the same in both configurations since it comes from newlib.

--joel

On 9/25/2014 1:15 AM, Sebastian Huber wrote:
> Hello,
>
> I used simple FTP transfers to/from the target to measure the TCP performance
> of the new network stack on a PowerPC MPC8309. The new network stack is a
> port from FreeBSD 9.2. It is highly optimized for SMP and uses fine-grained
> locking. For uni-processor systems this is not a benefit. About 2000 mutexes
> are present in the idle state of the stack. It turned out that the standard
> RTEMS semaphores are a major performance bottleneck. I added a light-weight
> alternative (rtems_bsd_mutex). For fine-grained locking it is important that
> the uncontested mutex obtain/release is as fast as possible.
>
> With the latest version (struct timespec and rtems_bsd_mutex) I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0 1194M    0     0  9101k      0 --:--:--  0:02:14 --:--:-- 9158k
>
> perf disabled    coverage: 100.000%    runtime: 99.998%    covtime: 100.000%
> name________________________|ratio___|1%_____2%________5%_____10%_____20%_____|
> in_cksumdata                | 11.137%|==========================
> memcpy                      | 10.430%|=========================
> tcp_output                  |  7.189%|=====================
> ip_output                   |  3.241%|=============
> uma_zalloc_arg              |  2.710%|===========
> ether_output                |  2.533%|==========
> tcp_do_segment              |  2.121%|========
> m_copym                     |  2.062%|========
> uma_zfree_arg               |  2.062%|========
> bsd__mtx_unlock_flags       |  2.062%|========
> tcp_input                   |  2.003%|=======
> Thread_Dispatch             |  1.885%|=======
> rtalloc1_fib                |  1.649%|=====
> ip_input                    |  1.708%|======
> memmove                     |  1.532%|====
> rn_match                    |  1.473%|====
> tcp_addoptions              |  1.414%|====
> arpresolve                  |  1.355%|===
> in_cksum_skip               |  1.296%|===
> memset                      |  1.296%|===
> mb_dupcl                    |  1.178%|==
> uec_if_dequeue              |  1.178%|==
> in_lltable_lookup           |  1.119%|=
> rtfree                      |  1.001%|<
> ether_nh_input              |  1.001%|<
> uec_if_bd_wait_and_free     |  1.001%|<
> quicc_bd_tx_submit_and_wait |  1.001%|<
> TOD_Get_with_nanoseconds    |  1.001%|<
> uec_if_interface_start      |  0.942%|<
> bsd__mtx_lock_flags         |  0.883%|<
> bzero                       |  0.883%|<
> mb_ctor_mbuf                |  0.824%|<
> mb_free_ext                 |  0.824%|<
> netisr_dispatch_src         |  0.824%|<
> in_pcblookup_hash_locked.isr|  0.766%|<
> bsd_critical_enter          |  0.766%|<
> rw_runlock                  |  0.707%|<
> if_transmit                 |  0.707%|<
> Timespec_Add_to             |  0.707%|<
> in_delayed_cksum            |  0.648%|<
> tcp_timer_active            |  0.648%|<
> ether_demux                 |  0.648%|<
> ppc_clock_nanoseconds_since_|  0.648%|<
> RBTree_Find                 |  0.648%|<
> Thread_Enable_dispatch      |  0.648%|<
> rw_rlock                    |  0.589%|<
> callout_reset_on            |  0.589%|<
> in_clsroute                 |  0.589%|<
>
> We have 3% processor load due to mutex operations (_bsd__mtx_lock_flags() and
> _bsd__mtx_unlock_flags()).
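For anyone following along, the uncontested fast path of such a light-weight mutex on a uni-processor can boil down to a few loads and stores under interrupt disable. This is only a sketch with made-up names (lw_mutex, isr_disable(), executing_thread()), not the actual rtems_bsd_mutex code:

#include <assert.h>
#include <stddef.h>

/* Illustrative only: the names lw_mutex, isr_disable(), isr_enable() and
 * executing_thread() are invented for this sketch. */
typedef struct {
  void *owner;       /* owning thread, NULL when the mutex is free */
  int   nest_level;  /* recursive obtain count */
} lw_mutex;

/* Stand-ins for the CPU-specific interrupt disable/enable primitives. */
static unsigned isr_disable(void) { return 0; }
static void isr_enable(unsigned level) { (void) level; }

/* Stand-in for "pointer to the executing thread". */
static void *executing_thread(void) { static int self; return &self; }

/* Uncontested obtain: no object table lookup, no thread queue setup. */
static void lw_mutex_obtain(lw_mutex *m)
{
  unsigned level = isr_disable();
  void *self = executing_thread();

  if (m->owner == NULL) {
    m->owner = self;           /* fast path: the mutex was free */
  } else if (m->owner == self) {
    ++m->nest_level;           /* recursive obtain by the owner */
  } else {
    assert(0);                 /* contended case: blocking omitted here */
  }

  isr_enable(level);
}

/* Uncontested release: again just a couple of loads and stores. */
static void lw_mutex_release(lw_mutex *m)
{
  unsigned level = isr_disable();

  if (m->nest_level > 0) {
    --m->nest_level;
  } else {
    m->owner = NULL;           /* fast path: no thread is waiting */
  }

  isr_enable(level);
}

The standard rtems_semaphore_obtain()/rtems_semaphore_release() path additionally pays for the object lookup by ID and the general thread queue handling, which is consistent with Objects_Get_isr_disable, CORE_mutex_Surrender and Thread_queue_Dequeue showing up in the standard-objects profile further down.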
> With the 64-bit nanoseconds timestamp I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0  830M    0     0  8834k      0 --:--:--  0:01:39 --:--:-- 8982k
>
> perf disabled    coverage: 100.000%    runtime: 99.998%    covtime: 100.000%
> name____________________________|ratio___|1%_____2%________5%_____10%_|
> in_cksumdata                    | 10.130%|=========================
> memcpy                          |  9.786%|========================
> tcp_output                      |  8.890%|=======================
> ip_output                       |  5.031%|=================
> ether_output                    |  2.618%|==========
> Thread_Dispatch                 |  2.549%|==========
> __divdi3                        |  2.205%|========
> bsd__mtx_unlock_flags           |  2.136%|========
> __moddi3                        |  2.067%|========
> tcp_input                       |  1.998%|=======
> uma_zalloc_arg                  |  1.929%|=======
> m_copym                         |  1.654%|=====
> tcp_do_segment                  |  1.654%|=====
> tcp_addoptions                  |  1.516%|====
> sbdrop_internal                 |  1.447%|====
> mb_free_ext                     |  1.378%|===
> uma_zfree_arg                   |  1.309%|===
> ip_input                        |  1.240%|==
> in_cksum_skip                   |  1.171%|=
> uec_if_interface_start          |  1.171%|=
> quicc_bd_tx_submit_and_wait     |  1.171%|=
> callout_reset_on                |  1.102%|=
> rtfree                          |  1.033%|
> uec_if_dequeue                  |  1.102%|=
> rn_match                        |  0.964%|<
> rtalloc1_fib                    |  0.964%|<
> ether_nh_input                  |  0.964%|<
> uec_if_bd_wait_and_free         |  0.964%|<
> mb_ctor_mbuf                    |  0.895%|<
> in_lltable_lookup               |  0.895%|<
> memset                          |  0.895%|<
> uec_if_bd_wait.constprop.9      |  0.827%|<
> mb_dupcl                        |  0.758%|<
> cc_ack_received.isra.0          |  0.758%|<
> tcp_timer_active                |  0.758%|<
> bsd__mtx_lock_flags             |  0.689%|<
> netisr_dispatch_src             |  0.689%|<
> in_pcblookup_hash_locked.isra.1 |  0.689%|<
> tcp_xmit_timer                  |  0.689%|<
> sosend_generic                  |  0.620%|<
> rtems_bsd_chunk_get_info        |  0.620%|<
> Thread_Enable_dispatch          |  0.620%|<
> bzero                           |  0.620%|<
> rw_runlock                      |  0.551%|<
> uma_find_refcnt                 |  0.551%|<
> arpresolve                      |  0.551%|<
> chunk_compare                   |  0.551%|<
> ether_demux                     |  0.551%|<
> rtems_clock_get_uptime_timeval  |  0.551%|<
> TOD_Get_with_nanoseconds        |  0.551%|<
> memcmp                          |  0.551%|<
> mb_ctor_clust                   |  0.482%|<
> in_pcblookup_hash               |  0.482%|<
> in_clsroute                     |  0.482%|<
>
> So we have 4.2% processor load due to the 64-bit divisions and the throughput
> drops by 3%.
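A note on __divdi3 and __moddi3: those are the libgcc software helpers for 64-bit signed division and modulo on the 32-bit PowerPC. Splitting a 64-bit nanosecond count into seconds and nanoseconds needs one of each, roughly like this (a sketch of the kind of conversion involved, not the actual RTEMS timestamp code):

#include <stdint.h>
#include <time.h>

/* Converting a 64-bit nanosecond count into a struct timespec needs a
 * 64-bit division and a 64-bit modulo.  A 32-bit PowerPC has no
 * instruction for that, so GCC typically emits calls to the libgcc
 * helpers __divdi3/__moddi3 seen in the profile above. */
static struct timespec nanoseconds_to_timespec(int64_t nanoseconds)
{
  struct timespec ts;

  ts.tv_sec = (time_t) (nanoseconds / 1000000000LL);   /* __divdi3 */
  ts.tv_nsec = (long) (nanoseconds % 1000000000LL);    /* __moddi3 */

  return ts;
}

Together the two helpers account for the roughly 4.2% (2.205% + 2.067%) Sebastian mentions.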
> With the standard RTEMS objects I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0  927M    0     0  8438k      0 --:--:--  0:01:52 --:--:-- 8528k
>
> perf disabled    coverage: 100.000%    runtime: 99.997%    covtime: 100.000%
> name____________________________|ratio___|1%_____2%________5%_____10%_|
> in_cksumdata                    | 10.184%|=========================
> memcpy                          |  9.052%|========================
> tcp_output                      |  8.382%|=======================
> ip_output                       |  3.310%|=============
> rtems_semaphore_obtain          |  3.017%|============
> ether_output                    |  2.598%|==========
> Thread_Dispatch                 |  2.430%|=========
> uma_zalloc_arg                  |  1.844%|======
> uma_zfree_arg                   |  1.634%|=====
> quicc_bd_tx_submit_and_wait     |  1.634%|=====
> tcp_do_segment                  |  1.550%|=====
> uec_if_dequeue                  |  1.508%|====
> in_lltable_lookup               |  1.466%|====
> rn_match                        |  1.424%|====
> rtalloc1_fib                    |  1.424%|====
> ip_input                        |  1.424%|====
> in_cksum_skip                   |  1.424%|====
> rtems_semaphore_release         |  1.424%|====
> CORE_mutex_Surrender            |  1.383%|===
> Thread_queue_Dequeue            |  1.341%|===
> m_copym                         |  1.257%|==
> bsd__mtx_lock_flags             |  1.173%|=
> mb_free_ext                     |  1.173%|=
> arpresolve                      |  1.173%|=
> memset                          |  1.173%|=
> tcp_input                       |  1.131%|=
> tcp_addoptions                  |  1.089%|=
> bsd__mtx_unlock_flags           |  1.047%|
> ether_nh_input                  |  1.047%|
> bzero                           |  0.963%|<
> rtfree                          |  0.922%|<
> netisr_dispatch_src             |  0.880%|<
> mb_dupcl                        |  0.838%|<
> rtalloc_ign_fib                 |  0.838%|<
> in_broadcast                    |  0.838%|<
> uec_if_interface_start          |  0.838%|<
> memmove                         |  0.838%|<
> mb_ctor_mbuf                    |  0.796%|<
> tcp_timer_active                |  0.796%|<
> chunk_compare                   |  0.712%|<
> callout_reset_on                |  0.712%|<
> in_pcblookup_hash_locked        |  0.712%|<
> uec_if_bd_wait_and_free         |  0.712%|<
> RBTree_Find                     |  0.712%|<
> tcp_dooptions                   |  0.670%|<
> sbsndptr                        |  0.628%|<
> if_transmit                     |  0.586%|<
> Objects_Get_isr_disable         |  0.544%|<
>
> So we have 8.5% processor load due to mutex operations and the throughput
> drops by 7%.
>
> In all configurations we see that the UMA zone allocator used for
> mbuf/mcluster allocations produces a high processor load. If we replace it
> with a simple freelist, then we will likely be on par with the old network
> stack in terms of throughput on this target.
>
> in_cksumdata() is a generic implementation in the new network stack. The old
> network stack uses an optimized variant with inline assembler.
>
> Modern network interface controllers support TCP/UDP checksum generation and
> checking in hardware. This can also be used with the new network stack.
>
> --
> Sebastian Huber, embedded brains GmbH
>
> Address : Dornierstr. 4, D-82178 Puchheim, Germany
> Phone   : +49 89 189 47 41-16
> Fax     : +49 89 189 47 41-09
> E-Mail  : sebastian.hu...@embedded-brains.de
> PGP     : Public key available on request.

--
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985
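For illustration, the "simple freelist" Sebastian mentions for mbuf/mcluster allocation amounts to a push/pop on a singly linked list. This is just a sketch with made-up names (free_list, free_item), not a proposed replacement for the UMA zone allocator:

#include <stddef.h>

/* Illustrative only: a singly linked free list of fixed-size buffers.
 * The names free_list/free_item are invented for this sketch. */
struct free_item {
  struct free_item *next;
};

struct free_list {
  struct free_item *head;
};

/* Allocation is a pointer pop; a real version would need interrupt
 * disable or a mutex around it and a refill path for an empty list. */
static void *free_list_get(struct free_list *fl)
{
  struct free_item *item = fl->head;

  if (item != NULL) {
    fl->head = item->next;
  }

  return item;
}

/* Free is a pointer push. */
static void free_list_put(struct free_list *fl, void *buffer)
{
  struct free_item *item = buffer;

  item->next = fl->head;
  fl->head = item;
}

Compared with the uma_zalloc_arg()/uma_zfree_arg() entries, which cost a few percent in every profile above, the per-allocation work here is a handful of instructions.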