Thank you for posting this detailed analysis. This is all I was asking for. I knew you had to have some hard data to want that change.
I thought at one point we had populated the new stack with the checksum code from the old stack for the highly optimized cases. Was PowerPC one of the cases where we just trusted the new stack's generic code even though our old code was more optimized? Am I not remembering this correctly? The other hot spot is memcpy(), but that should be the same in both configurations since it comes from newlib.

--joel

On 9/25/2014 1:15 AM, Sebastian Huber wrote:
> Hello,
>
> I used simple FTP transfers to/from the target to measure the TCP performance
> of the new network stack on a PowerPC MPC8309. The new network stack is a
> port from FreeBSD 9.2. It is highly optimized for SMP and uses fine-grained
> locking. For uni-processor systems this is not a benefit. About 2000 mutexes
> are present in the idle state of the stack. It turned out that the standard
> RTEMS semaphores are a major performance bottleneck. I added a light-weight
> alternative (rtems_bsd_mutex). For fine-grained locking it is important that
> the uncontested mutex obtain/release is as fast as possible.
>
> With the latest version (struct timespec and rtems_bsd_mutex) I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0 1194M    0     0  9101k      0 --:--:--  0:02:14 --:--:-- 9158k
>
> perf disabled    coverage: 100.000%    runtime: 99.998%    covtime: 100.000%
> name________________________|ratio___|1%_____2%________5%_____10%_____20%_____|
> in_cksumdata                | 11.137%|==========================
> memcpy                      | 10.430%|=========================
> tcp_output                  |  7.189%|=====================
> ip_output                   |  3.241%|=============
> uma_zalloc_arg              |  2.710%|===========
> ether_output                |  2.533%|==========
> tcp_do_segment              |  2.121%|========
> m_copym                     |  2.062%|========
> uma_zfree_arg               |  2.062%|========
> bsd__mtx_unlock_flags       |  2.062%|========
> tcp_input                   |  2.003%|=======
> Thread_Dispatch             |  1.885%|=======
> rtalloc1_fib                |  1.649%|=====
> ip_input                    |  1.708%|======
> memmove                     |  1.532%|====
> rn_match                    |  1.473%|====
> tcp_addoptions              |  1.414%|====
> arpresolve                  |  1.355%|===
> in_cksum_skip               |  1.296%|===
> memset                      |  1.296%|===
> mb_dupcl                    |  1.178%|==
> uec_if_dequeue              |  1.178%|==
> in_lltable_lookup           |  1.119%|=
> rtfree                      |  1.001%|<
> ether_nh_input              |  1.001%|<
> uec_if_bd_wait_and_free     |  1.001%|<
> quicc_bd_tx_submit_and_wait |  1.001%|<
> TOD_Get_with_nanoseconds    |  1.001%|<
> uec_if_interface_start      |  0.942%|<
> bsd__mtx_lock_flags         |  0.883%|<
> bzero                       |  0.883%|<
> mb_ctor_mbuf                |  0.824%|<
> mb_free_ext                 |  0.824%|<
> netisr_dispatch_src         |  0.824%|<
> in_pcblookup_hash_locked.isr|  0.766%|<
> bsd_critical_enter          |  0.766%|<
> rw_runlock                  |  0.707%|<
> if_transmit                 |  0.707%|<
> Timespec_Add_to             |  0.707%|<
> in_delayed_cksum            |  0.648%|<
> tcp_timer_active            |  0.648%|<
> ether_demux                 |  0.648%|<
> ppc_clock_nanoseconds_since_|  0.648%|<
> RBTree_Find                 |  0.648%|<
> Thread_Enable_dispatch      |  0.648%|<
> rw_rlock                    |  0.589%|<
> callout_reset_on            |  0.589%|<
> in_clsroute                 |  0.589%|<
>
> We have 3% processor load due to mutex operations (_bsd__mtx_lock_flags() and
> _bsd__mtx_unlock_flags()).
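For anyone following along, the uncontested fast path of such a light-weight mutex on a uni-processor can boil down to a few loads and stores under interrupt disable. This is only a sketch with made-up names (lw_mutex, isr_disable(), executing_thread()), not the actual rtems_bsd_mutex code:

#include <assert.h>
#include <stddef.h>

/* Illustrative only: the names lw_mutex, isr_disable(), isr_enable() and
 * executing_thread() are invented for this sketch. */
typedef struct {
  void *owner;       /* owning thread, NULL when the mutex is free */
  int   nest_level;  /* recursive obtain count */
} lw_mutex;

/* Stand-ins for the CPU-specific interrupt disable/enable primitives. */
static unsigned isr_disable(void) { return 0; }
static void isr_enable(unsigned level) { (void) level; }

/* Stand-in for "pointer to the executing thread". */
static void *executing_thread(void) { static int self; return &self; }

/* Uncontested obtain: no object table lookup, no thread queue setup. */
static void lw_mutex_obtain(lw_mutex *m)
{
  unsigned level = isr_disable();
  void *self = executing_thread();

  if (m->owner == NULL) {
    m->owner = self;           /* fast path: the mutex was free */
  } else if (m->owner == self) {
    ++m->nest_level;           /* recursive obtain by the owner */
  } else {
    assert(0);                 /* contended case: blocking omitted here */
  }

  isr_enable(level);
}

/* Uncontested release: again just a couple of loads and stores. */
static void lw_mutex_release(lw_mutex *m)
{
  unsigned level = isr_disable();

  if (m->nest_level > 0) {
    --m->nest_level;
  } else {
    m->owner = NULL;           /* fast path: no thread is waiting */
  }

  isr_enable(level);
}

The standard rtems_semaphore_obtain()/rtems_semaphore_release() path additionally pays for the object lookup by ID and the general thread queue handling, which is consistent with Objects_Get_isr_disable, CORE_mutex_Surrender and Thread_queue_Dequeue showing up in the standard-objects profile further down.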
> With the 64-bit nanoseconds timestamp I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0  830M    0     0  8834k      0 --:--:--  0:01:39 --:--:-- 8982k
>
> perf disabled    coverage: 100.000%    runtime: 99.998%    covtime: 100.000%
> name____________________________|ratio___|1%_____2%________5%_____10%_|
> in_cksumdata                    | 10.130%|=========================
> memcpy                          |  9.786%|========================
> tcp_output                      |  8.890%|=======================
> ip_output                       |  5.031%|=================
> ether_output                    |  2.618%|==========
> Thread_Dispatch                 |  2.549%|==========
> __divdi3                        |  2.205%|========
> bsd__mtx_unlock_flags           |  2.136%|========
> __moddi3                        |  2.067%|========
> tcp_input                       |  1.998%|=======
> uma_zalloc_arg                  |  1.929%|=======
> m_copym                         |  1.654%|=====
> tcp_do_segment                  |  1.654%|=====
> tcp_addoptions                  |  1.516%|====
> sbdrop_internal                 |  1.447%|====
> mb_free_ext                     |  1.378%|===
> uma_zfree_arg                   |  1.309%|===
> ip_input                        |  1.240%|==
> in_cksum_skip                   |  1.171%|=
> uec_if_interface_start          |  1.171%|=
> quicc_bd_tx_submit_and_wait     |  1.171%|=
> callout_reset_on                |  1.102%|=
> rtfree                          |  1.033%|
> uec_if_dequeue                  |  1.102%|=
> rn_match                        |  0.964%|<
> rtalloc1_fib                    |  0.964%|<
> ether_nh_input                  |  0.964%|<
> uec_if_bd_wait_and_free         |  0.964%|<
> mb_ctor_mbuf                    |  0.895%|<
> in_lltable_lookup               |  0.895%|<
> memset                          |  0.895%|<
> uec_if_bd_wait.constprop.9      |  0.827%|<
> mb_dupcl                        |  0.758%|<
> cc_ack_received.isra.0          |  0.758%|<
> tcp_timer_active                |  0.758%|<
> bsd__mtx_lock_flags             |  0.689%|<
> netisr_dispatch_src             |  0.689%|<
> in_pcblookup_hash_locked.isra.1 |  0.689%|<
> tcp_xmit_timer                  |  0.689%|<
> sosend_generic                  |  0.620%|<
> rtems_bsd_chunk_get_info        |  0.620%|<
> Thread_Enable_dispatch          |  0.620%|<
> bzero                           |  0.620%|<
> rw_runlock                      |  0.551%|<
> uma_find_refcnt                 |  0.551%|<
> arpresolve                      |  0.551%|<
> chunk_compare                   |  0.551%|<
> ether_demux                     |  0.551%|<
> rtems_clock_get_uptime_timeval  |  0.551%|<
> TOD_Get_with_nanoseconds        |  0.551%|<
> memcmp                          |  0.551%|<
> mb_ctor_clust                   |  0.482%|<
> in_pcblookup_hash               |  0.482%|<
> in_clsroute                     |  0.482%|<
>
> So we have 4.2% processor load due to the 64-bit divisions and the throughput
> drops by 3%.
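A note on __divdi3 and __moddi3: those are the libgcc software helpers for 64-bit signed division and modulo on the 32-bit PowerPC. Splitting a 64-bit nanosecond count into seconds and nanoseconds needs one of each, roughly like this (a sketch of the kind of conversion involved, not the actual RTEMS timestamp code):

#include <stdint.h>
#include <time.h>

/* Converting a 64-bit nanosecond count into a struct timespec needs a
 * 64-bit division and a 64-bit modulo.  A 32-bit PowerPC has no
 * instruction for that, so GCC typically emits calls to the libgcc
 * helpers __divdi3/__moddi3 seen in the profile above. */
static struct timespec nanoseconds_to_timespec(int64_t nanoseconds)
{
  struct timespec ts;

  ts.tv_sec = (time_t) (nanoseconds / 1000000000LL);   /* __divdi3 */
  ts.tv_nsec = (long) (nanoseconds % 1000000000LL);    /* __moddi3 */

  return ts;
}

Together the two helpers account for the roughly 4.2% (2.205% + 2.067%) Sebastian mentions.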
> With the standard RTEMS objects I get this:
>
> curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0  927M    0     0  8438k      0 --:--:--  0:01:52 --:--:-- 8528k
>
> perf disabled    coverage: 100.000%    runtime: 99.997%    covtime: 100.000%
> name____________________________|ratio___|1%_____2%________5%_____10%_|
> in_cksumdata                    | 10.184%|=========================
> memcpy                          |  9.052%|========================
> tcp_output                      |  8.382%|=======================
> ip_output                       |  3.310%|=============
> rtems_semaphore_obtain          |  3.017%|============
> ether_output                    |  2.598%|==========
> Thread_Dispatch                 |  2.430%|=========
> uma_zalloc_arg                  |  1.844%|======
> uma_zfree_arg                   |  1.634%|=====
> quicc_bd_tx_submit_and_wait     |  1.634%|=====
> tcp_do_segment                  |  1.550%|=====
> uec_if_dequeue                  |  1.508%|====
> in_lltable_lookup               |  1.466%|====
> rn_match                        |  1.424%|====
> rtalloc1_fib                    |  1.424%|====
> ip_input                        |  1.424%|====
> in_cksum_skip                   |  1.424%|====
> rtems_semaphore_release         |  1.424%|====
> CORE_mutex_Surrender            |  1.383%|===
> Thread_queue_Dequeue            |  1.341%|===
> m_copym                         |  1.257%|==
> bsd__mtx_lock_flags             |  1.173%|=
> mb_free_ext                     |  1.173%|=
> arpresolve                      |  1.173%|=
> memset                          |  1.173%|=
> tcp_input                       |  1.131%|=
> tcp_addoptions                  |  1.089%|=
> bsd__mtx_unlock_flags           |  1.047%|
> ether_nh_input                  |  1.047%|
> bzero                           |  0.963%|<
> rtfree                          |  0.922%|<
> netisr_dispatch_src             |  0.880%|<
> mb_dupcl                        |  0.838%|<
> rtalloc_ign_fib                 |  0.838%|<
> in_broadcast                    |  0.838%|<
> uec_if_interface_start          |  0.838%|<
> memmove                         |  0.838%|<
> mb_ctor_mbuf                    |  0.796%|<
> tcp_timer_active                |  0.796%|<
> chunk_compare                   |  0.712%|<
> callout_reset_on                |  0.712%|<
> in_pcblookup_hash_locked        |  0.712%|<
> uec_if_bd_wait_and_free         |  0.712%|<
> RBTree_Find                     |  0.712%|<
> tcp_dooptions                   |  0.670%|<
> sbsndptr                        |  0.628%|<
> if_transmit                     |  0.586%|<
> Objects_Get_isr_disable         |  0.544%|<
>
> So we have 8.5% processor load due to mutex operations and the throughput
> drops by 7%.
>
> In all configurations we see that the UMA zone allocator used for
> mbuf/mcluster allocations produces a high processor load. If we replace it
> with a simple freelist, then we will likely be on par with the old network
> stack in terms of throughput on this target.
>
> in_cksumdata() is a generic implementation in the new network stack. The old
> network stack uses an optimized variant with inline assembler.
>
> Modern network interface controllers support TCP/UDP checksum generation and
> checking in hardware. This can also be used with the new network stack.
>
> --
> Sebastian Huber, embedded brains GmbH
>
> Address : Dornierstr. 4, D-82178 Puchheim, Germany
> Phone   : +49 89 189 47 41-16
> Fax     : +49 89 189 47 41-09
> E-Mail  : sebastian.hu...@embedded-brains.de
> PGP     : Public key available on request.

--
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985
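For illustration, the "simple freelist" Sebastian mentions for mbuf/mcluster allocation amounts to a push/pop on a singly linked list. This is just a sketch with made-up names (free_list, free_item), not a proposed replacement for the UMA zone allocator:

#include <stddef.h>

/* Illustrative only: a singly linked free list of fixed-size buffers.
 * The names free_list/free_item are invented for this sketch. */
struct free_item {
  struct free_item *next;
};

struct free_list {
  struct free_item *head;
};

/* Allocation is a pointer pop; a real version would need interrupt
 * disable or a mutex around it and a refill path for an empty list. */
static void *free_list_get(struct free_list *fl)
{
  struct free_item *item = fl->head;

  if (item != NULL) {
    fl->head = item->next;
  }

  return item;
}

/* Free is a pointer push. */
static void free_list_put(struct free_list *fl, void *buffer)
{
  struct free_item *item = buffer;

  item->next = fl->head;
  fl->head = item;
}

Compared with the uma_zalloc_arg()/uma_zfree_arg() entries, which cost a few percent in every profile above, the per-allocation work here is a handful of instructions.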