>> > From: Pavan Nikhilesh <[email protected]>
>> >
>> > Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
>> > optimal burst size.
>> >
>> > Set default value to 64 for soc_cn10k and 32 generally.
>> >
>> > Signed-off-by: Pavan Nikhilesh <[email protected]>
>> > ---
>> > This improves performance by 5% on l2fwd, other examples showed
>> > negligible difference on CN10K.
>> >
>>
>> I support the concept of having a recommended mbuf burst size, targeting the 
>> majority of generic applications.
>> Making it CPU dependent seems like a good choice.
>>
>> It should be named differently.
>> First of all, "optimal" depends on the use case; if targeting low latency, 
>> shorter bursts are better, so "OPTIMAL" should not be part of the name.
>> Second, I would guess that it only targets mbuf bursts, not also bursts of 
>> other operations (e.g. hash lookups), so "MBUF" should be part of the name.
>>
>> Suggestion:
>> /* Recommended burst size for generic applications, striking a balance 
>> between throughput and latency. */
>> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)
>>
>> <feature creep>
>> /* Recommended burst size for generic applications targeting low latency. */
>> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
>> </feature creep>
>>
>> Having these standardized will also allow libraries and drivers to optimize 
>> for them, e.g. drivers should support bursts sizes all the way down to 
>> RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that the 
>> RTE_MBUF_BURST_SIZE_MIN is not lower than supported by the driver/hardware.
>>
>> <more feature creep>
>> rte_config.h could have "#define RTE_MBUF_BURST_SIZE 
>> RTE_MBUF_BURST_SIZE_MAX", for the application developer to change to 
>> RTE_MBUF_BURST_SIZE_MIN for low latency applications.
>> This will let the libraries and drivers optimize for the specific burst size 
>> used by the application.
>> </more feature creep>
>>
>> <rambling>
>> Intuitively, I would assume that the optimal burst size essentially depends 
>> on the CPU's L1D cache size and the application's number of non-mbuf cache 
>> lines accessed per burst.
>> Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst 
>> touches 4 cache lines per packet:
>> 2 cache lines for the mbuf
>> 1 cache line for the packet data
>> 1 cache line per packet for some table lookup/forwarding entry
>>
>> Then the mbuf burst should be max 512/4 = 128.
>> But local variables also use memory during processing, so using a burst of 
>> 64 would leave room for that and some more.
>> </rambling>
>>
>> >  config/arm/meson.build | 1 +
>> >  config/meson.build     | 1 +
>> >  2 files changed, 2 insertions(+)
>> >
>> > diff --git a/config/arm/meson.build b/config/arm/meson.build
>> > index 523b0fc0ed50..fa64c07016b1 100644
>> > --- a/config/arm/meson.build
>> > +++ b/config/arm/meson.build
>> > @@ -481,6 +481,7 @@ soc_cn10k = {
>> >          ['RTE_MAX_LCORE', 24],
>> >          ['RTE_MAX_NUMA_NODES', 1],
>> >          ['RTE_MEMPOOL_ALIGN', 128],
>> > +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>> >      ],
>> >      'part_number': '0xd49',
>> >      'extra_march_features': ['crypto'],
>> > diff --git a/config/meson.build b/config/meson.build
>> > index 0cb074ab95b7..95367ae88e2d 100644
>> > --- a/config/meson.build
>> > +++ b/config/meson.build
>> > @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>> >      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>> >  endif
>> >  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
>> > +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
>> >
>> >  compile_time_cpuflags = []
>> >  subdir(arch_subdir)
>> > --
>> > 2.50.1 (Apple Git-155)
>
>I understand the motivation, and it make sense for a pure embedded system.
>But then again on an embedded system the application can just set its burst 
>size;
>this config option only impacts performance of testpmd and examples. And the
>performance of testpmd is mostly irrelevant what matters is the real 
>application.
>

True, but generally customer engagements start with benchmarking testpmd/l3fwd 
etc.
berfore moving to custom apps.
So, having better performance numbers helps.

>Making it a DPDK config option is a problem for DPDK build in distros.
>The optimal burst size would be driver dependent etc.
>

Since we are not modifying the current default burst size (32) it shouldn't be 
a problem and
can even benifit SoCs.

>Perhaps better off in the existing rx / tx descriptor hints.
>Most of those device configs really need to be relooked at
>since they were inherited from how old Intel drivers worked.

Reply via email to