[Public] Hi All,
Sharing the next version of the `rte_topology_` API patch, targeted for the upcoming release. Extras: adding support for Cache-ID for L2 and L3 (for cache-line stashing), and for Code Data Prioritization too.

Snipped

> > Hello Vipin and others,
> >
> > please, will there be any progress or update on this series?
>
> Apologies, we did a small update in slack and missed this out here. Let me
> try to address your questions below.
>
> > I successfully tested those changes on our Intel and AMD machines and
> > would like to use them in production soon.
> >
> > The API is a little bit unintuitive, at least for me, but I
> > successfully integrated it into our software.
> >
> > I am missing a clear relation to the NUMA socket approach used in DPDK.
> > E.g. I would like to be able to easily walk over a list of lcores from
> > a specific NUMA node grouped by L3 domain. Yes, there is
> > RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket IDs?
>
> Yes, we at AMD were internally debating the same. But since the lcore API
> already has `rte_lcore_to_socket_id`, adding yet another variation or
> argument lacks luster.
> Hence we debated internally: when using the new API, why not simply check
> whether an lcore belongs to the desired physical socket or sub-socket NUMA
> domain?
>
> Hence, we did not add the option.
>
> > Also, I do not clearly understand the purpose of using a domain selector
> > like:
> >
> > RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2
> >
> > or even:
> >
> > RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2
>
> I believe we have mentioned in the documentation to choose only one; if
> multiple selectors are combined, only one will be picked up based on the
> code flow.
>
> The real use of these selectors is to pick physical cores under the same
> cache or IO domain. Example: a certain SoC has 4 cores sharing L2, which
> makes pipeline processing more convenient (less data movement). In such
> cases, select lcores within the same L2 topology.
>
> > the documentation does not explain this. I could not spot any kind of
> > grouping that would help me in any way. Some "best practices" examples
> > would be nice to have to understand the intentions better.
>
> From https://patches.dpdk.org/project/dpdk/cover/20241105102849.1947-1-vipin.vargh...@amd.com/
>
> ```
> Reason:
> - Applications using DPDK libraries rely on consistent memory access.
> - Lcores being closer to the same NUMA domain as IO.
> - Lcores sharing the same cache.
>
> Latency is minimized by using lcores that share the same NUMA topology.
> Memory access is optimized by utilizing cores within the same NUMA domain or
> tile. Cache coherence is preserved within the same shared cache domain,
> reducing remote access from the tile|compute package via snooping (local hit
> in either L2 or L3 within the same NUMA domain).
> ```
>
> > I found a little catch when running DPDK with more lcores than there
> > are physical or SMT CPU cores. This happens when using e.g. an option like
> > --lcores=(0-15)@(0-1).
> > The results from the topology API would not match the lcores because
> > hwloc is not aware of the lcores concept. This might be mentioned somewhere.
>
> Yes, this is expected, as one can map any CPU cores to DPDK lcores with
> `lcore-map`. We did mention this in RFC v4, but when upgrading to RFC v5 we
> missed mentioning it again.
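To illustrate the remapping point above, here is a minimal sketch (illustration only, not part of the patch set) that prints the lcore-to-CPU and lcore-to-socket mapping using only the existing stable lcore APIs. With a remap such as `--lcores='(0-15)@(0-1)'`, several lcore ids land on the same physical CPU, which is why topology data that hwloc keys on physical CPU ids cannot be matched 1:1 to lcore ids.

```
/*
 * Illustration only (not from the patch set): run e.g. as
 *   ./lcore_map --lcores='(0-15)@(0-1)'
 * to see several DPDK lcores sharing the same physical CPU.
 */
#include <stdio.h>

#include <rte_eal.h>
#include <rte_lcore.h>

int
main(int argc, char **argv)
{
	if (rte_eal_init(argc, argv) < 0)
		return -1;

	unsigned int lcore_id;

	RTE_LCORE_FOREACH(lcore_id) {
		/* lcore_id is the EAL identifier; the physical CPU id and the
		 * socket come from the EAL lcore map, not from hwloc.
		 */
		printf("lcore %u -> cpu %d, socket %u\n",
		       lcore_id,
		       rte_lcore_to_cpu_id((int)lcore_id),
		       rte_lcore_to_socket_id(lcore_id));
	}

	return rte_eal_cleanup();
}
```

With the remap above, the printout makes the many-to-one lcore-to-CPU mapping obvious, and the physical-CPU column is what has to be used when correlating against hwloc output.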
> > Anyway, I really appreciate this work and would like to see it upstream.
> > Especially for AMD machines, some framework like this is a must.
> >
> > Kind regards,
> > Jan
>
> We are planning to remove the RFC tag and share the final version for the
> upcoming DPDK release shortly.
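Coming back to Jan's question about walking the lcores of one NUMA node grouped by L3 domain: below is a rough sketch of how we expect an application to combine the domain walk with the existing `rte_lcore_to_socket_id`. The accessor names used here (`rte_get_domain_count`, `rte_lcore_count_from_domain`, `rte_get_lcore_in_domain`) are illustrative placeholders following the RFC series; the final names and signatures may differ in the next version. Only `RTE_LCORE_DOMAIN_L3` and `rte_lcore_to_socket_id` are taken from the discussion above.

```
/*
 * Rough sketch only. The three domain accessors below are placeholder
 * names following the RFC series and may change; they are not part of
 * released DPDK.
 */
#include <stdio.h>

#include <rte_lcore.h>

static void
dump_l3_domains(void)
{
	/* number of L3 domains reported by the topology (placeholder name) */
	unsigned int dom_count = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

	for (unsigned int dom = 0; dom < dom_count; dom++) {
		unsigned int n =
			rte_lcore_count_from_domain(RTE_LCORE_DOMAIN_L3, dom);

		printf("L3 domain %u:\n", dom);
		for (unsigned int pos = 0; pos < n; pos++) {
			unsigned int lcore_id =
				rte_get_lcore_in_domain(RTE_LCORE_DOMAIN_L3,
							dom, pos);

			/* the cross-check Jan asked about: the domain walk
			 * gives the cache grouping, the existing lcore API
			 * gives the physical-socket (NUMA) id.
			 */
			printf("  lcore %u (socket %u)\n", lcore_id,
			       rte_lcore_to_socket_id(lcore_id));
		}
	}
}
```

On parts where an L3 domain never spans a socket, every lcore printed for a given domain will report the same socket id, which is the check we suggest instead of adding another socket argument to the new API.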