[Public]

Hi All,

Sharing the `rte_topology_` API patch next version, targeted for the upcoming release.
Extras: adding support for Cache-ID for L2 and L3 (for cache line stashing) and Code
Data Prioritization.

Snipped

>
> >
> > Hello Vipin and others,
> >
> > please, will there be any progress or update on this series?
>
> Apologies, we did a small update in Slack and missed sharing it here. Let me try to
> address your questions below.
>
> >
> > I successfully tested those changes on our Intel and AMD machines and
> > would like to use it in production soon.
> >
> > The API is a little bit unintuitive, at least for me, but I
> > successfully integrated into our software.
> >
> > I am missing a clear relation to the NUMA socket approach used in DPDK.
> > E.g. I would like to be able to easily walk over a list of lcores from
> > a specific NUMA node grouped by L3 domain. Yes, there is the
> > RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket
> > IDs?
>
> Yes, we at AMD were internally debating the same. But since the lcore API already
> provides `rte_lcore_to_socket_id`, adding yet another variation or argument felt
> lacklustre.
> Our internal take was: when using the new API, why not simply check whether the
> lcore is in the desired physical socket or sub-socket NUMA domain?
>
> Hence, we did not add the option.
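>
> For illustration only, a rough sketch of that check (the domain-walk helper name and
> signature below are assumed from the RFC and may differ in the final version; only
> rte_lcore_to_socket_id() is the existing lcore API):
>
> ```
> #include <stdio.h>
> #include <rte_lcore.h>
>
> /* Sketch: walk lcores grouped by L3 domain and keep only those on the
>  * desired physical socket. rte_get_next_lcore_from_domain() stands in
>  * for the series' helper (assumed name); rte_lcore_to_socket_id() is
>  * the existing lcore API. */
> static void
> pick_lcores_on_socket(unsigned int want_socket)
> {
>         unsigned int lcore = rte_get_next_lcore_from_domain(-1, 1, 0,
>                                 RTE_LCORE_DOMAIN_L3);
>
>         while (lcore < RTE_MAX_LCORE) {
>                 if (rte_lcore_to_socket_id(lcore) == want_socket)
>                         printf("lcore %u is on socket %u\n",
>                                lcore, want_socket);
>                 lcore = rte_get_next_lcore_from_domain(lcore, 1, 0,
>                                 RTE_LCORE_DOMAIN_L3);
>         }
> }
> ```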
>
> >
> > Also, I do not clearly understand what is the purpose of using domain 
> > selector
> like:
> >
> >   RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2
> >
> > or even:
> >
> >   RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2
>
> I believe we have mentioned in the documentation to choose only one; if multiple are
> combined, only one will be picked up based on the code flow.
>
> The real use of these selectors is to pick physical cores under the same cache or IO
> domain. Example: certain SoCs have 4 cores sharing an L2 cache, which makes pipeline
> processing more convenient (less data movement). In such cases, select lcores within
> the same L2 topology, as in the sketch below.
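>
> A minimal sketch of that pattern (same caveat: the domain-walk helper name and
> semantics below are assumed from the RFC and may differ; rte_eal_remote_launch()
> is the existing EAL API):
>
> ```
> #include <stdint.h>
> #include <rte_common.h>
> #include <rte_launch.h>
> #include <rte_lcore.h>
>
> /* Pipeline stage body (application-defined). */
> static int
> stage_worker(void *arg)
> {
>         unsigned int stage = (unsigned int)(uintptr_t)arg;
>
>         /* ... process packets for this stage ... */
>         RTE_SET_USED(stage);
>         return 0;
> }
>
> /* Sketch: launch up to 4 pipeline stages on lcores picked from the L2
>  * domain grouping (helper name/semantics assumed from the RFC). */
> static void
> launch_pipeline_on_shared_l2(void)
> {
>         unsigned int launched = 0;
>         unsigned int lcore = rte_get_next_lcore_from_domain(-1, 1, 0,
>                                 RTE_LCORE_DOMAIN_L2);
>
>         while (lcore < RTE_MAX_LCORE && launched < 4) {
>                 rte_eal_remote_launch(stage_worker,
>                                 (void *)(uintptr_t)launched, lcore);
>                 launched++;
>                 lcore = rte_get_next_lcore_from_domain(lcore, 1, 0,
>                                 RTE_LCORE_DOMAIN_L2);
>         }
> }
> ```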
>
> >
> > the documentation does not explain this. I could not spot any kind of
> > grouping that would help me in any way. Some "best practices" examples
> > would be nice to have to understand the intentions better.
>
> From https://patches.dpdk.org/project/dpdk/cover/20241105102849.1947-1-
> vipin.vargh...@amd.com/
>
> ```
> Reason:
>  - Applications using DPDK libraries relies on consistent memory access.
>  - Lcores being closer to same NUMA domain as IO.
>  - Lcores sharing same cache.
>
> Latency is minimized by using lcores that share the same NUMA topology.
> Memory access is optimized by utilizing cores within the same NUMA domain or
> tile. Cache coherence is preserved within the same shared cache domain, 
> reducing
> the remote access from tile|compute package via snooping (local hit in either 
> L2 or
> L3 within same NUMA domain).
> ```
>
> >
> > I found a little catch when running DPDK with more lcores than there
> > are physical or SMT CPU cores. This happens when using e.g. an option like
> > --lcores=(0-15)@(0-1).
> > The results from the topology API would not match the lcores because
> > hwloc is not aware of the lcores concept. This might be mentioned somewhere.
>
> Yes, this is expected, as one can map any CPU cores to DPDK lcores with `lcore-map`.
> We did mention this in RFC v4, but when we upgraded to RFC v5 we missed adding it
> back.
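>
> As a side note (a small sketch), the existing `rte_lcore_cpuset()` can be used to see
> which physical CPUs actually back each lcore when such a remap is in effect, e.g.
> on Linux:
>
> ```
> #include <stdio.h>
> #include <rte_lcore.h>
>
> /* Sketch: print the physical CPUs backing each lcore. With a remap such
>  * as --lcores=(0-15)@(0-1), several lcores share the same CPUs, so the
>  * hwloc-derived topology cannot be matched 1:1 to lcore ids. */
> static void
> dump_lcore_backing_cpus(void)
> {
>         unsigned int lcore;
>
>         RTE_LCORE_FOREACH(lcore) {
>                 rte_cpuset_t cs = rte_lcore_cpuset(lcore);
>                 unsigned int cpu;
>
>                 printf("lcore %u ->", lcore);
>                 for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
>                         if (CPU_ISSET(cpu, &cs))
>                                 printf(" cpu%u", cpu);
>                 printf("\n");
>         }
> }
> ```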
>
> >
> > Anyway, I really appreciate this work and would like to see it upstream.
> > Especially for AMD machines, some framework like this is a must.
> >
> > Kind regards,
> > Jan
> >
>
> We are planning to remove the RFC tag and share the final version for the upcoming
> DPDK release shortly.
