Hi Vladimir,

>Hi Ankur,
>
>Fri, Apr 18, 2025 at 15:45, Ankur Dwivedi <adwiv...@marvell.com>:
>
>
>
>       Hi Vladimir,
>       >> diff --git a/lib/node/ip4_lookup_fib.c b/lib/node/ip4_lookup_fib.c
>       >> index e87864e672..c535b191f8 100644
>       >> --- a/lib/node/ip4_lookup_fib.c
>       >> +++ b/lib/node/ip4_lookup_fib.c
>       >> @@ -40,6 +40,169 @@ static struct ip4_lookup_fib_node_main ip4_lookup_fib_nm;
>       >>   #define IP4_LOOKUP_NODE_PRIV1_OFF(ctx) \
>       >>      (((struct ip4_lookup_fib_node_ctx *)ctx)->mbuf_priv1_off)
>       >>
>       >> +static uint16_t
>       >> +ip4_lookup_fib_node_process(struct rte_graph *graph, struct rte_node *node, void **objs,
>       >> +                        uint16_t nb_objs)
>       >> +{
>       >> +    struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
>       >> +    struct rte_fib *fib = IP4_LOOKUP_NODE_FIB(node->ctx);
>       >> +    const int dyn = IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx);
>       >> +    struct rte_ipv4_hdr *ipv4_hdr;
>       >> +    uint64_t next_hop[nb_objs];
>       >> +    uint16_t lookup_err = 0;
>       >> +    void **to_next, **from;
>       >> +    uint16_t last_spec = 0;
>       >> +    rte_edge_t next_index;
>       >> +    uint16_t n_left_from;
>       >> +    uint32_t ip[nb_objs];
>       >> +    uint16_t held = 0;
>       >> +    uint32_t drop_nh;
>       >> +    uint16_t next;
>       >> +    int i, rc;
>       >> +
>       >> +    /* Speculative next */
>       >> +    next_index = RTE_NODE_IP4_LOOKUP_NEXT_REWRITE;
>       >> +    /* Drop node */
>       >> +    drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
>       >> +
>       >> +    pkts = (struct rte_mbuf **)objs;
>       >> +    from = objs;
>       >> +    n_left_from = nb_objs;
>       >> +
>       >> +    /* Get stream for the speculated next node */
>       >> +    to_next = rte_node_next_stream_get(graph, node, next_index, nb_objs);
>       >> +
>       >> +    for (i = OBJS_PER_CLINE; i < RTE_GRAPH_BURST_SIZE; i += OBJS_PER_CLINE)
>       >> +            rte_prefetch0(&objs[i]);
>       >
>       >Does this prefetching loop make any sense? Unless objs are passed across
>       >threads, this array is likely to be in the cache already.
>       >
>       >And if objs are passed across threads, then why do you start prefetching
>       >from the next cache line instead of the first, and why don't you stop at
>       >nb_objs?
>       For example, if the cache line size is 64 bytes and the pointer size is
>       8 bytes, there will be 8 mbuf pointers per cache line. If nb_objs is 256,
>       then the remaining pointers need to be prefetched into cache. This loop
>       prefetches all nb_objs pointers into cache.
>
>
>
>My comment was about necessity. I'll rephrase and enumerate my questions
>to be clearer:
>1. Are the mbuf pointers contained in the objs array not in cache? My
>assumptions here are:
>    - If you run the graph in RTC mode, the objs array is very likely already
>in your L1 cache (since some previous node/nodes just put packets there)

Yes, in RTC mode the objs are most likely already in the L1 cache. But if they
are not, the above loop prefetches them.
>    - In dispatch mode it doesn't make much sense to run this lookup node in
>another thread separately from the remaining nodes processing IPv4

If it's run in another thread/lcore, the above prefetching will help.
>
>2. If you still think this prefetching loop is required here, then:
>
>
>       The loop can be changed to stop at nb_objs.
>       for (i = OBJS_PER_CLINE; i < nb_objs; i += OBJS_PER_CLINE) { }
>
>
>
>  why do you start from OBJS_PER_CLINE (the second cache line of the objs array)?
Because the first cache line of objs is prefetched in __rte_node_process(),
which is called from rte_graph_walk().

>3. "i < nb_objs". Should be "i < RTE_ALIGN_CEIL(nb_objs, OBJS_PER_CLINE) /
>OBJS_PER_CLINE"

In that loop i would be incremented by 1. The for loop I mentioned above does
the same thing, except that i is incremented by OBJS_PER_CLINE, so the bound
"i < nb_objs" covers the same set of cache lines.
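
To summarize, the adjusted loop would look like this (a sketch; OBJS_PER_CLINE
in lib/node is RTE_CACHE_LINE_SIZE / sizeof(void *), and the first cache line
of objs is assumed to already be prefetched by the graph walk):

    /* objs[0..OBJS_PER_CLINE-1] were prefetched by rte_graph_walk(), so
     * start at the second cache line and stop once the nb_objs pointers
     * handed to this node are covered.
     */
    for (i = OBJS_PER_CLINE; i < nb_objs; i += OBJS_PER_CLINE)
            rte_prefetch0(&objs[i]);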
>
>
>       >
>       >> +
>       >> +#if RTE_GRAPH_BURST_SIZE > 64
>       >> +    for (i = 0; i < 4 && i < n_left_from; i++) {
>       >> +            rte_prefetch0(pkts[i]);
>       >> +            rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
>       >> +                                    sizeof(struct rte_ether_hdr)));
>       >
>       >This construction does not make sense to me, same as similar
>       >constructions below.
>       >
>       >The second prefetch has a memory dependency on the data that will be
>       >prefetched by the first one. Does removing this prefetch affect
>       >performance?
>       The first prefetches the data at the mbuf address. The second prefetch
>       is for the address containing the ipv4 header. Both can be in separate
>       cache lines, as the ipv4 header starts at mbuf->buf_addr +
>       mbuf->data_off + sizeof(ethernet header). data_off is generally 64 or
>       128 bytes depending on the cache line size.
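
For reference, the address the second prefetch targets is computed as follows
(a sketch of what rte_pktmbuf_mtod_offset() expands to):

    /* the ipv4 header's cache line, distinct from the mbuf struct's line */
    void *hdr = (char *)mbuf->buf_addr + mbuf->data_off +
                sizeof(struct rte_ether_hdr);
    rte_prefetch0(hdr);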
>       >
>
>
>
>Indeed, both of them are in separate cache lines. But my point here is about
>the data dependency of the second prefetch. In order to issue the prefetch
>instruction for the cache line containing the ipv4 header you must know the
>address. You calculate this address by:
>rte_pktmbuf_mtod_offset(pkts[i], void *, sizeof(struct rte_ether_hdr))
>rte_pktmbuf_mtod_offset accesses mbuf->buf_addr and mbuf->data_off,
>as you mentioned. But these fields are not in your L1 cache at the moment,
>because you just asked to prefetch them with the previous instruction.
Ack.
>So my suggestion here would be to look at how it is done in
>ipv4_lookup_sse.c:ip4_lookup_node_process_vec().
>Simplifying, in the run loop it prefetches mbuf + 2, then prefetches the CL
>with the v4 header of mbuf + 1 (assuming the mbuf CL was already fetched into
>L1 on a previous iteration), and finally processes mbuf + 0 (again, assuming
>its ipv4 CL was fetched on the previous iteration).
I have seen the implementation. The code looks fine to me. I will try to add
similar code in the fib lookup node.
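
A scalar sketch of that software pipeline (the distances and names here are
illustrative assumptions, not the sse code verbatim):

    while (n_left_from > 0) {
            /* Stage 1: prefetch the mbuf struct two packets ahead */
            if (n_left_from > 2)
                    rte_prefetch0(pkts[2]);
            /* Stage 2: prefetch the ipv4 header one packet ahead; its
             * mbuf struct (buf_addr, data_off) was prefetched on the
             * previous iteration, so there is no dependency stall here
             */
            if (n_left_from > 1)
                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[1], void *,
                                    sizeof(struct rte_ether_hdr)));
            /* Stage 3: process pkts[0]; its header line was prefetched
             * on the previous iteration (extract DIP, cksum, ttl here)
             */
            pkts++;
            n_left_from--;
    }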
>
>
>       >> +    }
>       >> +#endif
>       >> +
>       >> +    i = 0;
>       >> +    while (n_left_from >= 4) {
>       >> +#if RTE_GRAPH_BURST_SIZE > 64
>       >> +            if (likely(n_left_from > 7)) {
>       >> +                    rte_prefetch0(pkts[4]);
>       >> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[4], void *,
>       >> +                                    sizeof(struct rte_ether_hdr)));
>       >> +                    rte_prefetch0(pkts[5]);
>       >> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[5], void *,
>       >> +                                    sizeof(struct rte_ether_hdr)));
>       >> +                    rte_prefetch0(pkts[6]);
>       >> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[6], void *,
>       >> +                                    sizeof(struct rte_ether_hdr)));
>       >> +                    rte_prefetch0(pkts[7]);
>       >> +                    rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[7], void *,
>       >> +                                    sizeof(struct rte_ether_hdr)));
>       >> +            }
>       >> +#endif
>       >> +
>       >> +            mbuf0 = pkts[0];
>       >> +            mbuf1 = pkts[1];
>       >> +            mbuf2 = pkts[2];
>       >> +            mbuf3 = pkts[3];
>       >> +            pkts += 4;
>       >> +            n_left_from -= 4;
>       >> +            /* Extract DIP of mbuf0 */
>       >> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
>       >> +                            sizeof(struct rte_ether_hdr));
>       >> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>       >> +            node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
>       >> +            node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
>       >> +
>       >> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>       >> +
>       >> +            /* Extract DIP of mbuf1 */
>       >> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf1, struct rte_ipv4_hdr *,
>       >> +                            sizeof(struct rte_ether_hdr));
>       >> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>       >> +            node_mbuf_priv1(mbuf1, dyn)->cksum = ipv4_hdr->hdr_checksum;
>       >> +            node_mbuf_priv1(mbuf1, dyn)->ttl = ipv4_hdr->time_to_live;
>       >> +
>       >> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>       >> +
>       >> +            /* Extract DIP of mbuf2 */
>       >> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf2, struct rte_ipv4_hdr *,
>       >> +                            sizeof(struct rte_ether_hdr));
>       >> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>       >> +            node_mbuf_priv1(mbuf2, dyn)->cksum = ipv4_hdr->hdr_checksum;
>       >> +            node_mbuf_priv1(mbuf2, dyn)->ttl = ipv4_hdr->time_to_live;
>       >> +
>       >> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>       >> +
>       >> +            /* Extract DIP of mbuf3 */
>       >> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf3, struct rte_ipv4_hdr *,
>       >> +                            sizeof(struct rte_ether_hdr));
>       >> +
>       >> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>       >> +            node_mbuf_priv1(mbuf3, dyn)->cksum = ipv4_hdr->hdr_checksum;
>       >> +            node_mbuf_priv1(mbuf3, dyn)->ttl = ipv4_hdr->time_to_live;
>       >> +
>       >> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>       >> +    }
>       >> +    while (n_left_from > 0) {
>       >> +            mbuf0 = pkts[0];
>       >> +            pkts += 1;
>       >> +            n_left_from -= 1;
>       >> +
>       >> +            /* Extract DIP of mbuf0 */
>       >> +            ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
>       >> +                            sizeof(struct rte_ether_hdr));
>       >> +            /* Extract cksum, ttl as ipv4 hdr is in cache */
>       >> +            node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
>       >> +            node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
>       >> +
>       >> +            ip[i++] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
>       >> +    }
>       >> +
>       >> +    rc = rte_fib_lookup_bulk(fib, ip, next_hop, nb_objs);
>       >> +    if (unlikely(rc != 0))
>       >> +            return 0;
>       >> +
>       >> +    for (i = 0; i < nb_objs; i++) {
>       >> +            if (unlikely(next_hop[i] == FIB_DEFAULT_NH)) {
>       >> +                    next_hop[i] = drop_nh;
>       >Maybe it is worth just defining FIB_DEFAULT_NH as the drop_nh value
>       >to omit these next_hop reassignments?
>       Ack.
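
For reference, the change could look like this (a sketch; it assumes the drop
edge can be encoded in the default next hop at FIB create time):

    /* a lookup miss then already carries the drop edge in bits 31..16 */
    #define FIB_DEFAULT_NH \
            (((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16)

With that, the per-packet next_hop reassignment in the loop goes away.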
>       >> +                    lookup_err += 1;
>       >> +            }
>       >> +
>       >> +            mbuf0 = (struct rte_mbuf *)objs[i];
>       >> +            node_mbuf_priv1(mbuf0, dyn)->nh = (uint16_t)next_hop[i];
>       >> +            next = (uint16_t)(next_hop[i] >> 16);
>       >> +
>       >> +            if (unlikely(next_index ^ next)) {
>       >> +                    /* Copy things successfully speculated till now */
>       >> +                    rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>       >> +                    from += last_spec;
>       >> +                    to_next += last_spec;
>       >> +                    held += last_spec;
>       >> +                    last_spec = 0;
>       >> +
>       >> +                    rte_node_enqueue_x1(graph, node, next, from[0]);
>       >> +                    from += 1;
>       >> +            } else {
>       >> +                    last_spec += 1;
>       >> +            }
>       >> +    }
>       >> +
>       >> +    /* !!! Home run !!! */
>       >> +    if (likely(last_spec == nb_objs)) {
>       >> +            rte_node_next_stream_move(graph, node, next_index);
>       >> +            return nb_objs;
>       >> +    }
>       >> +
>       >> +    NODE_INCREMENT_XSTAT_ID(node, 0, lookup_err != 0, lookup_err);
>       >> +    held += last_spec;
>       >> +    rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
>       >> +    rte_node_next_stream_put(graph, node, next_index, held);
>       >> +
>       >> +    return nb_objs;
>       >> +}
>       >> +
>       >>   RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_node_ip4_fib_route_add, 25.07)
>       >>   int
>       >>   rte_node_ip4_fib_route_add(uint32_t ip, uint8_t depth, uint16_t next_hop,
>       >> @@ -147,6 +310,7 @@ static struct rte_node_xstats ip4_lookup_fib_xstats = {
>       >>   };
>       >>
>       >>   static struct rte_node_register ip4_lookup_fib_node = {
>       >> +    .process = ip4_lookup_fib_node_process,
>       >>      .name = "ip4_lookup_fib",
>       >>
>       >>      .init = ip4_lookup_fib_node_init,
>       >
>       >--
>       >Regards,
>       >Vladimir
>
>
>
>
>
>--
>
>Regards,
>
>Vladimir

Regards,
Ankur
