Hi All,

Recently I ran into a broken SCTP association caused by large SCTP packets; a
capture is attached to this mail.
The first packet is 1626 bytes long, which exceeds the next hop’s MTU of 1500,
so an ICMP Destination Unreachable message with code 4 (Fragmentation needed)
comes back carrying the correct MTU value of 1500. However, the PMTU is not
adjusted automatically. After debugging the kernel, I found that the ICMP
packet is dropped in a pre-routing netfilter hook, ‘nft_chain_nat_ipv4’, which
is present because ‘CONFIG_NFT_CHAIN_NAT_IPV4’ is enabled. The call sequence is:
PATH1:     NF_INET_PRE_ROUTING → nft_nat_ipv4_in → nf_nat_ipv4_in → 
nf_nat_ipv4_fn → nf_nat_icmp_reply_translation → nf_nat_ipv4_manip_pkt
PATH2:     NF_INET_PRE_ROUTING → nft_nat_ipv4_in → nf_nat_ipv4_in → 
nf_nat_ipv4_fn → nf_nat_packet → l3proto->manip_pkt(nf_nat_ipv4_manip_pkt)
COMMON:  nf_nat_ipv4_manip_pkt → l4proto->manip_pkt(sctp_manip_pkt) → 
skb_make_writable

By the time this call chain reaches ‘skb_make_writable’, the ICMP packet and
its headers are laid out as follows:
MAC(l2) + [VLAN(l2)] + IP(l3) + ICMP(l4) + { payload ⇒ IP + SCTP }
The input parameter ‘hdroff’ is the offset from ‘skb->data’ to the SCTP header
inside the ICMP payload.
So the statement ‘skb_make_writable(skb, hdroff + sizeof(*hdr))’ assumes that
the complete SCTP header is present. However, some network elements (routers,
gateways, and the like) send ICMP errors that contain only the first 8 bytes
(64 bits) following the IP header of the original packet. As the attachment
shows, the ICMP payload contained only the source port, destination port, and
verification tag, i.e. the first 8 bytes of the 12-byte SCTP common header of
the original packet. In that case ‘skb_make_writable’ returns false and the
ICMP packet is dropped. As a result, the upper layer’s ‘err_handler’ is never
called, and SCTP is not notified to update the PMTU.
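
To make the arithmetic concrete, here is a small user-space sketch of the
length check. It is not kernel code: ‘can_make_writable’ is a hypothetical
stand-in for the length test in ‘skb_make_writable’, and I assume that
‘skb->data’ points at the outer IPv4 header and that neither IP header carries
options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sizes taken from the protocol definitions, not from the kernel sources. */
enum {
    IPV4_HDR_LEN    = 20, /* IPv4 header without options (assumption) */
    ICMP_HDR_LEN    = 8,  /* header of an ICMP error message */
    SCTP_COMMON_HDR = 12, /* src port, dst port, verification tag, checksum */
    ICMP_MIN_QUOTE  = 8   /* bytes quoted after the inner IP header */
};

/* Hypothetical stand-in for the length test in skb_make_writable():
 * the request can only succeed if the packet holds at least `need` bytes. */
bool can_make_writable(size_t pkt_len, size_t need)
{
    return need <= pkt_len;
}

/* Offset of the quoted SCTP header, measured from the outer IP header. */
size_t inner_sctp_offset(void)
{
    return IPV4_HDR_LEN + ICMP_HDR_LEN + IPV4_HDR_LEN;
}

/* Length of an ICMP error quoting `quoted` bytes after the inner IP header. */
size_t icmp_error_len(size_t quoted)
{
    return inner_sctp_offset() + quoted;
}
```

Under these assumptions the quoted SCTP header starts at offset 48 and the
packet is 56 bytes long, so asking for 48 + 12 = 60 writable bytes fails,
while asking for 48 + 8 = 56 would succeed.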

I compared this with how TCP is handled. The file
‘net/netfilter/nf_nat_proto_tcp.c’ has a similar function called
‘tcp_manip_pkt’, which contains the following code and comment:
    int hdrsize = 8; /* TCP connection tracking guarantees this much */
     
    /* this could be a inner header returned in icmp packet; in such
       cases we cannot update the checksum field since it is outside of
       the 8 bytes of transport layer headers we are guaranteed */
    if (skb->len >= hdroff + sizeof(struct tcphdr))
        hdrsize = sizeof(struct tcphdr);

    if (!skb_make_writable(skb, hdroff + hdrsize))
        return false;
……………………… and later …………………………
    if (hdrsize < sizeof(*hdr))
        return true;

I think ‘sctp_manip_pkt’ should behave the same way. Shouldn’t it?
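
As a rough illustration of what I mean, below is a user-space model of the
same pattern applied to SCTP. It is only a sketch, not a patch:
‘sctp_manip_sketch’, its parameters, and the ‘csum_updated’ flag are
hypothetical names, a plain buffer stands in for the skb, and the checksum
step is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCTP_GUARANTEED 8   /* bytes quoted in a minimal ICMP error */
#define SCTP_HDR_LEN    12  /* full SCTP common header */

/* Hypothetical model: `buf`/`len` stand in for the skb; `new_port_be` is in
 * network byte order; `manip_src` selects the source or destination port. */
bool sctp_manip_sketch(uint8_t *buf, size_t len, size_t hdroff,
                       bool manip_src, uint16_t new_port_be,
                       bool *csum_updated)
{
    size_t hdrsize = SCTP_GUARANTEED;

    assert(buf != NULL && csum_updated != NULL);
    *csum_updated = false;

    /* Ask for the full header only when the packet really contains it. */
    if (len >= hdroff + SCTP_HDR_LEN)
        hdrsize = SCTP_HDR_LEN;

    /* Models skb_make_writable(skb, hdroff + hdrsize) failing. */
    if (len < hdroff + hdrsize)
        return false;

    /* Both ports sit in the first 4 bytes, so they are always reachable. */
    memcpy(buf + hdroff + (manip_src ? 0 : 2), &new_port_be,
           sizeof(new_port_be));

    /* Truncated inner header: the checksum field is absent, so stop here. */
    if (hdrsize < SCTP_HDR_LEN)
        return true;

    /* Placeholder: real code would recompute the SCTP checksum here. */
    *csum_updated = true;
    return true;
}
```

With a 56-byte ICMP error and ‘hdroff’ of 48, this model rewrites the port and
returns true without touching the checksum, instead of dropping the packet.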

Best regards,
Richard

Attachment: icmp_pmtu.pcap
Description: icmp_pmtu.pcap

Attachment: icmp_pmtu.rar
Description: icmp_pmtu.rar
