Bram Yvahk wrote:
> (What I think should happen in this particular case: do not send a
> PKT_TOOBIG to the client but instead transmit fragmented IPv6 ESP
> packets to accommodate the path-mtu)

A follow-up to clarify my thinking (since my original mail might not be
clear enough).
Let me start by stating some of the (imo) obvious things:
- IPv4 can be fragmented by hops on the route
- IPv6 can only be fragmented by the originating source
- Minimum mtu for IPv4 is 68 (576 is the minimum datagram size every host must accept)
- Minimum mtu for IPv6 is 1280
- IPsec has some overhead

Setup:
======

|----------|   |-----------|   |-------|   |-----------|   |----------|
| client A |---| Gateway A |---| Hop H |---| Gateway B |---| client B |
|----------|   |-----------|   |-------|   |-----------|   |----------|

- testing with linux 4.14.95 (setup with more recent kernel is WIP)
- link mtu between client A and Gateway A: 1500
- link mtu between Gateway A and Hop H: 1500
- link mtu between Hop H and Gateway B: 1280
- link mtu between Gateway B and client B: 1500
- path-mtu between Gateway A and Gateway B: 1280
- IPsec tunnel over IPv6 between Gateway A and Gateway B
- tunneling IPv4 over the IPsec tunnel
- tunneling IPv6 over the IPsec tunnel
- testing with XFRM (not with VTI since this has issues)
  - (ip_vti module not loaded)
  - (ip6_vti module not loaded)

Example with IPv4:
==================

Let's first take a look and see what happens with IPv4.
(I know IPv4 can be fragmented by all hops but that's not relevant)

- path-mtu between 'Gateway A' and 'Gateway B' is unknown
- 'client A' sends an ICMP to 'client B': size 1300, DF bit *not* set
  * 'Gateway A' encrypts this and transmits one IPv6 ESP packet
    (size of outgoing packet: 1380 bytes)
  * 'Gateway A' receives PKT_TOOBIG ICMPv6 from 'Hop H' (max mtu: 1280)
  * 'Gateway A' now knows the path-mtu

  (truncated) output from tcpdump:
    IP6: ESP(spi=0xeff48047,seq=0xa), length 1380
    IP6: ICMP6, packet too big, mtu 1280, length 1240

- path-mtu between 'Gateway A' and 'Gateway B' is known
- 'client A' sends an ICMP to 'client B': size 1300, DF bit *not* set
  * 'Gateway A' encrypts this and transmits two fragmented IPv6 packets

  (truncated) output from tcpdump:
    IP6: frag (0|1232) ESP(spi=0xeff48047,seq=0xb), length 1232
    IP6: frag (1232|148)

==> the IPv4 packet was *not* fragmented; the encrypted data [which is the
IPv4 packet] was transmitted as two fragmented packets by 'Gateway A'.
('Gateway A' is the originator of the ESP packet)

Example with IPv6:
==================

Now let's compare this with IPv6. Only the originating source can fragment
the packets.

- path-mtu between 'Gateway A' and 'Gateway B' is unknown
- 'client A' sends an ICMPv6 to 'client B': size 1300
  * 'Gateway A' encrypts this and transmits one IPv6 ESP packet
    (size of outgoing packet: 1396 bytes)
  * 'Gateway A' receives PKT_TOOBIG ICMPv6 from 'Hop H' (max mtu: 1280)
  * 'Gateway A' now knows the path-mtu

  (truncated) output from tcpdump:
    IP6: ESP(spi=0xeff48048,seq=0x5), length 1396
    IP6: ICMP6, packet too big, mtu 1280, length 1240

- 'client A' sends an ICMPv6 to 'client B': size 1300
  * 'client A' receives PKT_TOOBIG ICMPv6 from 'Gateway A' (max mtu: 1198)

    IP6: ICMP6, echo request, seq 1, length 1300
    IP6: ICMP6, packet too big, mtu 1198, length 1240

- 'Gateway A' sending an ICMPv6 to 'client B': this now fails regardless
  of the size (even with -s 1)
  (sendto call returns EINVAL); a ping from 'client A' to 'client B' still
  results in the PKT_TOOBIG; the only way to fix this appears to be to make
  the kernel forget the path-mtu [this might be another bug? I could
  understand large packets not getting through, but small ones? -- I'll
  verify this on a more recent kernel]

What I would've expected to happen is that 'Gateway A' would send out two
fragmented IPv6 packets containing the encrypted data. 'Gateway A' is the
originator of the IPv6 ESP packet, so it can fragment these. This is
similar to how it's done for IPv4. When the ESP packet is fragmented, the
IPv6 packet from 'client A' is left intact/not fragmented. With my
- limited - understanding of the IPv6 RFC, I think this would be allowed.

And just for the sake of argument: let's say the IPsec tunnel was not using
IPv6 but IPv4: would it then be OK to fragment the IPv4 ESP packets when
the encrypted data is an IPv6 packet?

A very quick-and-dirty patch for which I do *not* know what impact it has:

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index f112fef..066c311 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -684,14 +684,20 @@ static u32 esp6_get_mtu(struct xfrm_state *x, int mtu)
 	struct crypto_aead *aead = x->data;
 	u32 blksize = ALIGN(crypto_aead_blocksize(aead), 4);
 	unsigned int net_adj;
+	int mtu2;
 
 	if (x->props.mode != XFRM_MODE_TUNNEL)
 		net_adj = sizeof(struct ipv6hdr);
 	else
 		net_adj = 0;
 
-	return ((mtu - x->props.header_len - crypto_aead_authsize(aead) -
+	mtu2 = ((mtu - x->props.header_len - crypto_aead_authsize(aead) -
 		 net_adj) & ~(blksize - 1)) + net_adj - 2;
+
+	if (mtu2 < IPV6_MIN_MTU) {
+		return IPV6_MIN_MTU;
+	}
+	return mtu2;
 }
 
 static int esp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,

=> with this patch: the IPv6 ESP packet is now fragmented,
i.e. a ping from 'client A' to 'client B' shows:
  IP6: frag (0|1232) ESP(spi=0x410e6a38,seq=0x1a), length 1232
  IP6: frag (1232|68)
=> same as IPv4