Bram Yvahk wrote:
> Steffen Klassert wrote:
>> On Sun, Mar 17, 2019 at 11:37:55PM +0000, Bram Yvahk wrote:
>>> We've experienced an issue with VTI when the path-mtu is smaller than
>>> the size of the "client" packet.
>>>
>>> What happens: IPv4 packet from the client (i.e. another system in the
>>> LAN) attempts to transmit some data; IPv4 header shows that 'DF' bit is
>>> not set but still the client receives ICMPv4 "need-to-frag" message
>>> [which the client does not expect and ignores].
>>>
>>> Example:
>>>   $ ping -s 1300 -M dont -c5 192.168.235.2
>>>   PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>>>   From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)
>>>
>>>   --- 192.168.235.3 ping statistics ---
>>>   5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms
>> Hm, this works here. Can you show how you setup the vti device?
>> Some tunnel configuration options (set ttl etc.) force to have
>> the DF bit set.
>
> I will provide these details tomorrow.
> What I can say is that ttl was set to inherit.
>
vti device is created (on Gateway A) using:
  $ ip tun add name vti0 mode vti ikey 1 okey 1 local <ip gateway A>

  $ ip link show dev vti0
  46: vti0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/ipip <ip gateway A> brd 0.0.0.0

  $ ip tun show name vti0
  vti0: ip/ip remote any local <ip gateway A> ttl inherit key 1

[I've also done setup with mtu 1400 - all remains the same]

xfrm state:
  src <ip gateway B> dst <ip gateway A>
      proto esp spi 0xcd76a4a9 reqid 16389 mode tunnel
      replay-window 32 flag nopmtudisc af-unspec
      auth-trunc hmac(sha1) 0x08e1ce16b1f7f9039f9cc7421cf61010c029efc3 96
      enc cbc(aes) 0x22c7aacd9680a10a52b0c5670b7d850c35ba17f7c7dc6c963252cdc311b1f4d5
      anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
  src <ip gateway A> dst <ip gateway B>
      proto esp spi 0x8f2988c7 reqid 16389 mode tunnel
      replay-window 32 flag nopmtudisc af-unspec
      auth-trunc hmac(sha1) 0x229bbe490606ddcc6a68332babd498001591c6bf 96
      enc cbc(aes) 0xd598dba419bfc45232580e54d517aae6a77c3328a51ebb3321802b89cc51ae43
      anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000

(same behaviour with/without nopmtudisc; nopmtudisc only makes a
difference for packets from 'client A' that *do* have the DF bit set)

>
> When testing this there is one important bit - which in hindsight I
> should've included in the previous message - the (IPsec) Gateway A
> needs to know the path-mtu to (IPsec) Gateway B.
>
> Some ways to accomplish this:
> - transmit an ICMP with DF bit set and a larger packet size from
>   Gateway A to Gateway B
> - ensure the "nopmtudisc" option is *not* set in the xfrm state
>   and then let client A transmit an ICMP *with* DF bit set to
>   client B. [when "nopmtudisc" is set then all outgoing IPv4 ESP
>   packets have the DF bit cleared, when "nopmtudisc" is not set then
>   DF bit is copied from the client packet]
>
> For testing purposes I recommend to do the ping from Gateway A to
> Gateway B.
> (Otherwise tcpdumps/traffic get a bit more confusing.)
>
> A more in-depth description of what happens:
>
> Setup:
> ======
>
> |----------|   |-----------|   |-------|   |-----------|   |----------|
> | client A |---| Gateway A |---| Hop H |---| Gateway B |---| client B |
> |----------|   |-----------|   |-------|   |-----------|   |----------|
>
> - testing with linux 4.14.95 (setup with more recent kernel is WIP)
> - link mtu between client A and Gateway A: 1500
> - link mtu between Gateway A and Hop H: 1500
> - link mtu between Hop H and Gateway B: 1280
> - link mtu between Gateway B and client B: 1500
> - path-mtu between Gateway A and Gateway B: 1280
> - IPsec tunnel over *IPv4* between Gateway A and Gateway B
> - tunneling IPv4 over the IPsec tunnel
> - testing with VTI
>
> Scenario:
> ==========
>
> Before starting it's important to ensure that:
> - Gateway A does *not* know the path-mtu to Gateway B
> - Client A does *not* know the path-mtu to Gateway B

On Gateway A:
  $ ip route get <ip of gateway B>
  <ip gateway B> via <hop H> dev eth1 src <ip gateway A> uid 0
      cache
=> no mtu shown --> path-mtu not yet known

>
> * Step 1: client A: $ ping -M dont -s 1300 ip_of_client_B
> - IPv4 ICMP packet of client A does not have DF bit set
> - IPv4 ESP packet of Gateway A does not have DF bit set
> - Hop H receives a IPv4 ESP packet that is too large for link-mtu
>   between Hop H and Gateway B: it fragments the IPv4 ESP packet.
> - Gateway B receives 2 IPv4 fragmented packets
> - (Client B receives one IPv4 ICMP packet from client A)

tcpdump on Gateway A:
  - from client A it receives:
      IP (tos 0x0, ttl 64, id 46797, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308
  - it transmits (to Gateway B):
      IP (tos 0x0, ttl 64, id 10932, offset 0, flags [none], proto ESP (50), length 1400)
          gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1380

tcpdump on Gateway B:
  - it receives (from Gateway A):
      IP (tos 0x0, ttl 63, id 10932, offset 0, flags [+], proto ESP (50), length 1276)
          gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1256
      IP (tos 0x0, ttl 63, id 10932, offset 1256, flags [none], proto ESP (50), length 144)
          gateway_A > gateway_B: ip-proto-50
  - it transmits (to client B):
      IP (tos 0x0, ttl 62, id 46797, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308

=> Hop H fragmented the IPv4 packets. This is expected: DF bit is not set
   on ESP packets and Gateway A does not know path-mtu to Gateway B

>
> * Step 2: Gateway A: $ ping -M do -s 1300 ip_of_gateway_B
> - IPv4 ICMP packet of Gateway A does have DF bit set
> - Gateway A receives a 'need to frag' ICMP from Hop H

tcpdump on Gateway A:
  - it transmits (local packet - to Gateway B):
      IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 1328)
          gateway_A > gateway_B: ICMP echo request, id 28176, seq 1, length 1308
  - it receives (from Hop H):
      IP (tos 0xc0, ttl 64, id 52788, offset 0, flags [none], proto ICMP (1), length 576)
          hop_H > gateway_A: ICMP 1.1.235.254 unreachable - need to frag (mtu 1280), length 556
              IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 1328)
                  gateway_A > gateway_B: ICMP echo request, id 28176, seq 1, length 1308

=> Hop H sends need-to-frag (mtu 1280). This is expected: DF bit is set
   on the ICMP packet so Hop H should not fragment.
on Gateway A:
  $ ip route get <ip of gateway B>
  <ip gateway B> via <hop H> dev eth1 src <ip gateway A> uid 0
      cache expires 17sec mtu 1280
=> path-mtu known to be 1280

> * Step 3: client A: $ ping -M dont -s 1300 ip_of_client_B
> - IPv4 ICMP packet of client A does not have DF bit set
> - Gateway A: it processes this packet in VTI module and detects that
>   packet size > path-mtu and then sends a 'need to frag' ICMP to
>   client A. [this is the code I patched]

tcpdump on Gateway A:
  - from client A it receives:
      IP (tos 0x0, ttl 64, id 46798, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 7063, seq 1, length 1308
  - it transmits to client A:
      IP (tos 0xc0, ttl 64, id 59290, offset 0, flags [none], proto ICMP (1), length 576)
          gateway_A > client_A: ICMP client_B unreachable - need to frag (mtu 1214), length 556
              IP (tos 0x0, ttl 63, id 46798, offset 0, flags [none], proto ICMP (1), length 1328)
                  client_A > client_B: ICMP echo request, id 7063, seq 1, length 1308

>
> => the critical bit in the above is that Gateway A learns
>    the path-mtu to Gateway B. If it doesn't then it keeps
>    assuming path-mtu is 1500 and the check in VTI will not
>    trigger (since path-mtu of 1500 > packet size)
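
For anyone cross-checking the numbers in the captures (1400, 1276/144 and
1214): they can be reproduced with plain arithmetic. The sketch below is not
taken from the kernel source. It assumes the usual ESP tunnel-mode layout for
the algorithms shown in the xfrm state (cbc(aes) with a 16-byte IV and block
size, hmac(sha1) truncated to 96 bits) and a 20-byte outer IPv4 header without
options. The function names are made up for illustration.

```python
# Assumed sizes, derived from the xfrm state quoted in this thread
# (ESP tunnel mode over IPv4, enc cbc(aes), auth-trunc hmac(sha1) 96).
IP_HDR = 20       # outer IPv4 header, no options
ESP_HDR = 8       # SPI + sequence number
IV_LEN = 16       # AES-CBC initialisation vector
BLOCK = 16        # AES block size (ciphertext is padded to this)
ICV_LEN = 12      # HMAC-SHA1 truncated to 96 bits
ESP_TRAILER = 2   # pad-length byte + next-header byte


def esp_packet_len(inner_len):
    """Outer IPv4 length of the ESP packet that carries an inner packet."""
    # Inner packet plus trailer, rounded up to a whole cipher block.
    padded = -(-(inner_len + ESP_TRAILER) // BLOCK) * BLOCK
    return IP_HDR + ESP_HDR + IV_LEN + padded + ICV_LEN


def inner_mtu(path_mtu):
    """Largest inner packet whose ESP packet still fits into path_mtu."""
    room = path_mtu - IP_HDR - ESP_HDR - IV_LEN - ICV_LEN
    # Only whole cipher blocks fit; the trailer takes 2 bytes of the last one.
    return (room // BLOCK) * BLOCK - ESP_TRAILER


def ipv4_fragments(total_len, link_mtu):
    """Total lengths of the IPv4 fragments produced for one packet."""
    if total_len <= link_mtu:
        return [total_len]
    payload = total_len - IP_HDR
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_frag = (link_mtu - IP_HDR) // 8 * 8
    sizes = []
    while payload > per_frag:
        sizes.append(IP_HDR + per_frag)
        payload -= per_frag
    sizes.append(IP_HDR + payload)
    return sizes


print(esp_packet_len(1328))        # 1400: ESP packet sent by Gateway A in step 1
print(ipv4_fragments(1400, 1280))  # [1276, 144]: fragments seen on Gateway B
print(inner_mtu(1280))             # 1214: mtu in the need-to-frag sent to client A
```

Under these assumptions the 1328-byte ping becomes a 1400-byte ESP packet,
Hop H splits it into 1276 + 144 bytes, and a path-mtu of 1280 leaves room for
an inner packet of at most 1214 bytes, which matches the mtu reported to
client A in step 3.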