On 8/3/20 11:53 PM, Stefano Brivio wrote:
> It's currently possible to bridge Ethernet tunnels carrying IP
> packets directly to external interfaces without assigning them
> addresses and routes on the bridged network itself: this is the case
> for UDP tunnels bridged with a standard bridge or by Open vSwitch.
>
> PMTU discovery is currently broken with those configurations, because
> the encapsulation effectively decreases the MTU of the link, and
> while we are able to account for this using PMTU discovery on the
> lower layer, we don't have a way to relay ICMP or ICMPv6 messages
> needed by the sender, because we don't have valid routes to it.
>
> On the other hand, as a tunnel endpoint, we can't fragment packets
> as a general approach: this is for instance clearly forbidden for
> VXLAN by RFC 7348, section 4.3:
>
>   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
>   fragment encapsulated VXLAN packets due to the larger frame size.
>   The destination VTEP MAY silently discard such VXLAN fragments.
>
> The same paragraph recommends that the MTU over the physical network
> accommodates for encapsulations, but this isn't a practical option for
> complex topologies, especially for typical Open vSwitch use cases.
>
> Further, it states that:
>
>   Other techniques like Path MTU discovery (see [RFC1191] and
>   [RFC1981]) MAY be used to address this requirement as well.
>
> Now, PMTU discovery already works for routed interfaces, we get
> route exceptions created by the encapsulation device as they receive
> ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
> we already rebuild those messages with the appropriate MTU and route
> them back to the sender.
>
> Add the missing bits for bridged cases:
>
> - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
>   to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
>   RFC 4443 section 2.4 for ICMPv6. This function is already called by
>   UDP tunnels
>
> - a new function generating those ICMP or ICMPv6 replies. We can't
>   reuse icmp_send() and icmp6_send() as we don't see the sender as a
>   valid destination. This doesn't need to be generic, as we don't
>   cover any other type of ICMP errors given that we only provide an
>   encapsulation function to the sender
>
> While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
> we might receive GSO buffers here, and the passed headroom already
> includes the inner MAC length, so we don't have to account for it
> a second time (that would imply three MAC headers on the wire, but
> there are just two).
>
> This issue became visible while bridging IPv6 packets with 4500 bytes
> of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
> bytes of encapsulation headroom, we would advertise MTU as 3950, and
> we would reject fragmented IPv6 datagrams of 3958 bytes size on the
> wire. We're exclusively dealing with network MTU here, though, so we
> could get Ethernet frames up to 3964 octets in that case.
>
> v2:
> - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
> - split IPv4/IPv6 functions (David Ahern)
>
> Signed-off-by: Stefano Brivio <sbri...@redhat.com>
> ---
>  drivers/net/bareudp.c     |   5 +-
>  drivers/net/geneve.c      |   5 +-
>  drivers/net/vxlan.c       |   4 +-
>  include/net/dst.h         |  10 --
>  include/net/ip_tunnels.h  |   2 +
>  net/ipv4/ip_tunnel_core.c | 244 ++++++++++++++++++++++++++++++++++++++
>  6 files changed, 254 insertions(+), 16 deletions(-)
>

Much easier to follow.

Reviewed-by: David Ahern <dsah...@gmail.com>
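
For readers checking the numbers in the quoted commit message, the MTU arithmetic can be sketched as below. This is only an illustration, not the kernel code: the function names are hypothetical, the 50-byte headroom is assumed to break down as inner Ethernet (14) + outer IPv4 without options (20) + UDP (8) + GENEVE without options (8), and GSO buffers are ignored.

```python
# Illustrative sketch of the MTU arithmetic from the commit message.
# All names here are hypothetical; this is not the kernel implementation.

ETH_HLEN = 14          # inner Ethernet (MAC) header
IPV4_HLEN = 20         # outer IPv4 header, assuming no options
UDP_HLEN = 8           # outer UDP header
GENEVE_BASE_HLEN = 8   # GENEVE header, assuming no options
IPV6_HLEN = 40         # inner IPv6 header
IPV6_FRAG_HLEN = 8     # IPv6 fragment extension header

# Headroom as passed by the tunnel driver: it already includes the
# inner MAC header (assumed breakdown, totalling 50 bytes).
HEADROOM = ETH_HLEN + IPV4_HLEN + UDP_HLEN + GENEVE_BASE_HLEN


def advertised_mtu(pmtu, headroom=HEADROOM):
    """Network MTU advertised to the sender for a given lower-layer PMTU."""
    return pmtu - headroom


def largest_fragment_frame(mtu):
    """Ethernet frame length of the largest non-final IPv6 fragment.

    Fragment payloads must be a multiple of 8 octets.
    """
    payload = (mtu - IPV6_HLEN - IPV6_FRAG_HLEN) // 8 * 8
    return ETH_HLEN + IPV6_HLEN + IPV6_FRAG_HLEN + payload


def old_check_exceeds(frame_len, pmtu):
    """Old behaviour: frame length (inner MAC included) compared to the
    network MTU, so the inner MAC header is counted a second time."""
    return frame_len > advertised_mtu(pmtu)


def new_check_exceeds(frame_len, pmtu):
    """Accurate check: only the network-layer length is compared."""
    return frame_len - ETH_HLEN > advertised_mtu(pmtu)


if __name__ == "__main__":
    mtu = advertised_mtu(4000)
    frame = largest_fragment_frame(mtu)
    print(mtu, frame)
    print(old_check_exceeds(frame, 4000), new_check_exceeds(frame, 4000))
```

With a PMTU of 4000, the sketch gives an advertised MTU of 3950 and a largest fragment of 3958 bytes on the wire, which the old comparison rejects and the accurate one accepts. Frames up to 3950 + 14 = 3964 octets pass the accurate check.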