On 3/30/26 5:38 PM, Rukomoinikova Aleksandra wrote:
> On 27.03.2026 19:21, Dumitru Ceara wrote:
>> On 3/27/26 3:08 PM, Rukomoinikova Aleksandra wrote:
>>> On 27.03.2026 12:58, Dumitru Ceara wrote:
>>>> On 3/26/26 10:35 PM, Rukomoinikova Aleksandra wrote:
>>>>> Hi,
>>>>>
>>>> Hi Aleksandra,
>>>
>>> Hi Dumitru! Thank you for your answer!
>>>
>> Hi Aleksandra,
>>
>>> The main reason I didn’t consider conntrack on the gw node is that I’m
>>> afraid return traffic
>>> might go through a different gateway. Like, the SYN would go through
>>> one gateway, but the SYN+ACK
>>> from the server might come back through another gateway. In that case,
>>> the next client packet
>>> would be dropped at the first gateway because conntrack would consider
>>> it invalid.
>>>
> 
> Hi Dumitru! Sorry for the late reply!
> 

Hi Aleksandra,

> I think I didn’t explain the target setup very well. By scaling I meant 
> just adding more hosts as gateways.
> I have two levels of routers: the first level hosts the DGP, and the 
> second level already has load balancers attached.
> Packets are distributed from the fabric across the hosts that run the 
> DGP, and the actual load balancing happens at the second router level, 
> connected to the subnet switch.
> In this setup, the return path would use an ECMP route, which doesn’t 
> guarantee that traffic returns through the original GW.

Why not use ecmp-symmetric-reply for these ECMP routes to make the
traffic return on the same path?

> I understand NAT will not work in this kind of architecture. We’re not 
> using NAT for the lb in this setup right now, since we want this for 
> private networks,
> but I think if this approach works, NAT could be addressed the same way 
> (i.e., save conntrack on the node with the vm).
> I don't know yet if there might be any other issues with asymmetry here.
> 
> I'll try to think about how to return the traffic to the original 
> gateway — not sure yet what the best approach is, but I like your idea 
> =) Hopefully something good will come out of it. Thanks!
> 

Thanks, looking forward to the next steps!

Regards,
Dumitru

> It would be great if it works out. The issues we had with the learn 
> action and the hash seem to be resolved now, but I'm still not quite 
> sure how to approach the consistent hashing part.
> 
>> Hmm, I didn't imagine this scenario.  Wouldn't this kind of asymmetric
>> forwarding cause issues anyway?  If it's a DGP, wouldn't reply traffic
>> from the backend come via the gateway anyway?
>>
>> Or do you want to do direct client return on the backend side of things?
>>
>>> Or do you mean using it not as a “real” conntrack in the usual sense,
>>> but just as a convenient
>>> base to extract packet metadata from?
>>> In that case, it is indeed convenient for removing inactive backends,
>>> but we would need to
>>> clean up conntrack entries ourselves after some time, because we can’t
>>> rely on the
>>> conntrack timers anymore since the states won’t be updated properly.
>>> Updating conntrack entries from OVN also seems rather crazy.
>>>
>>>>> I’d like to bring up an idea for discussion regarding the 
>>>>> implementation of stateless load balancing
>>>>> support. I would really appreciate your feedback. I’m also open to 
>>>>> alternative approaches or ideas
>>>>> that I may not have considered. Thank you in advance for your time and 
>>>>> input!
>>>>>
>>>> Thanks for researching this, it's very nice work!
>>>>
>>>>> So, we would like to move towards stateless traffic load balancing. 
>>>>> Here is how it would differ
>>>>> from the current approach:
>>>>> At the moment, when we have a load balancer on a router with DGP 
>>>>> ports, the conntrack state is stored
>>>>> directly on the gateway that is hosting the DGP. So, we select a 
>>>>> backend for the first packet
>>>>> of a connection, and after that, based on the existing conntrack 
>>>>> entry, we no longer perform backend
>>>>> selection. Instead, we rely entirely on the stored conntrack record 
>>>>> for subsequent packets.
>>>>>
>>>>> One of the key limitations of this solution is that gateway nodes 
>>>>> cannot be horizontally scaled,
>>>>> since the conntrack state is stored on a single node. Achieving the 
>>>>> ability to horizontally scale gateway
>>>>> nodes is actually one of our goals.
>>>>>
>>>>> Here are a few possible approaches I see:
>>>>> 1) The idea of synchronizing conntrack state already exists in the 
>>>>> community, but it seems rather
>>>>>      outdated and not very promising.
>>>> Yeah, this on its own has a lot of potential scalability issues so we
>>>> never really pursued it I guess.
>>>>
>>>>> 2) Avoid storing conntrack state on the GW node and instead perform 
>>>>> stateless load balancing for every
>>>>>      packet in the connection, while keeping the conntrack state on 
>>>>> the node where the virtual machine resides. This
>>>>>      is the approach I am currently leaning toward.
>>>> Maybe we _also_ need to store conntrack state on the GW node, I'll
>>>> detail below.
>>>>
>>>>> OVN currently has the use_stateless_nat option for load balancers, 
>>>>> but it comes with several limitations:
>>>>> 1) It works by selecting a backend in a stateless manner for every 
>>>>> packet, while for return traffic
>>>>>      it performs a 1:1 SNAT. So, for traffic from the backend (ip.src 
>>>>> && src.port), the source ip is rewritten to
>>>>>      lb.vip. This only works correctly if a backend belongs to a 
>>>>> single lb.vip. Otherwise, there is a
>>>>>      risk that return traffic will be SNATed to the wrong lb.vip.
>>>>> 2) There is also an issue with preserving tcp sessions when the 
>>>>> number of backends changes. Since backend
>>>>>      selection is done using select(), any change in the number of 
>>>>> backends can break existing tcp sessions.
>>>>>      This problem is also relevant to the solution I proposed above; 
>>>>> I will elaborate on it below.
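
To make limitation (1) concrete, here is a toy Python model of the 1:1
reverse SNAT (illustrative only; the addresses, ports, and the dict-based
table are made up, not OVN code):

```python
# Toy model of use_stateless_nat's 1:1 reverse SNAT: return traffic is
# matched only on the backend side, so the mapping back to a VIP becomes
# ambiguous when one backend belongs to two VIPs.
snat_map = {}  # (backend_ip, backend_port) -> "vip:port"

def add_lb(vip, backends):
    for be in backends:
        # Last writer wins: a shared backend silently overwrites the entry.
        snat_map[be] = vip

add_lb("10.0.0.100:80", [("10.0.1.7", 8080), ("10.0.1.8", 8080)])
add_lb("10.0.0.200:80", [("10.0.1.7", 8080)])  # backend shared with VIP 1

# A reply for a session that entered via 10.0.0.100:80 now gets SNATed
# to the wrong VIP:
assert snat_map[("10.0.1.7", 8080)] == "10.0.0.200:80"
```

Because the reverse mapping is keyed only on the backend side, a backend
shared between two VIPs makes the table ambiguous, which is exactly the
wrong-VIP risk described above.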
>>>>>
>>>>> More details about my idea (I will attach some code below; this is 
>>>>> not production-ready, just a
>>>>> prototype for testing [1]):
>>>>> 1. A packet arrives at the gateway looking like this:
>>>>>      eth.dst == dgp_mac
>>>>>      eth.src == client_mac
>>>>>      ip.dst == lb.vip
>>>>>      ip.src == client_ip
>>>>>      tcp.dst == lb.port
>>>>>      tcp.src == client_port
>>>>> 2. We detect that the packet is addressed to the load balancer → 
>>>>> perform select over the backends.
>>>>> 3. We route traffic directly to the selected backend by changing 
>>>>> eth.dst to the backend’s MAC,
>>>>>      while keeping `ip.dst == lb.vip`.
>>>> So what needs to be persisted is actually the MAC address of the backend
>>>> that was selected (through whatever hashing method).
>>>>
>>>>> 4. We pass the packet further down the processing pipeline. At this 
>>>>> point, it looks like:
>>>>>      eth.dst == backend_mac
>>>>>      eth.src == (MAC of the router port connected to the switch where 
>>>>> the backend resides)
>>>>>      ip.dst == lb.vip
>>>>>      ip.src == client_ip
>>>>>      tcp.dst == lb.port
>>>>>      tcp.src == client_port
>>>>> 5. The packet goes through the ingress pipeline of the switch where 
>>>>> the backend port resides → then it is sent
>>>>>      through the tunnel.
>>>> This is neat indeed because the LS doesn't care about the destination IP
>>>> but it cares about the destination MAC (for the L2 lookup stage)!
>>>>
>>>>> 6. The packet arrives at the node hosting the backend vm, where we 
>>>>>      perform egress load balancing and store the conntrack state.
>>>>> 7. The server responds, and return traffic is SNATed at the ingress 
>>>>> of the switch pipeline.
>>>>>
>>>>> This has already been implemented in code for testing and it's 
>>>>> working. So, storing the conntrack state on the
>>>>> node where the virtual machine resides helps address the first 
>>>>> limitation of the use_stateless_nat option.
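
For reference, steps 1-7 can be sketched as plain header rewrites in
Python (all MACs, IPs, and ports below are placeholders; this illustrates
only the rewrite logic, not the actual OVN pipeline):

```python
# Illustration of the per-hop header rewrites from steps 1-7.
# All values are placeholders; only the rewrite logic matters.

def gw_ingress(pkt, backend_mac):
    """Steps 2-3: the GW picks a backend and rewrites only eth.dst.
    ip.dst stays lb.vip, so no NAT state is needed on the gateway."""
    pkt = dict(pkt)
    pkt["eth.dst"] = backend_mac
    return pkt

def backend_node_egress_lb(pkt, backend_ip, backend_port):
    """Step 6: on the node hosting the VM, perform the actual DNAT
    (and commit to conntrack there, not on the gateway)."""
    pkt = dict(pkt)
    pkt["ip.dst"] = backend_ip
    pkt["tcp.dst"] = backend_port
    return pkt

def backend_node_reply_snat(pkt, vip, vip_port):
    """Step 7: reverse translation for return traffic, back to the VIP
    before the reply leaves the backend's node."""
    pkt = dict(pkt)
    pkt["ip.src"] = vip
    pkt["tcp.src"] = vip_port
    return pkt

# Packet as it arrives at the gateway (step 1).
pkt = {"eth.dst": "dgp_mac", "eth.src": "client_mac",
       "ip.dst": "10.0.0.100", "ip.src": "192.168.1.5",
       "tcp.dst": 80, "tcp.src": 34567}

pkt = gw_ingress(pkt, "backend_mac")                  # steps 2-3
pkt = backend_node_egress_lb(pkt, "10.0.1.7", 8080)   # step 6

reply = {"eth.dst": "client_mac", "eth.src": "backend_mac",
         "ip.src": "10.0.1.7", "ip.dst": "192.168.1.5",
         "tcp.src": 8080, "tcp.dst": 34567}
reply = backend_node_reply_snat(reply, "10.0.0.100", 80)  # step 7
```

The key point the sketch tries to show is that the gateway touches only
the destination MAC, so all NAT state lives on the backend's node.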
>>>>>
>>>>> Regarding the second issue: we need to ensure session persistence 
>>>>> when the number of backends changes.
>>>>> In general, as I understand it, stateless load balancing in such 
>>>>> systems is typically based on
>>>>> consistent hashing. However, consistent hashing alone is not 
>>>>> sufficient, since it does not preserve
>>>>> 100% of connections. I see the solution as some kind of additional 
>>>>> layer on top of consistent hashing.
>>>>>
>>>>> First, about consistent hashing in OVS (correct me if I'm wrong):
>>>>> Currently, out of the two hashing methods in OVS, only `hash` 
>>>>> provides consistency, since it is based
>>>>> on rendezvous hashing and relies on the bucket id. For this to work 
>>>>> correctly, bucket_ids must be
>>>>> preserved when the number of backends changes. However, OVN recreates 
>>>>> the OpenFlow group every time the number
>>>>> of backends changes. At the moment, when the number of backends 
>>>>> changes, we create a bundle where the first piece
>>>>> is an ADD group message, and the subsequent messages are 
>>>>> INSERT_BUCKET/REMOVE_BUCKET messages. If we could rewrite
>>>>> this part to support granular insertion/removal of backends in the 
>>>>> group, by using INSERT_BUCKET/REMOVE_BUCKET
>>>>> without recreating the group, we could make backend selection 
>>>>> consistent for the hash method. I wasn’t able
>>>>> to fully determine whether there are any limitations to this 
>>>>> approach, but I did manually test session
>>>>> preservation using ovs-ofctl while inserting/removing buckets. When 
>>>>> removing buckets from a group,
>>>>> sessions associated with the other buckets were preserved.
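
The bucket_id-preservation argument can be illustrated with a toy
rendezvous-hash model (pure Python, with sha256 standing in for OVS's
selection_method=hash; it assumes bucket ids stay stable across
membership changes):

```python
import hashlib

def score(conn, bucket_id):
    # Stable per-(connection, bucket) score; sha256 stands in for the
    # real OVS hash. What matters is that it depends only on the pair.
    h = hashlib.sha256(f"{conn}:{bucket_id}".encode()).digest()
    return int.from_bytes(h[:8], "big")

def select_bucket(conn, bucket_ids):
    # Rendezvous (highest-random-weight): pick the bucket with the best score.
    return max(bucket_ids, key=lambda b: score(conn, b))

buckets = [1, 2, 3, 4]
conns = [f"10.0.0.{i}:40{i:02d}->vip:80" for i in range(50)]
before = {c: select_bucket(c, buckets) for c in conns}

# Remove bucket 3 without renumbering the others (granular REMOVE_BUCKET,
# as opposed to recreating the whole group with fresh bucket ids).
after = {c: select_bucket(c, [b for b in buckets if b != 3]) for c in conns}

# Only connections that hashed to the removed bucket move.
moved = [c for c in conns if before[c] != after[c]]
assert all(before[c] == 3 for c in moved)
```

Removing a bucket only remaps the connections that were on it; every
other connection keeps its backend, which is exactly the property that
recreating the whole group (with fresh bucket ids) destroys.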
>>>>>
>>>>> At the same time, I understand the downsides of using hash: it is 
>>>>> expensive, since we install dp_flows that match
>>>>> the full 5-tuple of the connection, which leads to a large number of 
>>>>> datapath flows. Additionally, there is an
>>>>> upcall for every SYN packet. Because of this, I’m not sure how 
>>>>> feasible it is, but it might be worth
>>>>> thinking about ways to make dp_hash consistent. I don’t yet have a 
>>>>> concrete proposal here, but I’d
>>>>> really appreciate any ideas or suggestions in this direction.
>>>>>
>>>>> About the additional layer on top of consistent hashing: I’ve come 
>>>>> up with two potential approaches, and I’m not
>>>>> yet sure which one would be better:
>>>>> 1) Using the learn action
>>>>> 2) Hash-based sticky sessions in OVS
>>>>>
>>>> I'd say we can avoid the need for consistent hashing if we add a third
>>>> alternative here:
>>>>
>>>> In step "3" of your idea above, on the GW node, after we select the
>>>> backend (MAC) somehow (e.g., we could still just use dp-hash), we commit
>>>> that session to conntrack and store the MAC address as metadata (e.g. in
>>>> the ct_label like we do for ecmp-symmetric-reply routes).
>>>>
>>>> Subsequent packets on the same session don't actually need to be hashed:
>>>> we'll get the MAC to be used as destination from the conntrack state, so
>>>> any changes to the set of backends won't be problematic.  We would have
>>>> to flush conntrack for backends that get removed (we do that for regular
>>>> load balancers too).
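
A minimal Python model of this conntrack-with-MAC-metadata idea (the dict
stands in for conntrack entries with the MAC in ct_label; the hash choice
is deliberately arbitrary, since only the first packet uses it):

```python
import hashlib

conntrack = {}  # 5-tuple -> backend MAC, standing in for ct entry + ct_label

def select_mac(five_tuple, backend_macs):
    # Any hash works for the *first* packet; consistency is not required.
    h = int(hashlib.sha256(repr(five_tuple).encode()).hexdigest(), 16)
    return backend_macs[h % len(backend_macs)]

def forward(five_tuple, backend_macs):
    # First packet: pick a backend and "commit" the MAC to conntrack.
    # Subsequent packets: reuse the committed MAC, never re-hash.
    if five_tuple not in conntrack:
        conntrack[five_tuple] = select_mac(five_tuple, backend_macs)
    return conntrack[five_tuple]

backends = ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:03"]
conn = ("192.168.1.5", 34567, "10.0.0.100", 80, "tcp")

first = forward(conn, backends)
# The backend set changes, but the established session keeps its MAC.
later = forward(conn, backends + ["aa:bb:cc:00:00:04"])
assert first == later

# Removing a backend means flushing its committed entries (like ct_flush).
removed = "aa:bb:cc:00:00:02"
conntrack = {k: v for k, v in conntrack.items() if v != removed}
```

Once committed, a session survives backend-set changes; only the entries
pointing at a removed backend need flushing, as with regular load
balancers that have ct_flush=true.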
>>>>
>>>> I know this might sound counterintuitive because your proposal was to
>>>> make the load balancing "stateless" on the GW node and I'm actually
>>>> suggesting that the GW node processing rely on conntrack (stateful).
>>>>
>>>> But..
>>>>
>>>> You mentioned that one of your goals is:
>>>>
>>>>> One of the key limitations of this solution is that gateway nodes 
>>>>> cannot be horizontally scaled,
>>>>> since the conntrack state is stored on a single node. Achieving the 
>>>>> ability to horizontally scale gateway
>>>>> nodes is actually one of our goals.
>>>> With my proposal I think you'll achieve that.  We'd have to be careful
>>>> with the case when a DGP moves to a different chassis (HA failover) but
>>>> for that we could combine this solution with using a consistent hash on
>>>> all chassis for the backend selection (something you mention you're
>>>> looking into anyway).
>>>>
>>>> Now, if your goal is to avoid conntrack completely on the gateway
>>>> chassis, my proposal breaks that.  Then we could implement a similar
>>>> solution as you suggested with the learn action, though that might be
>>>> heavier on the datapath than using conntrack.  I don't have numbers to
>>>> back this up but maybe we can find a way to benchmark it.
>>>>
>>>>> With an additional layer, we can account for hash rebuilds and 
>>>>> inaccuracies introduced by consistent
>>>>> hashing. It’s not entirely clear how to handle these cases, 
>>>>> considering that after some time we need to
>>>>> remove flows when using `learn`, or clean up connections in the hash. 
>>>>> We also need to manage
>>>>> connection removal when the number of backends changes.
>>>> Or when backends go away.. which makes it more complex to manage from
>>>> ovn-controller if we use learn flows, I suspect (if we use conntrack we
>>>> have the infra to do that already, we use it for regular load balancers
>>>> that have ct_flush=true).
>>>>
>>>>> By using such a two-stage approach, ideally, we would always preserve 
>>>>> the session for a given connection,
>>>>> losing connections only in certain cases. For example, suppose we 
>>>>> have two GW nodes:
>>>>> * The first SYN packet arrives at the first gw — we record this 
>>>>> connection in our additional layer
>>>>>     (using learn or the OVS hash) and continue routing all packets 
>>>>> for this connection based on it.
>>>>> * If a packet from this same connection arrives at the second GW, the 
>>>>> additional layer there
>>>>>     (learn or the OVS hash) will not have the session. If the number 
>>>>> of backends has changed since the backend was selected
>>>>>     for this connection on the first GW, there is a chance it will 
>>>>> not hit the same backend and the session will be lost.
>>>>>
>>>>> The downsides I see for the first solution are:
>>>>> 1. More complex handling of cleanup for old connections and removal 
>>>>> of hashes for expired backends.
>>>>>
>>>>> The downsides of the learn action are:
>>>>> 1. High load on OVS at a high connection rate.
>>>>> 2. If I understand correctly, this also results in a large number of 
>>>>> dp_flows that track each individual 5-tuple hash.
>>>>>
>>>>> I would be glad to hear your thoughts on a possible implementation 
>>>>> of dp_hash consistency, as well
>>>>> as feedback on the overall architecture. These are all the ideas I’ve 
>>>>> been able to develop so far. I would greatly
>>>>> appreciate any criticism, suggestions, or alternative approaches. It 
>>>>> would also be interesting to know
>>>>> whether anyone else is interested in this new optional mode for load 
>>>>> balancers =)
>>>>>
>>>>> [1] https://github.com/Sashhkaa/ovn/commits/deferred-lb-dnat/ - Here's 
>>>>> the code, just in case
>>>>>
>>>> Hope my thoughts above make some sense and that I don't just create
>>>> confusion. :)
>>>>
>> Regards,
>> Dumitru
>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
