Hi again,

Following up with a root cause analysis from a second reproduction (10 devices 
idle for ~24 hours, single Kamailio instance behind HAProxy).
Kamailio listens on two TLS ports: 5061 (SIP) and 50443 (WebSocket).
HAProxy proxies SIP traffic to 5061 and runs HTTP health checks (GET /health) 
against 50443.
10 mobile clients are registered via SIP/TLS through HAProxy.

Root cause
The problem is caused by HAProxy's HTTP health checks to port 50443 colliding 
with existing SIP client connections on port 5061 in Kamailio's internal TCP 
alias hash table.
The alias hash table is keyed on (source_ip, source_port) only; it does not 
consider the destination port.
When HAProxy opens a health check connection from the same ephemeral source 
port that's already in use for a SIP client connection, the two connections map 
to the same alias bucket. The health check connection's alias overwrites the 
SIP connection's alias. When the health check completes and the connection is 
destroyed, the alias entry is deleted, leaving the original SIP connection 
permanently unreachable by peer address.
Reusing the source port is valid at the TCP level because the full 4-tuple 
differs (same source, different destination port), but Kamailio's alias lookup 
only considers the source side.
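
To make the sequence concrete, here is a toy Python model of the behaviour 
described above. It is not Kamailio's code: the AliasTable class, the two key 
functions and the overwrite-on-add rule are all assumptions that mirror what 
the logs suggest. It only shows why a key without the destination side loses 
the SIP connection, and why a 4-tuple key would not.

```python
from typing import Callable, Dict, Optional, Tuple

# (conn_id, src_ip, src_port, dst_ip, dst_port)
Conn = Tuple[int, str, int, str, int]


def source_only_key(conn: Conn) -> tuple:
    """Key on the peer's address only, as the alias table appears to do."""
    _, src_ip, src_port, _, _ = conn
    return (src_ip, src_port)


def four_tuple_key(conn: Conn) -> tuple:
    """Hypothetical alternative: include the local (destination) side too."""
    _, src_ip, src_port, dst_ip, dst_port = conn
    return (src_ip, src_port, dst_ip, dst_port)


class AliasTable:
    """Toy alias table: maps a key to a single connection id."""

    def __init__(self, key_fn: Callable[[Conn], tuple]) -> None:
        self.key_fn = key_fn
        self.aliases: Dict[tuple, int] = {}

    def add(self, conn: Conn) -> None:
        # Assumption: a later connection with the same key replaces the alias.
        self.aliases[self.key_fn(conn)] = conn[0]

    def remove(self, conn: Conn) -> None:
        # Destroying a connection deletes whatever alias sits under its key.
        self.aliases.pop(self.key_fn(conn), None)

    def find(self, conn: Conn) -> Optional[int]:
        return self.aliases.get(self.key_fn(conn))


def replay(key_fn: Callable[[Conn], tuple]) -> Optional[int]:
    """Replay the sequence: SIP connect, health check connect, health check close."""
    sip = (5, "10.42.106.41", 37190, "10.42.106.25", 5061)
    health = (384, "10.42.106.41", 37190, "10.42.106.25", 50443)
    table = AliasTable(key_fn)
    table.add(sip)
    table.add(health)
    table.remove(health)
    return table.find(sip)


print(replay(source_only_key))  # None: the SIP connection is no longer findable
print(replay(four_tuple_key))   # 5: the SIP alias survives the health check
```

With the source-only key the lookup for connection 5 comes back empty after the 
health check closes, which matches the failure in the logs below.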

Proof from logs: connection 5 (port 37190)
Step 0: SIP client connection established at startup:
tcpconn_new(): on port 37190, type 3, socket 81
tcpconn_add(): hashes: 2726:2132:2998, 5
Connection 5 created: 10.42.106.41:37190 → 10.42.106.25:5061. Alias added to 
hash bucket 2132.

Step 1: Alias works normally for hours:
_tcpconn_find(): found connection by peer address (id: 5)
In-dialog requests look up 10.42.106.41:37190, find connection 5 via alias. 
Delivered successfully.

Step 2: HAProxy health check arrives from the same source port:
tcpconn_new(): on port 37190, type 3, socket 81
tcpconn_add(): hashes: 2726:2132:3285, 384
send2child(): ... for activity on [tls:10.42.106.25:50443]
tls_accept: new connection from 10.42.106.41:37190 using TLSv1.3
New incoming connection 384: 10.42.106.41:37190 → 10.42.106.25:50443. Same 
alias hash bucket (2132). Connection 5's alias is overwritten.

Step 3: It's a health check, immediately closed:
parse_msg():  method: <GET>
parse_msg():  uri:    </health>
xhttp_handler(): new fake msg created:  
<GET /health HTTP/1.1   
Via: SIP/2.0/TLS 10.42.106.41:37190   
connection: close>
...
tls_h_tcpconn_close_f(): Closing SSL connection
handle_io(): removing from list ... ([10.42.106.41]:37190 -> 
[10.42.106.41]:50443)

Connection 384 is destroyed. Alias entry for 10.42.106.41:37190 is deleted from 
bucket 2132. Connection 5's alias is gone permanently.

Step 4: Next request to this user fails:
tcpconn_1st_send(): connect 10.42.106.41:37190 failed (RST) Connection refused
Kamailio can't find connection 5 by peer address, so it opens a new outgoing 
connection to 10.42.106.41:37190, which HAProxy rejects with RST.
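
To check whether other connections hit the same pattern, the debug log can be 
scanned for source ports that show up under more than one connection ID. This 
is a rough sketch: the regular expressions are based only on the tcpconn_new() 
and tcpconn_add() lines quoted above, find_port_reuse is a made-up helper name, 
and it assumes the two lines for a connection appear back to back (interleaved 
output from several processes could break that). It also does not know whether 
the earlier connection was still open, so hits are candidates, not proof.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Patterns inferred from the log excerpts in this mail (assumed format).
NEW_RE = re.compile(r"tcpconn_new\(\): on port (\d+), type \d+")
ADD_RE = re.compile(r"tcpconn_add\(\): hashes: ([\d:]+), (\d+)")


def find_port_reuse(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Return {source_port: [conn_id, ...]} for ports seen with several ids."""
    by_port: Dict[int, List[int]] = defaultdict(list)
    pending_port = None
    for line in lines:
        m = NEW_RE.search(line)
        if m:
            # Remember the port until the matching tcpconn_add() line shows up.
            pending_port = int(m.group(1))
            continue
        m = ADD_RE.search(line)
        if m and pending_port is not None:
            by_port[pending_port].append(int(m.group(2)))
            pending_port = None
    return {port: ids for port, ids in by_port.items() if len(ids) > 1}


sample = [
    "tcpconn_new(): on port 37190, type 3, socket 81",
    "tcpconn_add(): hashes: 2726:2132:2998, 5",
    "_tcpconn_find(): found connection by peer address (id: 5)",
    "tcpconn_new(): on port 37190, type 3, socket 81",
    "tcpconn_add(): hashes: 2726:2132:3285, 384",
]
print(find_port_reuse(sample))  # {37190: [5, 384]}
```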

Questions
1. Is the alias hash table intentionally keyed on (source_ip, source_port) 
only, without considering the destination port? 
If so, is there a reason the destination side is excluded?

2. When a new incoming connection is added to a hash bucket that already 
contains an alias for the same (source_ip, source_port), does Kamailio replace 
the existing alias entry? The logs suggest it does, but I'd like to confirm.

3. Would it make sense to include the destination port (or the full 4-tuple) in 
the alias key to prevent this class of collision?

4. Regardless of the alias issue, does it make sense to keep 
tcp_set_otcpid_flag(1) after lookup() as a general best practice for 
deployments behind a TCP load balancer? It seems more robust to route by 
connection ID (which lookup() already populates via msg->otcpid) rather than 
relying on the alias hash table. Or are there downsides/edge cases where this 
flag should not be used?

Thanks,
Joey
__________________________________________________________
Kamailio - Users Mailing List - Non Commercial Discussions -- 
[email protected]
To unsubscribe send an email to [email protected]
Important: keep the mailing list in the recipients, do not reply only to the 
sender!
