Baoyuantop commented on PR #13029:
URL: https://github.com/apache/apisix/pull/13029#issuecomment-4044429025
Thanks for the explanation. I agree that `delayed_clear` has a problem in this
scenario -- you're right that stale IPs can persist. But switching to `clear()`
introduces a different issue: it wipes the health status of all nodes,
including the ones that haven't changed, causing a temporary health-status
reset on every upstream update.
Here's what happens with `clear()`:
1. Upstream has nodes A, B, C, D (all with accumulated health status)
2. K8s scales down to A, B
3. clear() removes all target data from shm (A, B, C, D)
4. New checker re-adds A, B as healthy by default
5. If A was actually unhealthy, traffic briefly flows to a broken node until
the next health check cycle detects the failure again
The correct approach is a diff-based removal: compare the old target list
with the new node list, and only remove the nodes that no longer exist (C, D),
while preserving A and B's health status.
Something like this in `timer_create_checker`:
```lua
local existing_checker = working_pool[resource_path]
if existing_checker then
    -- diff-based removal: only remove targets that no longer exist
    local new_nodes_set = {}
    for _, node in ipairs(upstream.nodes) do
        new_nodes_set[node.host .. ":" .. (node.port or "")] = true
    end

    for _, target in ipairs(existing_checker.checker.targets) do
        local key = target.ip .. ":" .. (target.port or "")
        if not new_nodes_set[key] then
            existing_checker.checker:remove_target(target.ip, target.port,
                                                   target.hostname)
        end
    end

    existing_checker.checker:stop()
end
```
This removes only the stale IPs while preserving health history for
unchanged nodes.
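To show the diff logic in isolation, here is a minimal, self-contained sketch in plain Lua. The `stale_targets` helper and the sample data are hypothetical (not part of the APISIX codebase); it only demonstrates the set-based comparison using the same `ip:port` key scheme as above:

```lua
-- Hypothetical helper: given the old target list and the new node list,
-- return only the targets that no longer exist in the new list.
local function stale_targets(old_targets, new_nodes)
    -- build a set of "host:port" keys for the new nodes
    local new_set = {}
    for _, node in ipairs(new_nodes) do
        new_set[node.host .. ":" .. (node.port or "")] = true
    end

    -- any old target whose key is absent from the set is stale
    local stale = {}
    for _, target in ipairs(old_targets) do
        if not new_set[target.ip .. ":" .. (target.port or "")] then
            stale[#stale + 1] = target
        end
    end
    return stale
end

-- K8s scaled the upstream down from A, B, C, D to A, B:
local old = {
    { ip = "10.0.0.1", port = 80 },  -- A
    { ip = "10.0.0.2", port = 80 },  -- B
    { ip = "10.0.0.3", port = 80 },  -- C
    { ip = "10.0.0.4", port = 80 },  -- D
}
local new = {
    { host = "10.0.0.1", port = 80 },  -- A
    { host = "10.0.0.2", port = 80 },  -- B
}

local stale = stale_targets(old, new)
-- stale holds only C and D; A and B are untouched, so their
-- accumulated health history in shm would survive the update.
```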