wklken opened a new issue, #12436:
URL: https://github.com/apache/apisix/issues/12436
### Current Behavior
Under some conditions, when the IP of a domain changes, APISIX keeps using
the old IP, which causes 504 Gateway Timeout errors.
It never recovers until `apisix reload` is run.
At the same time, the `dig` and `nslookup` commands return the new IP.
### Expected Behavior
APISIX should detect that the IP has changed.
### Error Logs
```
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65:
parse_domain_for_nodes(): parse_domain_for_nodes:
[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],
client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything
HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69:
parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client:
10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1",
host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84:
parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client:
10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1",
host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:213:
parse_domain_in_route(): parse_domain_in_route |
new_nodes=[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],
client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything
HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:219:
parse_domain_in_route(): parse_domain_in_route |
up_conf:{"timeout":{"send":30,"connect":30,"read":30},"hash_on":"vars","type":"roundrobin","parent":{"update_count":0,"modifiedIndex":5360,"orig_modifiedIndex":5360,"clean_handlers":{},"createdIndex":5360,"has_domain":true,"key":"/bk-gateway-apisix/routes/apigw.prod.2347","value":{"timeout":{"send":30,"connect":30,"read":30},"desc":"Returns
anything passed in request
data.","name":"apigw-prod-anything-get","labels":{"gateway.bk.tencent.com/stage":"prod","gateway.bk.tencent.com/gateway":"apigw"},"update_time":1752566944,"plugins":{"bk-proxy-rewrite":{"match_subpath":false,"uri":"/anything","subpath_param_name":":ext","method":"GET","use_real_request_uri_unsafe":false},"bk-resource-context":{"bk_resource_name":"anything_get","bk_resource_id":2347,"bk_resource_auth":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,
"verified_app_required":false},"bk_resource_auth_obj":{"verified_user_required":false,"resource_perm_required":false,"skip_user_verification":false,"verified_app_required":false}}},"uris":["/api/apigw/prod/anything","/api/apigw/prod/anything/"],"upstream":{"timeout":"table:
0x7f119b810dd0","hash_on":"vars","type":"roundrobin","parent":"table:
0x7f1199322a98","original_nodes":[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],"nodes":"table:
0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table:
0x7f11693587e0"},"status":1,"id":"apigw.prod.2347","service_id":"apigw.prod.stage-4","priority":0,"methods":["GET"],"create_time":1752566944}},"original_nodes":"table:
0x7f11693587e0","nodes":"table:
0x7f11693587e0","pass_host":"node","scheme":"http","nodes_ref":"table:
0x7f11693587e0"}, client: 10.244.2.240, server: _, request: "GET
/api/apigw/prod/anything HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:221:
parse_domain_in_route(): parse_domain_in_route | compare result:true, client:
10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1",
host: "bkapi.paasv3-dev.woa.com"
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] init.lua:223:
parse_domain_in_route(): parse_domain_in_route | no change, use old route,
client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything
HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
```
### Steps to Reproduce
1. Add a route whose `route.upstream.nodes[0].host` is `httpbin`, a Service
in k8s, so that the route proxies to the httpbin service.
```
$ curl -H "X-API-KEY: $admin_key" \
    http://127.0.0.1:9180/apisix/admin/routes/apigw.prod.2347 | jq
{
  "key": "/bk-gateway-apisix/routes/apigw.prod.2347",
  "modifiedIndex": 5360,
  "createdIndex": 5360,
  "value": {
    "timeout": {
      "send": 30,
      "connect": 30,
      "read": 30
    },
    "desc": "Returns anything passed in request data.",
    "name": "apigw-prod-anything-get",
    "update_time": 1752566944,
    "plugins": {
      "proxy-rewrite": {
        "method": "GET",
        "uri": "/anything"
      }
    },
    "create_time": 1752566944,
    "upstream": {
      "timeout": {
        "send": 30,
        "connect": 30,
        "read": 30
      },
      "nodes": [
        {
          "weight": 100,
          "priority": 1,
          "port": 80,
          "host": "httpbin"
        }
      ],
      "pass_host": "node",
      "scheme": "http",
      "type": "roundrobin"
    },
    "labels": {
      "gateway.bk.tencent.com/stage": "prod",
      "gateway.bk.tencent.com/gateway": "apigw"
    },
    "id": "apigw.prod.2347",
    "service_id": "apigw.prod.stage-4",
    "status": 1,
    "methods": [
      "GET"
    ],
    "uris": [
      "/api/apigw/prod/anything",
      "/api/apigw/prod/anything/"
    ]
  }
}
```
Here, `route.upstream.nodes[0].host = httpbin`.
2. Add `core.log.error` calls for debugging.
apisix/init.lua
```lua
local function parse_domain_in_route(route)
    local nodes = route.value.upstream.nodes
    local new_nodes, err = upstream_util.parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_in_route | new_nodes=",
                   core.json.delay_encode(new_nodes, true))
    if not new_nodes then
        return nil, err
    end

    local up_conf = route.dns_value and route.dns_value.upstream
    core.log.error("parse_domain_in_route | up_conf:",
                   core.json.delay_encode(up_conf, true))
    local ok = upstream_util.compare_upstream_node(up_conf, new_nodes)
    core.log.error("parse_domain_in_route | compare result:", ok)
    if ok then
        core.log.error("parse_domain_in_route | no change, use old route")
        return route
    end

    -- don't modify the modifiedIndex to avoid plugin cache miss because of
    -- DNS resolve result has changed
    -- Here we copy the whole route instead of part of it,
    -- so that we can avoid going back from route.value to route during
    -- copying.
    route.dns_value = core.table.deepcopy(route).value
    route.dns_value.upstream.nodes = new_nodes
    core.log.info("parse route which contain domain: ",
                  core.json.delay_encode(route, true))
    return route
end
```
and
apisix/utils/upstream.lua
```lua
local function parse_domain_for_nodes(nodes)
    core.log.error("parse_domain_for_nodes: ",
                   core.json.delay_encode(nodes, true))
    local new_nodes = core.table.new(#nodes, 0)
    for _, node in ipairs(nodes) do
        local host = node.host
        core.log.error("parse_domain_for_nodes: host=", host)
        if not ipmatcher.parse_ipv4(host) and
                not ipmatcher.parse_ipv6(host) then
            local ip, err = core.resolver.parse_domain(host)
            if ip then
                local new_node = core.table.clone(node)
                new_node.host = ip
                new_node.domain = host
                core.table.insert(new_nodes, new_node)
            end

            if err then
                core.log.error("dns resolver domain: ", host, " error: ",
                               err)
            end
        else
            core.log.error("parse_domain_for_nodes: add the node back")
            core.table.insert(new_nodes, node)
        end
    end
    return new_nodes
end
_M.parse_domain_for_nodes = parse_domain_for_nodes
```
5. Run `apisix reload` and update routes in etcd to trigger
`config_etcd.lua:389: sync_data()`.
6. At the same time, delete the httpbin service and `kubectl apply` it again
(the cluster IP changes). This is not 100% reproducible.
7. curl the route.
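
The `compare result:true` line in the logs above is what makes `parse_domain_in_route` return the old route. A minimal sketch of that kind of node-list comparison follows, assuming nodes are compared position by position on `host`, `port`, `weight` and `priority`. The helper name `same_nodes` and the second IP are hypothetical; this is not the actual `compare_upstream_node` implementation.

```lua
-- Hypothetical helper, NOT APISIX's compare_upstream_node.
-- Returns true when both node lists describe the same endpoints in order.
local function same_nodes(old_nodes, new_nodes)
    if #old_nodes ~= #new_nodes then
        return false
    end
    for i, old in ipairs(old_nodes) do
        local new = new_nodes[i]
        if old.host ~= new.host or old.port ~= new.port
                or old.weight ~= new.weight
                or old.priority ~= new.priority then
            return false
        end
    end
    return true
end

-- Node as it appears in the error log.
local cached = { { host = "10.105.226.135", port = 80, weight = 100, priority = 1 } }
-- Same node returned unchanged because its host is already an IP.
local unchanged = { { host = "10.105.226.135", port = 80, weight = 100, priority = 1 } }
-- What a fresh DNS lookup might return (made-up IP).
local re_resolved = { { host = "10.105.99.1", port = 80, weight = 100, priority = 1 } }

print(same_nodes(cached, unchanged))    -- true: "no change, use old route"
print(same_nodes(cached, re_resolved))  -- false: the route would be refreshed
```

Under this assumption, a comparison can only report a change if the new node list actually comes from a fresh DNS lookup.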
-----
According to the error.log:
1. The first argument of `parse_domain_for_nodes` is
`[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}]`,
so the host is already an IP here.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:65:
parse_domain_for_nodes(): parse_domain_for_nodes:
[{"weight":100,"host":"10.105.226.135","domain":"httpbin","priority":1,"upstream_host":"httpbin","port":80}],
client: 10.244.2.240, server: _, request: "GET /api/apigw/prod/anything
HTTP/1.1", host: "bkapi.paasv3-dev.woa.com"
2. Because the host is not a domain, `core.resolver.parse_domain(host)`
is never called.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:69:
parse_domain_for_nodes(): parse_domain_for_nodes: host=10.105.226.135, client:
10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1",
host: "bkapi.paasv3-dev.woa.com"
3. The node is then added back unchanged.
2025/07/16 09:41:20 [error] 6290#6290: *554164 [lua] upstream.lua:84:
parse_domain_for_nodes(): parse_domain_for_nodes: add the node back, client:
10.244.2.240, server: _, request: "GET /api/apigw/prod/anything HTTP/1.1",
host: "bkapi.paasv3-dev.woa.com"
------
So the worker never detects the IP change until `apisix reload` is run.
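
The failure mode described above can be simulated outside APISIX. The sketch below is self-contained and uses stand-ins: `dns` and `resolve` replace `core.resolver.parse_domain`, `is_ip` is a simplified IPv4-only replacement for the `ipmatcher` checks, and the new cluster IP is made up. `parse_with_domain` is a hypothetical defensive variant that re-resolves from the recorded `domain` field; it only illustrates one possible direction and is not a confirmed fix.

```lua
-- Stand-in for cluster DNS; nothing in this block is APISIX code.
local dns = { httpbin = "10.105.226.135" }

local function resolve(host)
    local ip = dns[host]
    if not ip then
        return nil, "failed to resolve " .. host
    end
    return ip
end

-- Simplified IPv4-only check (assumption; APISIX uses ipmatcher).
local function is_ip(host)
    return host:match("^%d+%.%d+%.%d+%.%d+$") ~= nil
end

-- Same control flow as parse_domain_for_nodes in this issue:
-- a node whose host is already an IP is added back untouched.
local function parse_as_reported(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        if not is_ip(node.host) then
            local ip = resolve(node.host)
            if ip then
                new_nodes[#new_nodes + 1] =
                    { host = ip, domain = node.host, port = node.port }
            end
        else
            new_nodes[#new_nodes + 1] = node
        end
    end
    return new_nodes
end

-- Hypothetical variant: prefer the recorded domain over the resolved host.
local function parse_with_domain(nodes)
    local new_nodes = {}
    for _, node in ipairs(nodes) do
        local name = node.domain or node.host
        if not is_ip(name) then
            local ip = resolve(name)
            if ip then
                new_nodes[#new_nodes + 1] =
                    { host = ip, domain = name, port = node.port }
            end
        else
            new_nodes[#new_nodes + 1] = node
        end
    end
    return new_nodes
end

-- First pass resolves the domain; the result carries host=IP, domain=httpbin.
local resolved = parse_as_reported({ { host = "httpbin", port = 80 } })

-- The Service is recreated and gets a new cluster IP (made-up value).
dns.httpbin = "10.105.99.1"

-- Feeding the already-resolved nodes back in, as seen in the logs.
local stale = parse_as_reported(resolved)
local fresh = parse_with_domain(resolved)
print(stale[1].host)  -- 10.105.226.135 (old IP, never re-resolved)
print(fresh[1].host)  -- 10.105.99.1
```

The open question remains how already-resolved nodes ended up in `route.value.upstream.nodes` in the first place; the variant above only shows that keeping the `domain` field makes re-resolution possible.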
### Environment
- APISIX version (run `apisix version`): 3.2.1
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl
http://127.0.0.1:9090/v1/server_info`):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):