yttdj opened a new issue, #13739: URL: https://github.com/apache/skywalking/issues/13739
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.

### Apache SkyWalking Component

OAP server (apache/skywalking)

### What happened

After upgrading OAP from 10.2.0 to 10.3.0, the TTL cleanup task no longer runs in Kubernetes cluster mode.

We are using multiple OAP replicas with the kubernetes cluster coordinator. After the upgrade, every OAP instance logs the following message and skips TTL deletion:

```text
The selected first getAddress is <pod-ip>:11800. The remove stage is skipped.
```

As a result, expired metrics are never removed from storage.

This issue does not happen on 10.2.0 with the same deployment topology and configuration.

From code analysis, `DataTTLKeeperTimer` itself did not change materially. The behavior change comes from the Kubernetes cluster coordinator in 10.3.0.

In 10.2.0, `queryRemoteNodes()` returned the real Pod IP list and marked one of them as `isSelf=true`, so sorting the addresses selected one stable leader node.

In 10.3.0, the Kubernetes coordinator now collects only remote Pod endpoints from Kubernetes, excludes the current Pod by UID, and then appends a synthetic self node as `127.0.0.1:<grpc-port>`.

Later, `DataTTLKeeperTimer` sorts the node list by `Address.toString()`, which is lexicographical string sorting, not IP-aware sorting. In common Kubernetes Pod CIDR ranges such as 10.x.x.x, all remote addresses sort before 127.0.0.1, so on every replica the first node is always a remote node and `isSelf()` is always false. Therefore every replica skips the TTL task.

This makes the bug reproducible in multi-replica Kubernetes deployments after upgrading to 10.3.0.

### What you expected to happen

Exactly one OAP instance should execute the TTL cleanup task, as in 10.2.0.

### How to reproduce

1. Deploy SkyWalking OAP 10.2.0 in Kubernetes with the kubernetes cluster coordinator and at least 2 OAP replicas.
2. Use a Pod network with 10.x.x.x addresses.
3. Verify TTL cleanup works normally.
4. Upgrade the same deployment to 10.3.0.
5. Wait for the TTL scheduled task to run.
6. Observe that every OAP instance logs: `The selected first getAddress is <pod-ip>:11800. The remove stage is skipped.`
7. Observe that expired metrics are not deleted.

This is also reproducible with 3 replicas or more.

### Anything else

Relevant code paths:

- `DataTTLKeeperTimer#delete`
- `KubernetesCoordinator#queryRemoteNodes`
- `KubernetesCoordinator#start`
- `KubernetesLabelSelectorEndpointGroup#updateEndpoints`
- `Address#compareTo`

Relevant behavior difference:

- 10.2.0: self node uses the real Pod IP and is part of the same sorted address set.
- 10.3.0: self node is appended as 127.0.0.1, which changes leader selection semantics.

A possible fix would be to preserve the real Pod IP for the self node in the Kubernetes coordinator, instead of using 127.0.0.1.

Related historical issues with similar symptoms:

- #6804
- #6828

Potential regression source:

- PR #13493

### Are you willing to submit a pull request to fix on your own?

- [ ] Yes I am willing to submit a pull request on my own!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
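### Illustrative sketch of the leader selection

The leader-selection behavior described under "What happened" can be modeled with a small standalone sketch. This is not the SkyWalking source: `TtlLeaderSketch`, `Node`, `runsTtl`, and the view-building helpers are hypothetical names, the Pod addresses are made up, and the `host:port` string form is an assumption based on the log line above. It only shows how a plain string sort combined with a `127.0.0.1` self entry can leave no replica selecting itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TtlLeaderSketch {
    /** Hypothetical stand-in for a cluster node: an address string plus an isSelf flag. */
    static final class Node {
        final String address;
        final boolean self;

        Node(String address, boolean self) {
            this.address = address;
            this.self = self;
        }
    }

    /** Models the check described in the issue: sort by address string, run TTL only if the first node is self. */
    public static boolean runsTtl(List<Node> view) {
        List<Node> sorted = new ArrayList<>(view);
        sorted.sort(Comparator.comparing((Node n) -> n.address)); // lexicographic, not IP-aware
        return sorted.get(0).self;
    }

    /** 10.2.0-style view (as described in the issue): every Pod by real IP, one marked as self. */
    public static List<Node> viewWithRealSelf(List<String> pods, String self) {
        List<Node> view = new ArrayList<>();
        for (String pod : pods) {
            view.add(new Node(pod, pod.equals(self)));
        }
        return view;
    }

    /** 10.3.0-style view (as described in the issue): remote Pods only, self appended as 127.0.0.1. */
    public static List<Node> viewWithLoopbackSelf(List<String> pods, String self, int port) {
        List<Node> view = new ArrayList<>();
        for (String pod : pods) {
            if (!pod.equals(self)) {
                view.add(new Node(pod, false));
            }
        }
        view.add(new Node("127.0.0.1:" + port, true));
        return view;
    }

    /** Counts how many replicas would decide to run the TTL task, each using its own view. */
    public static int countLeaders(List<String> pods, boolean loopbackSelf) {
        int leaders = 0;
        for (String self : pods) {
            List<Node> view = loopbackSelf
                ? viewWithLoopbackSelf(pods, self, 11800)
                : viewWithRealSelf(pods, self);
            if (runsTtl(view)) {
                leaders++;
            }
        }
        return leaders;
    }

    public static void main(String[] args) {
        // Made-up Pod addresses in a 10.x.x.x range.
        List<String> pods = List.of("10.42.0.7:11800", "10.42.1.3:11800", "10.42.2.9:11800");
        System.out.println("real self IP:  " + countLeaders(pods, false) + " leader(s)"); // prints 1
        System.out.println("loopback self: " + countLeaders(pods, true) + " leader(s)");  // prints 0
    }
}
```

In this model the string `"10.…"` always compares lower than `"127.0.0.1…"` because `'0' < '2'` at the second character, so no replica ever sees itself first. The same model also suggests that Pod addresses sorting after `127.0.0.1` (for example `172.…`) would make every replica select itself, which is consistent with the suggestion to keep the real Pod IP for the self node.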
