yttdj opened a new issue, #13739:
URL: https://github.com/apache/skywalking/issues/13739

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Apache SkyWalking Component
   
   OAP server (apache/skywalking)
   
   ### What happened
   
   After upgrading OAP from 10.2.0 to 10.3.0, the TTL cleanup task no longer 
runs in Kubernetes cluster mode.
   
   We are using multiple OAP replicas with the kubernetes cluster coordinator. 
After the upgrade, every OAP instance logs the following message and skips TTL 
deletion:
   
   ```text
   The selected first getAddress is <pod-ip>:11800. The remove stage is skipped.
   ```
   
   As a result, expired metrics are never removed from storage.
   
   This issue does not happen on 10.2.0 with the same deployment topology and 
configuration.
   
   From code analysis, DataTTLKeeperTimer itself did not change materially. The 
behavior change comes from the Kubernetes cluster coordinator in 10.3.0.
   
   In 10.2.0, queryRemoteNodes() returned the real Pod IP list and marked one 
of them as isSelf=true, so every replica sorted the same address set and agreed 
on one stable leader node.
   
   In 10.3.0, the Kubernetes coordinator now collects only remote Pod endpoints 
from Kubernetes, excludes the current Pod by UID, and then appends a synthetic 
self node as 127.0.0.1:<grpc-port>.
   
   Later, DataTTLKeeperTimer sorts the node list by Address.toString(), which 
is lexicographical string sorting, not IP-aware sorting.
   
   In common Kubernetes Pod CIDR ranges such as 10.x.x.x, all remote addresses 
sort before 127.0.0.1, because the strings first differ at the second character 
and "0" sorts before "2". On every replica the first node is therefore a remote 
node and isSelf() is false, so every replica skips the TTL task.
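
The ordering problem described above can be reproduced outside SkyWalking with a small model. This is only a sketch under stated assumptions: `Node`, `viewWithLoopbackSelf`, and `runsTtl` are hypothetical names, not the SkyWalking classes, and the Pod IPs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TtlLeaderSketch {
    // Hypothetical stand-in for a cluster node; NOT the SkyWalking Address class.
    static final class Node {
        final String host;
        final int port;
        final boolean self;

        Node(String host, int port, boolean self) {
            this.host = host;
            this.port = port;
            this.self = self;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    // Builds the node list as one replica would see it under the 10.3.0
    // behavior described in this issue: remote Pod IPs only, plus a
    // synthetic 127.0.0.1 self node appended at the end.
    static List<Node> viewWithLoopbackSelf(List<String> podIps, String ownIp, int port) {
        List<Node> nodes = new ArrayList<>();
        for (String ip : podIps) {
            if (!ip.equals(ownIp)) {
                nodes.add(new Node(ip, port, false));
            }
        }
        nodes.add(new Node("127.0.0.1", port, true));
        return nodes;
    }

    // Sorts by the string form (lexicographic, not IP-aware) and reports
    // whether the first node is this replica, i.e. whether it would run TTL.
    static boolean runsTtl(List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparing(Node::toString));
        return sorted.get(0).self;
    }

    public static void main(String[] args) {
        // Assumed example Pod IPs in a 10.x.x.x range.
        List<String> podIps = Arrays.asList("10.244.1.7", "10.244.2.9", "10.244.3.4");
        for (String ip : podIps) {
            boolean runs = runsTtl(viewWithLoopbackSelf(podIps, ip, 11800));
            System.out.println(ip + " runs TTL cleanup: " + runs);
        }
    }
}
```

In this model every replica reports `false`, which matches the "remove stage is skipped" log appearing on all instances. A single-replica view contains only the loopback node, so that replica still selects itself.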
   
   This makes the bug reproducible in multi-replica Kubernetes deployments 
after upgrading to 10.3.0.
   
   ### What you expected to happen
   
   Exactly one OAP instance should execute the TTL cleanup task, as in 10.2.0.
   
   ### How to reproduce
   
   1. Deploy SkyWalking OAP 10.2.0 in Kubernetes with the kubernetes cluster 
coordinator and at least 2 OAP replicas.
   2. Use a Pod network with 10.x.x.x addresses.
   3. Verify TTL cleanup works normally.
   4. Upgrade the same deployment to 10.3.0.
   5. Wait for the TTL scheduled task to run.
   6. Observe that every OAP instance logs:
      `The selected first getAddress is <pod-ip>:11800. The remove stage is skipped.`
   7. Observe that expired metrics are not deleted.
   
   This is also reproducible with 3 or more replicas.
   
   ### Anything else
   
   Relevant code paths:
   - DataTTLKeeperTimer#delete
   - KubernetesCoordinator#queryRemoteNodes
   - KubernetesCoordinator#start
   - KubernetesLabelSelectorEndpointGroup#updateEndpoints
   - Address#compareTo
   
   Relevant behavior difference:
   - 10.2.0: self node uses the real Pod IP and is part of the same sorted 
address set.
   - 10.3.0: self node is appended as 127.0.0.1, which changes leader selection 
semantics.
   
   A possible fix would be to preserve the real Pod IP for the self node in the 
Kubernetes coordinator, instead of using 127.0.0.1.
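
To illustrate why keeping the real Pod IP for the self node would restore single-leader selection, here is a minimal sketch. It is not the actual patch: `leaderOf`, `runsTtl`, and `countRunners` are hypothetical helpers, and the Pod IPs are made up. It assumes every replica sees the same set of real addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RealIpSelfSketch {
    // Returns the address that sorts first when every node, including the
    // local one, is listed under its real Pod IP (assumed 10.2.0-like view).
    static String leaderOf(List<String> podIps, int port) {
        List<String> addresses = new ArrayList<>();
        for (String ip : podIps) {
            addresses.add(ip + ":" + port);
        }
        // Plain string ordering, as in the issue description.
        addresses.sort(Comparator.naturalOrder());
        return addresses.get(0);
    }

    // A replica runs the cleanup only if the first sorted address is its own.
    static boolean runsTtl(List<String> podIps, String ownIp, int port) {
        return leaderOf(podIps, port).equals(ownIp + ":" + port);
    }

    // Counts how many replicas would run the cleanup for a given Pod set.
    static int countRunners(List<String> podIps, int port) {
        int runners = 0;
        for (String ip : podIps) {
            if (runsTtl(podIps, ip, port)) {
                runners++;
            }
        }
        return runners;
    }

    public static void main(String[] args) {
        // Assumed example Pod IPs; order of discovery should not matter.
        List<String> podIps = Arrays.asList("10.244.2.9", "10.244.10.3", "10.244.1.7");
        System.out.println("leader: " + leaderOf(podIps, 11800));
        System.out.println("replicas running TTL: " + countRunners(podIps, 11800));
    }
}
```

The string ordering does not need to be numerically correct for this to work. It only needs to be identical on every replica, which holds once all replicas compare the same real addresses.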
   
   Related historical issues with similar symptoms:
   - #6804
   - #6828
   
   Potential regression source:
   - PR #13493
   
   ### Are you willing to submit a pull request to fix on your own?
   
   - [ ] Yes I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

