[ 
https://issues.apache.org/jira/browse/FLINK-39086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Pereira updated FLINK-39086:
--------------------------------------
    Description: 
Job submission _via Flink CLI within  the cluster_ fails in Kubernetes HA mode 
because BlobServer returns short pod hostnames (e.g., 
{{{}flink-74b96b4f8-xc49r{}}}) instead of resolvable addresses. 
 * Job submission from inside the cluster fails with {{UnknownHostException}}
 * Affects FlinkDeployment CRs with HA enabled
 * Breaks CI/CD pipelines running in-cluster

h3. Suspected cause

FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2. 
The new {{getBlobServerAddress()}} method [2] returns the pod's bind address 
(hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable 
without a headless service [3]. Flink 1.20 returned IP addresses which were 
directly resolvable.

The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses 
{{gateway.getBlobServerAddress()}} instead of {{{}gateway.getHostname(){}}}.
h3. Reproduction

1/ Deploy Flink 2.2.0 with HA in Kubernetes:
 
{code:java}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-example
spec:
  image: flink:2.2.0
  flinkVersion: v2_2
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha
  jobManager:
    resource:
      memory: "2048m"
  taskManager:
    resource:
      memory: "2048m"{code}
2/ Submit job from inside JobManager pod:
{code:java}
kubectl exec -n <namespace> <jobmanager-pod> -- \
  flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
 {*}Result{*}: Fails with UnknownHostException: flink-74b96b4f8-xc49r
{code:java}
Caused by: java.io.IOException: Could not connect to BlobServer at address 
flink-74b96b4f8-xc49r/<unresolved>:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103)
    at 
org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r{code}
The address flink-74b96b4f8-xc49r is a short pod hostname, not resolvable 
without FQDN suffix or additional Kubernetes configuration.
h3. Tested Workaround

Add a headless service. Example:
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: flink-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: flink
    component: jobmanager
  ports:
  - name: blob
    port: 6124
  - name: rpc
    port: 6123 {code}
 * resolves the DNS issue and allows job submission to work
 * Does not negatively impact deployments without HA enabled
 * Headless services are a standard Kubernetes pattern for enabling pod 
hostname resolution [5][6]

h3. Follow-up / questions
 # Is returning pod short hostnames an intended behavior?
 # Are there side effects or edge cases with the headless service approach that 
need consideration?
 # Are there other solutions besides headless service?

If headless service is an acceptable solution:
 * Should it be documented as a requirement for HA in Kubernetes (and where)?
 * Should it be created automatically by the Flink Kubernetes Operator?

h2. References

[1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109

[2] Commit introducing the change - 
[https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]

[3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution - 
[https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]

[4] Affected code - 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]

[5] Kubernetes Documentation - Headless Services - 
[https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]

[6] StatefulSets and Headless Services - 
[https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]

 

  was:
Job submission _via Flink CLI within  the cluster_ fails in Kubernetes HA mode 
because BlobServer returns short pod hostnames (e.g., 
{{{}flink-74b96b4f8-xc49r{}}}) instead of resolvable addresses. 
 * Job submission from inside the cluster fails with {{UnknownHostException}}
 * Affects FlinkDeployment CRs with HA enabled
 * Breaks CI/CD pipelines running in-cluster

h3. Suspected cause

FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2. 
The new {{getBlobServerAddress()}} method [2] returns the pod's bind address 
(hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable 
without a headless service [3]. Flink 1.20 returned IP addresses which were 
directly resolvable.

The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses 
{{gateway.getBlobServerAddress()}} instead of {{{}gateway.getHostname(){}}}.
h3. Reproduction

1/ Deploy Flink 2.2.0 with HA in Kubernetes:
 
{code:java}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-example
spec:
  image: flink:2.2.0
  flinkVersion: v2_2
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha
  jobManager:
    resource:
      memory: "2048m"
  taskManager:
    resource:
      memory: "2048m"{code}
2/ Submit job from inside JobManager pod:
{code:sh}
kubectl exec -n <namespace> <jobmanager-pod> -- \
  flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
 {*}Result{*}: Fails with UnknownHostException: flink-74b96b4f8-xc49r
{code:java}
Caused by: java.io.IOException: Could not connect to BlobServer at address 
flink-74b96b4f8-xc49r/<unresolved>:6124 at 
org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103) at 
org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
 Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r  {code}
The address flink-74b96b4f8-xc49r is a short pod hostname, not resolvable 
without FQDN suffix or additional Kubernetes configuration.
h3. Tested Workaround

Add a headless service. Example:
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: flink-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: flink
    component: jobmanager
  ports:
  - name: blob
    port: 6124
  - name: rpc
    port: 6123 {code}
 * resolves the DNS issue and allows job submission to work
 * Does not negatively impact deployments without HA enabled
 * Headless services are a standard Kubernetes pattern for enabling pod 
hostname resolution [5][6]

h3. Follow-up / questions
 # Is returning pod short hostnames an intended behavior?
 # Are there side effects or edge cases with the headless service approach that 
need consideration?
 # Are there other solutions besides headless service?

If headless service is an acceptable solution:
 * Should it be documented as a requirement for HA in Kubernetes (and where)?
 * Should it be created automatically by the Flink Kubernetes Operator?

h2. References

[1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109

[2] Commit introducing the change - 
[https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]

[3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution - 
[https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]

[4] Affected code - 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]

[5] Kubernetes Documentation - Headless Services - 
[https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]

[6] StatefulSets and Headless Services - 
[https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]

 


> BlobServer returns unresolvable pod hostname in Kubernetes HA mode, breaking 
> job submission (Regression from 1.20)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39086
>                 URL: https://issues.apache.org/jira/browse/FLINK-39086
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Runtime / Coordination
>    Affects Versions: 2.2.0
>         Environment: * {*}Flink Version{*}: 2.2.0
>  * {*}Flink Operator Version{*}: 1.13.0
>  * {*}Deployment Mode{*}: Native Kubernetes with HA enabled
>            Reporter: Sebastien Pereira
>            Priority: Major
>              Labels: high-availability, kubernetes, regression
>
> Job submission _via Flink CLI within  the cluster_ fails in Kubernetes HA 
> mode because BlobServer returns short pod hostnames (e.g., 
> {{{}flink-74b96b4f8-xc49r{}}}) instead of resolvable addresses. 
>  * Job submission from inside the cluster fails with {{UnknownHostException}}
>  * Affects FlinkDeployment CRs with HA enabled
>  * Breaks CI/CD pipelines running in-cluster
> h3. Suspected cause
> FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2. 
> The new {{getBlobServerAddress()}} method [2] returns the pod's bind address 
> (hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable 
> without a headless service [3]. Flink 1.20 returned IP addresses which were 
> directly resolvable.
> The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses 
> {{gateway.getBlobServerAddress()}} instead of {{{}gateway.getHostname(){}}}.
> h3. Reproduction
> 1/ Deploy Flink 2.2.0 with HA in Kubernetes:
>  
> {code:java}
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: flink-example
> spec:
>   image: flink:2.2.0
>   flinkVersion: v2_2
>   flinkConfiguration:
>     high-availability.type: kubernetes
>     high-availability.storageDir: file:///opt/flink/ha
>   jobManager:
>     resource:
>       memory: "2048m"
>   taskManager:
>     resource:
>       memory: "2048m"{code}
> 2/ Submit job from inside JobManager pod:
> {code:java}
> kubectl exec -n <namespace> <jobmanager-pod> -- \
>   flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
>  {*}Result{*}: Fails with UnknownHostException: flink-74b96b4f8-xc49r
> {code:java}
> Caused by: java.io.IOException: Could not connect to BlobServer at address 
> flink-74b96b4f8-xc49r/<unresolved>:6124
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103)
>     at 
> org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
> Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r{code}
> The address flink-74b96b4f8-xc49r is a short pod hostname, not resolvable 
> without FQDN suffix or additional Kubernetes configuration.
> h3. Tested Workaround
> Add a headless service. Example:
> {code:java}
> apiVersion: v1
> kind: Service
> metadata:
>   name: flink-headless
> spec:
>   clusterIP: None
>   publishNotReadyAddresses: true
>   selector:
>     app: flink
>     component: jobmanager
>   ports:
>   - name: blob
>     port: 6124
>   - name: rpc
>     port: 6123 {code}
>  * resolves the DNS issue and allows job submission to work
>  * Does not negatively impact deployments without HA enabled
>  * Headless services are a standard Kubernetes pattern for enabling pod 
> hostname resolution [5][6]
> h3. Follow-up / questions
>  # Is returning pod short hostnames an intended behavior?
>  # Are there side effects or edge cases with the headless service approach 
> that need consideration?
>  # Are there other solutions besides headless service?
> If headless service is an acceptable solution:
>  * Should it be documented as a requirement for HA in Kubernetes (and where)?
>  * Should it be created automatically by the Flink Kubernetes Operator?
> h2. References
> [1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109
> [2] Commit introducing the change - 
> [https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]
> [3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution - 
> [https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]
> [4] Affected code - 
> [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]
> [5] Kubernetes Documentation - Headless Services - 
> [https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]
> [6] StatefulSets and Headless Services - 
> [https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to