[
https://issues.apache.org/jira/browse/FLINK-39086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastien Pereira updated FLINK-39086:
--------------------------------------
Description:
Job submission _via Flink CLI within the cluster_ fails in Kubernetes HA mode
because BlobServer returns short pod hostnames (e.g.,
{{flink-74b96b4f8-xc49r}}) instead of resolvable addresses.
* Job submission from inside the cluster fails with {{UnknownHostException}}
* Affects FlinkDeployment CRs with HA enabled
* Breaks CI/CD pipelines running in-cluster
h3. Suspected cause
FLINK-38109 [1] changed how BlobServer address resolution works in Flink 2.2.
The new {{getBlobServerAddress()}} method [2] returns the pod's bind address
(hostname), but in Kubernetes, short pod hostnames aren't DNS-resolvable
without a headless service [3]. Flink 1.20 returned IP addresses, which clients
can connect to without any DNS lookup.
The change [4] in {{JobSubmitHandler.java}} (lines 192-200) now uses
{{gateway.getBlobServerAddress()}} instead of {{gateway.getHostname()}}.
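The difference between the two kinds of address can be illustrated outside Flink. The following is a minimal sketch, assuming the address handed to the client ends up in a plain {{java.net.InetSocketAddress}} (the {{<unresolved>}} marker in the stack trace under Reproduction suggests this, but it is an assumption, not confirmed from the Flink source). The class name {{BlobAddressCheck}} and the IP {{10.244.1.17}} are hypothetical.
{code:java}
import java.net.InetSocketAddress;

public class BlobAddressCheck {

    /** Returns true if host:port can be turned into a connectable socket address. */
    public static boolean isResolvable(String host, int port) {
        // The constructor attempts resolution eagerly. If the lookup fails,
        // it keeps the hostname and flags the address as unresolved.
        InetSocketAddress address = new InetSocketAddress(host, port);
        return !address.isUnresolved();
    }

    public static void main(String[] args) {
        // Hypothetical inputs: the short pod hostname from this report
        // and a made-up pod IP. 6124 is the blob port seen in the stack trace.
        String[] candidates = {"flink-74b96b4f8-xc49r", "10.244.1.17"};
        for (String host : candidates) {
            System.out.println(host + " resolvable: " + isResolvable(host, 6124));
        }
    }
}
{code}
An IP literal never needs a DNS lookup, so it always yields a resolved address. The result for the short hostname depends on the DNS configuration of the pod where the check runs.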
h3. Reproduction
1/ Deploy Flink 2.2.0 with HA in Kubernetes:
{code:java}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-example
spec:
  image: flink:2.2.0
  flinkVersion: v2_2
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha
  jobManager:
    resource:
      memory: "2048m"
  taskManager:
    resource:
      memory: "2048m"{code}
2/ Submit job from inside JobManager pod:
{code:java}
kubectl exec -n <namespace> <jobmanager-pod> -- \
flink run /opt/flink/examples/streaming/StateMachineExample.jar{code}
*Result*: Fails with {{UnknownHostException: flink-74b96b4f8-xc49r}}
{code:java}
Caused by: java.io.IOException: Could not connect to BlobServer at address flink-74b96b4f8-xc49r/<unresolved>:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:103)
    at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$uploadJobGraphFiles$4(JobSubmitHandler.java:199)
Caused by: java.net.UnknownHostException: flink-74b96b4f8-xc49r{code}
The address {{flink-74b96b4f8-xc49r}} is a short pod hostname, which is not
resolvable without an FQDN suffix or additional Kubernetes configuration.
h3. Tested Workaround
Add a headless service. Example:
{code:java}
apiVersion: v1
kind: Service
metadata:
  name: flink-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: flink
    component: jobmanager
  ports:
    - name: blob
      port: 6124
    - name: rpc
      port: 6123{code}
* Resolves the DNS issue and allows job submission to work
* Does not negatively impact deployments without HA enabled
* Headless services are a standard Kubernetes pattern for enabling pod
hostname resolution [5][6]
h3. Follow-up / questions
# Is returning short pod hostnames the intended behavior?
# Are there side effects or edge cases with the headless service approach that
need consideration?
# Are there other solutions besides a headless service?
If a headless service is an acceptable solution:
* Should it be documented as a requirement for HA in Kubernetes (and where)?
* Should it be created automatically by the Flink Kubernetes Operator?
h2. References
[1] FLINK-38109 - https://issues.apache.org/jira/browse/FLINK-38109
[2] Commit introducing the change -
[https://github.com/apache/flink/commit/deee02b665c01d56d43da4df2eadfbbef333ec3c]
[3] Kubernetes DNS for Services and Pods - Pod hostname DNS resolution -
[https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-hostname-and-subdomain-field]
[4] Affected code -
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobSubmitHandler.java#L192-L200]
[5] Kubernetes Documentation - Headless Services -
[https://kubernetes.io/docs/concepts/services-networking/service/#headless-services]
[6] StatefulSets and Headless Services -
[https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id]
> BlobServer returns unresolvable pod hostname in Kubernetes HA mode, breaking
> job submission (Regression from 1.20)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39086
> URL: https://issues.apache.org/jira/browse/FLINK-39086
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 2.2.0
> Environment: * {*}Flink Version{*}: 2.2.0
> * {*}Flink Operator Version{*}: 1.13.0
> * {*}Deployment Mode{*}: Native Kubernetes with HA enabled
> Reporter: Sebastien Pereira
> Priority: Major
> Labels: high-availability, kubernetes, regression
>
>