Draft PR)

Sergio Chong Loo Mon, 30 Mar 2026 22:49:26 -0700

Hi @Ryan/@Daniel

OK it’s up! I’ve updated the Draft PR and added how to configure the Gate 
auto-injection to the description 
(https://github.com/apache/flink-kubernetes-operator/pull/1043).


TL;DR
When transitionMode: ADVANCED is used and bluegreen.gate.strategy is set in 
flinkConfiguration, the operator automatically:
    • Injects the -javaagent flag into the JobManager JVM options
    • Adds an init container to the JobManager pod that copies the agent JAR 
from the operator image into a shared volume, making it available at 
/opt/flink/lib/bluegreen-agent.jar
    • Sets bluegreen.gate.injection.enabled: "true" to activate injection at 
runtime

The only user-facing configuration needed is:
flinkConfiguration:
    bluegreen.gate.strategy: "WATERMARK"
    bluegreen.gate.watermark.extractor-class: 
"com.example.MyWatermarkExtractor” ...
Give it a go and please let me know what you think.

In the meantime I’m working on the other comments you also left

Thanks!
Sergio


> On Mar 26, 2026, at 8:11 AM, Ryan van Huuksloot via dev 
> <[email protected]> wrote:
> 
> Awesome! Thanks for the update, Sergio. I'm excited to see the plan - it is
> a cool idea so I'm glad it is working.
> 
> Ryan van Huuksloot
> Staff Engineer, Infrastructure | Streaming Platform
> [image: Shopify]
> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
> 
> 
> On Thu, Mar 26, 2026 at 1:46 AM Sergio Chong Loo <[email protected]>
> wrote:
> 
>> @Ryan / @Daniel,
>> 
>> Good news! The “GateInjectorPipelineExecutor” idea is successful!!
>> 
>> While the original approach of simply activating it with
>> “execution.target” did not quite work, I was able to implement it via the 
>> *Instrumentation
>> API with a Java Agent* that injects it… the user doesn’t have to touch
>> their pipelines and I added 2 options, at least for now, to place/inject
>> the Gate after the source or before the sink (complex DAG cases with
>> multiple sources or sinks for now are not supported).
>> 
>> I’m documenting everything and prepping the Draft PR for your review,
>> probably a couple more days.
>> 
>> Thanks, stay tuned.
>> 
>> - Sergio
>> 
>> 
>> On Mar 16, 2026, at 3:30 PM, Sergio Chong Loo <[email protected]> wrote:
>> 
>> Thanks for the ideas and the offer to help out Ryan! It’s invaluable to
>> learn about how other users/teams scenarios.
>> 
>> Indeed I have to pursue and evaluate the GateInjectorExecutor nonetheless
>> for our internal development. Ideally it’d be great if the user can simply
>> “invoke” the functionality, even give the user an option to specify “where"
>> the gating mechanism to be placed (e.g. right after the source or before
>> sink), or for the most flexibility they can incorporate and place the Gate
>> manually just like it is now.
>> 
>> I’ll share the progress asap and we can all take it from there (this
>> should not exceed a couple weeks). I’ll definitely need more of your
>> feedback to verify this with Flink SQL.
>> 
>> Thanks again,
>> Sergio
>> 
>> 
>> On Mar 16, 2026, at 7:01 AM, Ryan van Huuksloot <
>> [email protected]> wrote:
>> 
>> Hi Sergio,
>> 
>> re: 1.1
>> My thought is that a BlueGreen Mixin isn't Kubernetes specific and could
>> be reused by other deployment control planes. However, I do agree that
>> attaching it to the sink has other implications so I am happy to pivot if
>> we can find an alternative solution.
>> 
>> re: 2
>> I'm happy to leave it out of the Phase 2 implementation, but I think it
>> should be possible. For example we use Phase 1 with cross cluster
>> migrations today. Phase 2 within a single cluster isn't particularly useful
>> for us.
>> 
>> re: GateInjectorExecutor
>> This sounds like a neat idea. I need to read more about how it would work
>> but from a high level, injecting an operator before your sinks sounds like
>> a good idea. Better isolation, possible with SQL, no mixins, etc.
>> 
>> I will mention that part of the reason I want it before the sinks is
>> because nine out of ten people building pipelines struggle to understand
>> where their state is and how Phase 2 would affect the correctness of their
>> state depending on where they put the gate. I understand that if you have a
>> remote lookup and want to save bandwidth, you could optimize your pipeline
>> by moving the gate before the remote call; however, that seems like an
>> optimization that can be made later.
>> 
>> Thanks for driving this! Let me know how we can help.
>> 
>> Ryan van Huuksloot
>> Staff Engineer, Infrastructure | Streaming Platform
>> [image: Shopify]
>> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
>> 
>> 
>> On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong Loo <[email protected]>
>> wrote:
>> 
>>> Hi Ryan
>>> 
>>> Thanks a lot for these details. For sure some of these observations
>>> popped up during our initial discussions, and that’s why our initial goal
>>> was to introduce this as simple as possible and gradually enhance it to
>>> cover gaps.
>>> 
>>> Allow me to address your concerns:
>>> 
>>>   1. I’m happy you stressed the point of “disruption to existing
>>>   pipelines”. However, there’s a few points about attempting to build this
>>>   functionality into the sinks (or sources) right off the bat (read further
>>>   below for my alternative):
>>>      1. Kubernetes centric: as of now the Blue/Green Deployments
>>>      support is a Kubernetes specific solution, adding a mixin directly
>>>      available to sinks would “leak” this support outside of K8s
>>>      2. A sink being aware of these deployment phases violates single
>>>      responsibility, but more importantly…
>>>      3. Flink currently has many connectors, with the majority being
>>>      maintained outside of the Flink code base, by separate teams, separate
>>>      repos, separate release cycles. This would complicate things 
>>> significantly
>>>      as to try and add support for this for every potential flink connector
>>>      project out there would be a cumbersome. Blue/Green Phase 2 then only 
>>> would
>>>      works with "gate-aware" sinks.
>>>   2. I’d leave the conversation about migrating jobs between K8s
>>>   clusters outside of this scope, even Phase 1 is meant to only work in a
>>>   single cluster…
>>>   3. Watermarking, excellent point, it’s indeed a requirement so I’ll
>>>   make sure this is validated where applicable (by the concrete
>>>   implementation)
>>> 
>>> 
>>> Having said what I said about point 1.1 above, I’m currently working on
>>> an approach which uses a “GateInjectorPipelineExecutor” so to speak; in
>>> other words a custom PipelineExecutor that would be shipped with the K8s
>>> Operator, invoked by Flink Configuration (via “execution.target:”). This
>>> custom piece would instantiate and inject the Gate at a fixed point in the
>>> StreamGraph right before job submission. I still have to validate and
>>> ensure a few things are correctly taken care of (like Type Information,
>>> etc.) but the theory looks promising.
>>> 
>>> For the most part this works well with Flink SQL (same configuration),
>>> here’s my estimation:
>>> 
>>> tEnv.executeSql("INSERT INTO my_sink ...")
>>>    └─> SQL planner → ExecNodeGraph → Transformation[]
>>>          └─> StreamGraph
>>>                └─> GateInjectorExecutor injects GateProcessFunction
>>>                      └─> StreamGraph' (mutated) → JobGraph
>>>                            └─> Submit Job
>>> 
>>> I’m aiming to share some updates along these lines in the next few weeks
>>> but hopefully this falls inline with your objectives/thoughts overall.
>>> 
>>> Sergio
>>> 
>>> 
>>> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev <
>>> [email protected]> wrote:
>>> 
>>> Hi Sergio,
>>> Thanks for starting this conversation.
>>> 
>>> A few thoughts regarding BlueGreen Phase 2:
>>> 1. The Gate Operator is interesting but I don't like that we would have to
>>> modify users' pipelines for them to use Phase 2. This gate function seems
>>> like it could be a Mixin that connectors would implement. If you want to
>>> use Phase 2, your sinks must implement this Mixin. I understand that a
>>> unique GateFunction has pros, but it works less well with FlinkSQL - and
>>> the trade-off doesn't seem worthwhile.
>>> 2. Regarding the ConfigMap. We should consider a solution that supports
>>> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is
>>> only
>>> useful for in cluster operations.
>>> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator
>>> validate that the pipeline is using watermarks?
>>> 
>>> What happens when idleness is configured? Watermarks will get ignored from
>>> 
>>> these “slow” subtasks and advance, could records from the ignored subtasks
>>> eventually be lost?
>>> Yes they would be lost, but that would happen irrespective of Phase 2.
>>> 
>>> I'll have more thoughts after we discuss the Gate Operator, as that is
>>> crucial to the FLIP right now.
>>> 
>>> Ryan van Huuksloot
>>> Staff Engineer, Infrastructure | Streaming Platform
>>> [image: Shopify]
>>> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
>>> 
>>> 
>>> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected]>
>>> wrote:
>>> 
>>> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after
>>> making some code adjustments.
>>> 
>>> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot <
>>> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since
>>> I know you’re interested in this feature.
>>> 
>>> Thanks,
>>> - Sergio
>>> 
>>> 
>>> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected]>
>>> 
>>> wrote:
>>> 
>>> 
>>> Hi folks,
>>> 
>>> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment
>>> 
>>> functionality to the Flink K8s Operator. It was very straightforward,
>>> simply transitioning to the second deployment once it's considered stable.
>>> 
>>> 
>>> FLIP-504 is an Advanced version added on top of 503 and brings about the
>>> 
>>> notion of "record-level" coordination between the 2 deployments to have no
>>> data duplication and exactly once semantics while preserving a smooth
>>> transition.
>>> 
>>> 
>>> The main goals are:
>>>   • For the community to take a quick look at the current
>>> 
>>> functionality (previously mentioned at the Flink Forward 2025 Conference)
>>> 
>>>   • To get feedback and improvement suggestions
>>> 
>>> Flip 504 details:
>>> 
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650
>>> 
>>> 
>>> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043
>>> 
>>> Thank you!
>>> - Sergio

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)

Reply via email to