Copilot commented on code in PR #13791:
URL: https://github.com/apache/skywalking/pull/13791#discussion_r3036195448
##########
oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-mcp-instance.yaml:
##########

@@ -0,0 +1,80 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway — MCP Instance-level (per-pod) metrics
+#
+# Same metrics as gateway-mcp-service.yaml but scoped to individual pods.
+# All durations are in seconds from the AI Gateway; multiply by 1000 for ms display.
+
+filter: "{ tags -> tags.job_name == 'envoy-ai-gateway' }"
+expSuffix: instance(['service_name'], ['service_instance_id'], Layer.ENVOY_AI_GATEWAY)
+metricPrefix: meter_envoy_ai_gw_mcp_instance
+
+metricsRules:
+  # ===================== Aggregate MCP metrics =====================
+
+  # MCP request CPM
+  - name: request_cpm
+    exp: mcp_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # MCP request latency average (ms)
+  - name: request_latency_avg
+    exp: mcp_request_duration_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / mcp_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # MCP request latency percentile (ms)
+  - name: request_latency_percentile
+    exp: mcp_request_duration.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # MCP method invocation CPM — labeled by mcp_method_name
+  - name: method_cpm
+    exp: mcp_method_count.sum(['mcp_method_name', 'service_name', 'service_instance_id']).increase('PT1M')
+
+  # MCP error CPM
+  - name: error_cpm
+    exp: mcp_method_count.tagEqual('status', 'error').sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # MCP initialization latency average (ms)
+  - name: initialization_latency_avg
+    exp: mcp_initialization_duration_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / mcp_initialization_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # MCP initialization latency percentile (ms)
+  - name: initialization_latency_percentile
+    exp: mcp_initialization_duration.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # MCP capabilities negotiated CPM — labeled by capability_type
+  - name: capabilities_cpm
+    exp: mcp_capabilities_negotiated.sum(['capability_type', 'service_name', 'service_instance_id']).increase('PT1M')
+
+  # ===================== Per-backend breakdown =====================
+
+  # Backend request CPM
+  - name: backend_request_cpm
+    exp: mcp_request_duration_count.sum(['mcp_backend', 'service_name', 'service_instance_id']).increase('PT1M')
+
+  # Backend request latency average (ms)
+  - name: backend_request_latency_avg
+    exp: mcp_request_duration_sum.sum(['mcp_backend', 'service_name', 'service_instance_id']).increase('PT1M') / mcp_request_duration_count.sum(['mcp_backend', 'service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # Backend method CPM
+  - name: backend_method_cpm
+    exp: mcp_method_count.sum(['mcp_backend', 'mcp_method_name', 'service_name', 'service_instance_id']).increase('PT1M')
+
+  # Backend error CPM
+  - name: backend_error_cpm
+    exp: mcp_method_count.tagEqual('status', 'error').sum(['service_name', 'service_instance_id']).increase('PT1M')

Review Comment:
   `backend_error_cpm` is intended to be a per-backend breakdown metric, but the expression currently sums only by `service_name` and `service_instance_id`, dropping the `mcp_backend` label. This makes it effectively identical to `error_cpm` and breaks the backend breakdown dashboard widget. Include `mcp_backend` in the `sum([...])` dimensions (consistent with the service-level rule).
   ```suggestion
    exp: mcp_method_count.tagEqual('status', 'error').sum(['mcp_backend', 'service_name', 'service_instance_id']).increase('PT1M')
   ```

##########
oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml:
##########

@@ -15,29 +15,33 @@
 # Envoy AI Gateway access log processing via OTLP.
 #
-# Sampling policy: only persist abnormal or expensive requests.
-# Normal 200 responses with low token count and no upstream failure are dropped.
+# Two rules: one for LLM route logs, one for MCP route logs.
+# LLM sampling: only persist abnormal or expensive requests.
+# MCP sampling: only persist error responses.
+# Both tag ai_route_type for searchable filtering in the UI.
 rules:
-  - name: envoy-ai-gateway-access-log
+  - name: envoy-ai-gateway-llm-access-log
     layer: ENVOY_AI_GATEWAY
     dsl: |
       filter {
+        // Only process LLM route logs (gen_ai.request.model is always set for LLM routes, even on errors)
+        if (tag("gen_ai.request.model") == "" || tag("gen_ai.request.model") == "-") {
+          abort {}
+        }
+        // Drop normal logs: response < 400, no upstream failure, low token count
+        // Keep only: error responses (>= 400), upstream failures, or high-token requests (>= 10000)
         if (tag("response_code") != "" && tag("response_code") != "-") {
           if (tag("response_code") as Integer < 400) {
             if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
-              if (tag("gen_ai.usage.input_tokens") != "" && tag("gen_ai.usage.input_tokens") != "-"
-                  && tag("gen_ai.usage.output_tokens") != "" && tag("gen_ai.usage.output_tokens") != "-") {
-                if ((tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {
-                  abort {}
-                }
-              }
+              abort {}
             }

Review Comment:
   The log sampling comments still mention token-based sampling ("low token count" / "high-token requests (>= 10000)") but the token threshold logic has been removed and the filter now only keeps (a) HTTP >= 400 or (b) upstream failures. Please update these comments to reflect the actual behavior to avoid misleading future changes.
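The label-dimension point in the `backend_error_cpm` review comment above is easy to demonstrate outside of SkyWalking. The sketch below is plain Python, not MAL, and the sample values and label names are illustrative; it mimics what summing by a list of labels does when only the named labels are kept as the series key:

```python
# Minimal sketch of label-based aggregation, assuming each sample is a
# (labels, value) pair. Dropping 'mcp_backend' from the grouping key merges
# all backends into a single series, which is exactly the bug flagged above.
from collections import defaultdict

def sum_by(samples, keep_labels):
    """Sum sample values, keeping only the named labels as the series key."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple((k, labels[k]) for k in keep_labels)
        out[key] += value
    return dict(out)

# Illustrative error-count samples from two MCP backends on the same pod.
samples = [
    ({"mcp_backend": "backend-a", "service_name": "gw", "service_instance_id": "pod-1"}, 3.0),
    ({"mcp_backend": "backend-b", "service_name": "gw", "service_instance_id": "pod-1"}, 5.0),
]

# Without 'mcp_backend': one merged series, indistinguishable from error_cpm.
merged = sum_by(samples, ["service_name", "service_instance_id"])
# With 'mcp_backend': one series per backend, as the breakdown widget expects.
per_backend = sum_by(samples, ["mcp_backend", "service_name", "service_instance_id"])

print(len(merged), len(per_backend))  # → 1 2
```

In MAL terms, this is why adding `mcp_backend` to the `sum([...])` dimensions restores the per-backend breakdown without changing the overall error total.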
##########
oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml:
##########

@@ -50,3 +54,32 @@ rules:
       sink {
       }
     }
+
+  - name: envoy-ai-gateway-mcp-access-log
+    layer: ENVOY_AI_GATEWAY
+    dsl: |
+      filter {
+        // Only process MCP route logs
+        if (tag("mcp.method.name") == "" || tag("mcp.method.name") == "-") {
+          abort {}
+        }
+
+        // Only persist error responses
+        if (tag("response_code") != "" && tag("response_code") != "-") {
+          if (tag("response_code") as Integer < 400) {
+            abort {}
+          }
+        }

Review Comment:
   The MCP rule is documented as "Only persist error responses", but the current logic only aborts when `response_code` is present and < 400. If `response_code` is missing/"-", the log will be persisted, which contradicts the stated policy. Consider explicitly aborting when `response_code` is missing, or handling upstream-failure signals separately if those should be retained.
   ```suggestion
        if (tag("response_code") == "" || tag("response_code") == "-") {
          abort {}
        }
        if (tag("response_code") as Integer < 400) {
          abort {}
        }
   ```

##########
test/e2e-v2/cases/envoy-ai-gateway/e2e.yaml:
##########

@@ -47,6 +48,28 @@ setup:
         -d '{"model":"nonexistent-model","messages":[{"role":"user","content":"Hi"}]}'
       sleep 1
     done
+  - name: Send MCP requests for metric verification
+    command: |
+      # Initialize MCP session and capture session ID
+      SESSION_ID=$(curl -s -D- --max-time 15 \
+        http://${aigw_host}:${aigw_1975}/mcp \
+        -H 'Content-Type: application/json' \
+        -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"e2e-test","version":"1.0"}}}' \
+        2>&1 | grep -i 'Mcp-Session-Id' | tr -d '\r' | sed 's/Mcp-Session-Id: //')
+      sleep 2
+      # tools/list
+      curl -s --max-time 15 -o /dev/null \
+        http://${aigw_host}:${aigw_1975}/mcp \
+        -H 'Content-Type: application/json' \
+        -H "Mcp-Session-Id: $SESSION_ID" \
+        -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'
+      sleep 1
+      # tools/call — call a nonexistent tool to generate an error
+      curl -s --max-time 15 -o /dev/null \

Review Comment:
   The MCP session ID extraction does not validate that `SESSION_ID` was actually captured. If the header is not present (or the grep/sed pipeline changes), subsequent requests will still run with an empty `Mcp-Session-Id`, potentially making the test silently flaky. Add a check that fails fast when `SESSION_ID` is empty (and/or have curl fail on non-2xx).
   ```suggestion
         SESSION_ID=$(curl -sS -f -D- --max-time 15 \
           http://${aigw_host}:${aigw_1975}/mcp \
           -H 'Content-Type: application/json' \
           -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"e2e-test","version":"1.0"}}}' \
           2>&1 | grep -i '^Mcp-Session-Id:' | tr -d '\r' | sed 's/^[Mm]cp-[Ss]ession-[Ii]d: *//')
         if [ -z "$SESSION_ID" ]; then
           echo "Failed to capture MCP session ID from initialize response" >&2
           exit 1
         fi
         sleep 2
         # tools/list
         curl -sS -f --max-time 15 -o /dev/null \
           http://${aigw_host}:${aigw_1975}/mcp \
           -H 'Content-Type: application/json' \
           -H "Mcp-Session-Id: $SESSION_ID" \
           -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'
         sleep 1
         # tools/call — call a nonexistent tool to generate an error
         curl -sS -f --max-time 15 -o /dev/null \
   ```

##########
docs/en/setup/backend/backend-envoy-ai-gateway-monitoring.md:
##########

@@ -4,8 +4,8 @@
 [Envoy AI Gateway](https://aigateway.envoyproxy.io/) is a gateway/proxy for AI/LLM API traffic
 (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, etc.) built on top of Envoy Proxy.
-It natively emits GenAI metrics and access logs via OTLP, following
-[OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/).
+It natively emits GenAI metrics, MCP (Model Context Protocol) metrics, and access logs via OTLP,
+following [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/).
Review Comment:
   This sentence now implies MCP metrics follow the OpenTelemetry GenAI Semantic Conventions. GenAI semconv covers GenAI signals, but MCP metrics are a separate set (as defined by Envoy AI Gateway). Consider rephrasing to avoid suggesting MCP is part of the GenAI semantic conventions (e.g., say GenAI follows the semconv, and MCP metrics are emitted via OTLP as well).
   ```suggestion
   It natively emits GenAI metrics following
   [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/),
   and also emits MCP (Model Context Protocol) metrics and access logs via OTLP.
   ```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
