Copilot commented on code in PR #13772:
URL: https://github.com/apache/skywalking/pull/13772#discussion_r3014177486
########## test/e2e-v2/cases/envoy-ai-gateway/docker-compose.yml: ##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway e2e — ai-gateway-cli + Ollama + SkyWalking OAP
+#
+# Architecture:
+#   trigger → ai-gateway-cli (port 1975) → ollama (port 11434)
+#                     ↓ OTLP gRPC
+#              oap (port 11800) → banyandb
+
+services:
+  banyandb:
+    extends:
+      file: ../../script/docker-compose/base-compose.yml
+      service: banyandb
+    networks:
+      - e2e
+
+  oap:
+    extends:
+      file: ../../script/docker-compose/base-compose.yml
+      service: oap
+    environment:
+      SW_STORAGE: banyandb
+    ports:
+      - 12800
+    depends_on:
+      banyandb:
+        condition: service_healthy
+
+  ollama:
+    image: ollama/ollama:0.6.2
+    networks:
+      - e2e
+    expose:
+      - 11434
+    healthcheck:
+      test: ["CMD", "ollama", "list"]
+      interval: 5s
+      timeout: 60s
+      retries: 120
+
+  aigw:
+    # TODO: pin to a release version once ai-gateway-cli HTTP listener is available in a release
+    image: envoyproxy/ai-gateway-cli:latest

Review Comment:
   The E2E test uses `envoyproxy/ai-gateway-cli:latest`, which makes CI non-deterministic and can break when upstream publishes a new image. Please pin to a released version or at least a specific image digest, and update the TODO accordingly.
```suggestion
    # TODO: bump ai-gateway-cli version when updating Envoy AI Gateway in tests
    image: envoyproxy/ai-gateway-cli:v0.1.0
```

########## oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-service.yaml: ##########
@@ -0,0 +1,103 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway — Service-level metrics
+#
+# Source OTLP metrics (dots → underscores by OTel receiver):
+#   gen_ai_client_token_usage — Histogram (Delta), labels: gen_ai_token_type, gen_ai_provider_name, gen_ai_response_model
+#   gen_ai_server_request_duration — Histogram (Delta), unit: seconds
+#   gen_ai_server_time_to_first_token — Histogram (Delta), unit: seconds, streaming only
+#   gen_ai_server_time_per_output_token — Histogram (Delta), unit: seconds, streaming only
+#
+# All durations are in seconds from the AI Gateway; multiply by 1000 for ms display.
+
+filter: "{ tags -> tags.job_name == 'envoy-ai-gateway' }"
+expSuffix: service(['service_name'], Layer.ENVOY_AI_GATEWAY)
+metricPrefix: meter_envoy_ai_gw
+
+metricsRules:
+  # ===================== Aggregate metrics =====================
+
+  # Request CPM — count of requests per minute
+  - name: request_cpm
+    exp: gen_ai_server_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # Request latency average (ms)
+  - name: request_latency_avg
+    exp: gen_ai_server_request_duration_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # Request latency percentile (ms)
+  - name: request_latency_percentile
+    exp: gen_ai_server_request_duration.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # Input token rate (tokens/min)
+  - name: input_token_rate
+    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'input').sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # Output token rate (tokens/min)
+  - name: output_token_rate
+    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'output').sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # TTFT average (ms) — streaming requests only
+  - name: ttft_avg
+    exp: gen_ai_server_time_to_first_token_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_time_to_first_token_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # TTFT percentile (ms)
+  - name: ttft_percentile
+    exp: gen_ai_server_time_to_first_token.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # TPOT average (ms) — time per output token, streaming only
+  - name: tpot_avg
+    exp: gen_ai_server_time_per_output_token_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_time_per_output_token_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # TPOT percentile (ms)
+  - name: tpot_percentile
+    exp: gen_ai_server_time_per_output_token.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

Review Comment:
   Service-scope rules are still grouping by `service_instance_id` (e.g., `sum(['service_name','service_instance_id'])`). In MAL, labels not used as entity keys become metric labels, so this produces per-instance time series under the *service* entity (high cardinality and incorrect "service aggregate" charts). For service-level aggregate/provider/model metrics, drop `service_instance_id` from the group-by; keep it only in the instance-scope rule file.

```suggestion
    exp: gen_ai_server_request_duration_count.sum(['service_name']).increase('PT1M')

  # Request latency average (ms)
  - name: request_latency_avg
    exp: gen_ai_server_request_duration_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_request_duration_count.sum(['service_name']).increase('PT1M') * 1000

  # Request latency percentile (ms)
  - name: request_latency_percentile
    exp: gen_ai_server_request_duration.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

  # Input token rate (tokens/min)
  - name: input_token_rate
    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'input').sum(['service_name']).increase('PT1M')

  # Output token rate (tokens/min)
  - name: output_token_rate
    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'output').sum(['service_name']).increase('PT1M')

  # TTFT average (ms) — streaming requests only
  - name: ttft_avg
    exp: gen_ai_server_time_to_first_token_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_time_to_first_token_count.sum(['service_name']).increase('PT1M') * 1000

  # TTFT percentile (ms)
  - name: ttft_percentile
    exp: gen_ai_server_time_to_first_token.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

  # TPOT average (ms) — time per output token, streaming only
  - name: tpot_avg
    exp: gen_ai_server_time_per_output_token_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_time_per_output_token_count.sum(['service_name']).increase('PT1M') * 1000

  # TPOT percentile (ms)
  - name: tpot_percentile
    exp: gen_ai_server_time_per_output_token.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
```

########## oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java: ##########
@@ -99,18 +103,20 @@ public void processMetricsRequest(final ExportMetricsServiceRequest requests) {
             log.debug("Resource attributes: {}", request.getResource().getAttributesList());
         }

-        final Map<String, String> nodeLabels =
-            request
-                .getResource()
-                .getAttributesList()
-                .stream()
-                .collect(toMap(
-                    it -> LABEL_MAPPINGS
-                        .getOrDefault(it.getKey(), it.getKey())
-                        .replaceAll("\\.", "_"),
-                    it -> anyValueToString(it.getValue()),
-                    (v1, v2) -> v1
-                ));
+        // First pass: collect all resource attributes with dots replaced by underscores
+        final Map<String, String> nodeLabels = new HashMap<>();
+        for (final var it : request.getResource().getAttributesList()) {
+            final String key = it.getKey().replaceAll("\\.", "_");

Review Comment:
   In this hot path, `String#replaceAll("\\.", "_")` incurs regex overhead. `String#replace('.', '_')` (or a precompiled `Pattern`) is cheaper and does the same full replacement here.

```suggestion
            final String key = it.getKey().replace('.', '_');
```

########## oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml: ##########
@@ -0,0 +1,47 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway access log processing via OTLP.
+#
+# Sampling policy: only persist abnormal or expensive requests.
+# Normal 200 responses with low token count and no upstream failure are dropped.
+
+rules:
+  - name: envoy-ai-gateway-access-log
+    layer: ENVOY_AI_GATEWAY
+    dsl: |
+      filter {
+        // Drop normal logs: response < 400, no upstream failure, low token count
+        if (tag("response_code") as Integer < 400) {
+          if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
+            if ((tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {

Review Comment:
   This rule casts `tag("response_code")` / token counts to `Integer` without guarding for missing/empty/non-numeric values. In LAL, `tag("...")` returns an empty string when absent, and `as Integer` ends up calling `Integer.parseInt("")` which throws and can break log analysis for the whole record. Add presence checks (e.g., ensure the tag is non-empty before casting) and decide on a safe default behavior when attributes are missing.

```suggestion
        // Guard casts: only cast response_code to Integer when present/non-empty.
        if (tag("response_code") != "" && (tag("response_code") as Integer) < 400) {
          if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
            // Guard casts for token counts: only cast when both tags are present/non-empty.
            if (tag("gen_ai.usage.input_tokens") != "" && tag("gen_ai.usage.output_tokens") != "" && (tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {
```

########## docs/en/swip/SWIP-10/SWIP.md: ##########
@@ -360,7 +404,7 @@ Each access log record is pushed as an OTLP LogRecord with the following structu

 | Attribute | Example | Notes |
 |---|---|---|
-| `aigw.service` | `envoy-ai-gateway-basic` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking service name |
+| `job_name` | `envoy-ai-gateway` | From `OTEL_RESOURCE_ATTRIBUTES` — MAL/LAL routing tag |
 | `service.instance.id` | `aigw-pod-7b9f4d8c5` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking instance name |
 | `service.name` | `envoy-ai-gateway` | From `OTEL_SERVICE_NAME` — mapped to `job_name` for rule routing |

Review Comment:
   This note says `service.name` is "mapped to `job_name` for rule routing", but the receiver change in this PR explicitly removes the `service.name → job_name` mapping and routing is now based on the explicit `job_name` resource attribute. Please update this row to avoid confusing users about which attribute drives MAL/LAL routing.
```suggestion
| `job_name` | `envoy-ai-gateway` | From `OTEL_RESOURCE_ATTRIBUTES` — primary MAL/LAL routing tag |
| `service.instance.id` | `aigw-pod-7b9f4d8c5` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking instance name |
| `service.name` | `envoy-ai-gateway` | From `OTEL_SERVICE_NAME` — not used for MAL/LAL routing (routing uses `job_name`) |
```

########## oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java: ##########
@@ -154,7 +160,7 @@ private static Map<String, String> buildLabels(List<KeyValue> kvs) {
         return kvs
             .stream()
             .collect(toMap(
-                KeyValue::getKey,
+                it -> it.getKey().replaceAll("\\.", "_"),
                 it -> anyValueToString(it.getValue())
             ));

Review Comment:
   `buildLabels()` now normalizes keys by replacing `.` with `_`, which can cause key collisions (e.g., both `a.b` and `a_b` → `a_b`). With `Collectors.toMap(...)` and no merge function, this throws `IllegalStateException` and can fail the whole metrics request. Consider building the map imperatively with `putIfAbsent`, or provide a merge function to deterministically resolve collisions.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
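Editor's note: the `buildLabels()` collision comment above suggests an imperative `putIfAbsent` build. A minimal, self-contained sketch of that approach follows; the `KeyValue` record is a stand-in for the OTLP proto type and `buildLabels` here is illustrative, not the PR's actual code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LabelNormalization {
    // Stand-in for the OTLP KeyValue proto type (illustrative only).
    record KeyValue(String key, String value) {}

    // Collision-safe label building: keys are normalized with the cheap
    // replace(char, char), and the first occurrence of a normalized key
    // wins instead of throwing IllegalStateException on duplicates.
    static Map<String, String> buildLabels(List<KeyValue> kvs) {
        final Map<String, String> labels = new LinkedHashMap<>();
        for (final KeyValue kv : kvs) {
            labels.putIfAbsent(kv.key().replace('.', '_'), kv.value());
        }
        return labels;
    }

    public static void main(String[] args) {
        // "a.b" and "a_b" normalize to the same key; the first one wins.
        System.out.println(buildLabels(List.of(
            new KeyValue("a.b", "dotted"),
            new KeyValue("a_b", "underscored"))));  // prints {a_b=dotted}
    }
}
```

If the stream style is preferred, `Collectors.toMap(keyFn, valueFn, (v1, v2) -> v1)` with an explicit merge function resolves the same collisions deterministically.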
