Copilot commented on code in PR #13772:
URL: https://github.com/apache/skywalking/pull/13772#discussion_r3014177486
########## test/e2e-v2/cases/envoy-ai-gateway/docker-compose.yml: ##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway e2e — ai-gateway-cli + Ollama + SkyWalking OAP
+#
+# Architecture:
+#   trigger → ai-gateway-cli (port 1975) → ollama (port 11434)
+#                     ↓ OTLP gRPC
+#              oap (port 11800) → banyandb
+
+services:
+  banyandb:
+    extends:
+      file: ../../script/docker-compose/base-compose.yml
+      service: banyandb
+    networks:
+      - e2e
+
+  oap:
+    extends:
+      file: ../../script/docker-compose/base-compose.yml
+      service: oap
+    environment:
+      SW_STORAGE: banyandb
+    ports:
+      - 12800
+    depends_on:
+      banyandb:
+        condition: service_healthy
+
+  ollama:
+    image: ollama/ollama:0.6.2
+    networks:
+      - e2e
+    expose:
+      - 11434
+    healthcheck:
+      test: ["CMD", "ollama", "list"]
+      interval: 5s
+      timeout: 60s
+      retries: 120
+
+  aigw:
+    # TODO: pin to a release version once ai-gateway-cli HTTP listener is available in a release
+    image: envoyproxy/ai-gateway-cli:latest

Review Comment:
   The E2E test uses `envoyproxy/ai-gateway-cli:latest`, which makes CI non-deterministic and can break when upstream publishes a new image. Please pin to a released version or at least a specific image digest, and update the TODO accordingly.
```suggestion
    # TODO: bump ai-gateway-cli version when updating Envoy AI Gateway in tests
    image: envoyproxy/ai-gateway-cli:v0.1.0
```

########## oap-server/server-starter/src/main/resources/otel-rules/envoy-ai-gateway/gateway-service.yaml: ##########
@@ -0,0 +1,103 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway — Service-level metrics
+#
+# Source OTLP metrics (dots → underscores by OTel receiver):
+#   gen_ai_client_token_usage — Histogram (Delta), labels: gen_ai_token_type, gen_ai_provider_name, gen_ai_response_model
+#   gen_ai_server_request_duration — Histogram (Delta), unit: seconds
+#   gen_ai_server_time_to_first_token — Histogram (Delta), unit: seconds, streaming only
+#   gen_ai_server_time_per_output_token — Histogram (Delta), unit: seconds, streaming only
+#
+# All durations are in seconds from the AI Gateway; multiply by 1000 for ms display.
+
+filter: "{ tags -> tags.job_name == 'envoy-ai-gateway' }"
+expSuffix: service(['service_name'], Layer.ENVOY_AI_GATEWAY)
+metricPrefix: meter_envoy_ai_gw
+
+metricsRules:
+  # ===================== Aggregate metrics =====================
+
+  # Request CPM — count of requests per minute
+  - name: request_cpm
+    exp: gen_ai_server_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # Request latency average (ms)
+  - name: request_latency_avg
+    exp: gen_ai_server_request_duration_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_request_duration_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # Request latency percentile (ms)
+  - name: request_latency_percentile
+    exp: gen_ai_server_request_duration.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # Input token rate (tokens/min)
+  - name: input_token_rate
+    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'input').sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # Output token rate (tokens/min)
+  - name: output_token_rate
+    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'output').sum(['service_name', 'service_instance_id']).increase('PT1M')
+
+  # TTFT average (ms) — streaming requests only
+  - name: ttft_avg
+    exp: gen_ai_server_time_to_first_token_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_time_to_first_token_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # TTFT percentile (ms)
+  - name: ttft_percentile
+    exp: gen_ai_server_time_to_first_token.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
+
+  # TPOT average (ms) — time per output token, streaming only
+  - name: tpot_avg
+    exp: gen_ai_server_time_per_output_token_sum.sum(['service_name', 'service_instance_id']).increase('PT1M') / gen_ai_server_time_per_output_token_count.sum(['service_name', 'service_instance_id']).increase('PT1M') * 1000
+
+  # TPOT percentile (ms)
+  - name: tpot_percentile
+    exp: gen_ai_server_time_per_output_token.sum(['le', 'service_name', 'service_instance_id']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

Review Comment:
   Service-scope rules are still grouping by `service_instance_id` (e.g., `sum(['service_name','service_instance_id'])`). In MAL, labels not used as entity keys become metric labels, so this produces per-instance time series under the *service* entity (high cardinality and incorrect "service aggregate" charts). For service-level aggregate/provider/model metrics, drop `service_instance_id` from the group-by; keep it only in the instance-scope rule file.

```suggestion
    exp: gen_ai_server_request_duration_count.sum(['service_name']).increase('PT1M')

  # Request latency average (ms)
  - name: request_latency_avg
    exp: gen_ai_server_request_duration_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_request_duration_count.sum(['service_name']).increase('PT1M') * 1000

  # Request latency percentile (ms)
  - name: request_latency_percentile
    exp: gen_ai_server_request_duration.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

  # Input token rate (tokens/min)
  - name: input_token_rate
    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'input').sum(['service_name']).increase('PT1M')

  # Output token rate (tokens/min)
  - name: output_token_rate
    exp: gen_ai_client_token_usage_sum.tagEqual('gen_ai_token_type', 'output').sum(['service_name']).increase('PT1M')

  # TTFT average (ms) — streaming requests only
  - name: ttft_avg
    exp: gen_ai_server_time_to_first_token_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_time_to_first_token_count.sum(['service_name']).increase('PT1M') * 1000

  # TTFT percentile (ms)
  - name: ttft_percentile
    exp: gen_ai_server_time_to_first_token.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000

  # TPOT average (ms) — time per output token, streaming only
  - name: tpot_avg
    exp: gen_ai_server_time_per_output_token_sum.sum(['service_name']).increase('PT1M') / gen_ai_server_time_per_output_token_count.sum(['service_name']).increase('PT1M') * 1000

  # TPOT percentile (ms)
  - name: tpot_percentile
    exp: gen_ai_server_time_per_output_token.sum(['le', 'service_name']).increase('PT1M').histogram().histogram_percentile([50,75,90,95,99]) * 1000
```

########## oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java: ##########
@@ -99,18 +103,20 @@ public void processMetricsRequest(final ExportMetricsServiceRequest requests) {
             log.debug("Resource attributes: {}", request.getResource().getAttributesList());
         }

-        final Map<String, String> nodeLabels =
-            request
-                .getResource()
-                .getAttributesList()
-                .stream()
-                .collect(toMap(
-                    it -> LABEL_MAPPINGS
-                        .getOrDefault(it.getKey(), it.getKey())
-                        .replaceAll("\\.", "_"),
-                    it -> anyValueToString(it.getValue()),
-                    (v1, v2) -> v1
-                ));
+        // First pass: collect all resource attributes with dots replaced by underscores
+        final Map<String, String> nodeLabels = new HashMap<>();
+        for (final var it : request.getResource().getAttributesList()) {
+            final String key = it.getKey().replaceAll("\\.", "_");

Review Comment:
   In this hot path, `String#replaceAll("\\.", "_")` incurs regex overhead. `String#replace('.', '_')` (or a precompiled `Pattern`) is cheaper and does the same full replacement here.

```suggestion
            final String key = it.getKey().replace('.', '_');
```

########## oap-server/server-starter/src/main/resources/lal/envoy-ai-gateway.yaml: ##########
@@ -0,0 +1,47 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Envoy AI Gateway access log processing via OTLP.
+#
+# Sampling policy: only persist abnormal or expensive requests.
+# Normal 200 responses with low token count and no upstream failure are dropped.
+
+rules:
+  - name: envoy-ai-gateway-access-log
+    layer: ENVOY_AI_GATEWAY
+    dsl: |
+      filter {
+        // Drop normal logs: response < 400, no upstream failure, low token count
+        if (tag("response_code") as Integer < 400) {
+          if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
+            if ((tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {

Review Comment:
   This rule casts `tag("response_code")` / token counts to `Integer` without guarding for missing/empty/non-numeric values. In LAL, `tag("...")` returns an empty string when absent, and `as Integer` ends up calling `Integer.parseInt("")` which throws and can break log analysis for the whole record. Add presence checks (e.g., ensure the tag is non-empty before casting) and decide on a safe default behavior when attributes are missing.

```suggestion
        // Guard casts: only cast response_code to Integer when present/non-empty.
        if (tag("response_code") != "" && (tag("response_code") as Integer) < 400) {
          if (tag("upstream_transport_failure_reason") == "" || tag("upstream_transport_failure_reason") == "-") {
            // Guard casts for token counts: only cast when both tags are present/non-empty.
            if (tag("gen_ai.usage.input_tokens") != "" && tag("gen_ai.usage.output_tokens") != "" && (tag("gen_ai.usage.input_tokens") as Integer) + (tag("gen_ai.usage.output_tokens") as Integer) < 10000) {
```

########## docs/en/swip/SWIP-10/SWIP.md: ##########
@@ -360,7 +404,7 @@ Each access log record is pushed as an OTLP LogRecord with the following structu

 | Attribute | Example | Notes |
 |---|---|---|
-| `aigw.service` | `envoy-ai-gateway-basic` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking service name |
+| `job_name` | `envoy-ai-gateway` | From `OTEL_RESOURCE_ATTRIBUTES` — MAL/LAL routing tag |
 | `service.instance.id` | `aigw-pod-7b9f4d8c5` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking instance name |
 | `service.name` | `envoy-ai-gateway` | From `OTEL_SERVICE_NAME` — mapped to `job_name` for rule routing |

Review Comment:
   This note says `service.name` is "mapped to `job_name` for rule routing", but the receiver change in this PR explicitly removes the `service.name → job_name` mapping and routing is now based on the explicit `job_name` resource attribute. Please update this row to avoid confusing users about which attribute drives MAL/LAL routing.
```suggestion
| `job_name` | `envoy-ai-gateway` | From `OTEL_RESOURCE_ATTRIBUTES` — primary MAL/LAL routing tag |
| `service.instance.id` | `aigw-pod-7b9f4d8c5` | From `OTEL_RESOURCE_ATTRIBUTES` — SkyWalking instance name |
| `service.name` | `envoy-ai-gateway` | From `OTEL_SERVICE_NAME` — not used for MAL/LAL routing (routing uses `job_name`) |
```

########## oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java: ##########
@@ -154,7 +160,7 @@ private static Map<String, String> buildLabels(List<KeyValue> kvs) {
         return kvs
             .stream()
             .collect(toMap(
-                KeyValue::getKey,
+                it -> it.getKey().replaceAll("\\.", "_"),
                 it -> anyValueToString(it.getValue())
             ));

Review Comment:
   `buildLabels()` now normalizes keys by replacing `.` with `_`, which can cause key collisions (e.g., both `a.b` and `a_b` → `a_b`). With `Collectors.toMap(...)` and no merge function, this throws `IllegalStateException` and can fail the whole metrics request. Consider building the map imperatively with `putIfAbsent`, or provide a merge function to deterministically resolve collisions.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
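Editor's note: the `buildLabels()` collision comment above suggests an imperative `putIfAbsent` build. A minimal, self-contained sketch of that approach follows; the `KeyValue` record is a stand-in for the OTLP proto type and `buildLabels` here is illustrative, not the PR's actual code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LabelNormalization {
    // Stand-in for the OTLP KeyValue proto type (illustrative only).
    record KeyValue(String key, String value) {}

    // Collision-safe label building: keys are normalized with the cheap
    // replace(char, char), and the first occurrence of a normalized key
    // wins instead of throwing IllegalStateException on duplicates.
    static Map<String, String> buildLabels(List<KeyValue> kvs) {
        final Map<String, String> labels = new LinkedHashMap<>();
        for (final KeyValue kv : kvs) {
            labels.putIfAbsent(kv.key().replace('.', '_'), kv.value());
        }
        return labels;
    }

    public static void main(String[] args) {
        // "a.b" and "a_b" normalize to the same key; the first one wins.
        System.out.println(buildLabels(List.of(
            new KeyValue("a.b", "dotted"),
            new KeyValue("a_b", "underscored"))));  // prints {a_b=dotted}
    }
}
```

If the stream style is preferred, `Collectors.toMap(keyFn, valueFn, (v1, v2) -> v1)` with an explicit merge function resolves the same collisions deterministically.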
