lhotari commented on code in PR #25513:
URL: https://github.com/apache/pulsar/pull/25513#discussion_r3184369154
##########
pip/pip-471.md:
##########
@@ -0,0 +1,302 @@
+# PIP-471: Authorization Operation Metrics
+
+# Background knowledge
+
+Pulsar brokers perform authorization checks before allowing clients, proxies,
and administrative callers to access
+topics, namespaces, tenants, brokers, clusters, and policy operations. These
checks are handled through the broker-side
+`AuthorizationService`, which delegates decisions to the configured
`AuthorizationProvider`.
+
+Pulsar already exposes security-related metrics, especially around
authentication. These metrics help operators detect
+login failures, unhealthy clients, and changes in access patterns. However,
Pulsar does not expose a generic broker-level
+metric stream for authorization outcomes. Authorization denials are mostly
visible through request failures and logs,
+which makes them harder to alert on and harder to compare with successful
authorization traffic.
+
+Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry
metrics. New broker observability features
+should keep those pipelines aligned when possible, so operators can consume
equivalent signals regardless of their
+metrics backend.
+
+# Motivation
+
+Operators need a low-cardinality, broker-native signal that shows whether
authorization checks are succeeding or failing.
+This is useful for security alerting, baseline monitoring, and
compliance-oriented reporting.
+
+Without a dedicated authorization metric, operators have to infer
authorization denials from logs, HTTP status codes, or
+client-side errors. That is brittle and does not support standard monitoring
patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Distinguishing authentication failures from authorization failures.
+- Building dashboards by authorization resource category.
+- Exporting equivalent authorization signals through both Prometheus and
OpenTelemetry.
+
+A failure-only metric is also not sufficient. Operators often need success,
failure, and error counts together to
+understand whether a denial spike reflects an attack, a rollout issue, a
policy mistake, an authorization provider
+problem, or a normal traffic shift.
+
+# Goals
+
+## In Scope
+
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record successful, failed, and errored authorization operations.
+- Expose the metric through the Prometheus-compatible broker metrics endpoint.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so broker authorization
paths share the same metric model.
+- Avoid identity-bearing or high-cardinality metric dimensions.
+
+## Out of Scope
+
+- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or binary protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+- Configuration to enable or disable this specific metric.
+
+# High Level Design
+
+Introduce a generic authorization operation counter that is incremented when
the broker finishes an authorization
+decision or rejects an authorization request before invoking the configured
provider.
+
+The metric is recorded centrally in `AuthorizationService`, which is the
broker-side entry point for authorization checks
+across topic, namespace, tenant, broker, cluster, and policy operations. Each
completed provider decision or direct
+authorization rejection emits one result with a small, fixed dimension set:
+
+- the resource type that was checked
+- the operation that was requested
+- whether the result was a success, failure, or error
+
+This metric is exported in two equivalent forms by the same helper class:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+Invalid original-principal combinations in proxied authorization flows are
counted as authorization failures because the
+broker rejects the request during authorization handling. For valid proxied
authorization flows, the broker evaluates
+both the proxy role and the original principal, and each completed
authorization decision is recorded.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This proposal introduces
`org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker
authorization
+metrics helper that owns:
+
+- a Prometheus `Counter` for broker metrics scraping
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+The helper uses the following constants for metric names, instrumentation
scope, label values, and OpenTelemetry
+attribute keys:
+
+| Constant | Value |
+|---|---|
+| `AUTHORIZATION_OPERATIONS_METRIC_NAME` |
`pulsar_authorization_operations_total` |
+| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count`
|
+| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
+| `RESULT_SUCCESS` | `success` |
+| `RESULT_FAILURE` | `failure` |
+| `RESULT_ERROR` | `error` |
+| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
+| `OPERATION_KEY` | `pulsar.authorization.operation` |
+| `RESULT_KEY` | `pulsar.authorization.result` |
+
+`AuthorizationMetrics` registers a static Prometheus counter with labels
`resource_type`, `operation`, and `result`.
+It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry`
instance passed to the constructor.
+
+The helper exposes three recording methods:
+
+| Method | Behavior |
+|---|---|
+| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter
with `result="success"` and adds `1` to the OpenTelemetry counter with
`pulsar.authorization.result="success"`. |
+| `recordFailure(resourceType, operation)` | Increments the Prometheus counter
with `result="failure"` and adds `1` to the OpenTelemetry counter with
`pulsar.authorization.result="failure"`. |
+| `recordError(resourceType, operation)` | Increments the Prometheus counter
with `result="error"` and adds `1` to the OpenTelemetry counter with
`pulsar.authorization.result="error"`. |
+
+`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing
`AuthorizationService` constructor remains
+available and delegates to a new constructor with `OpenTelemetry.noop()`.
`BrokerService` constructs
+`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so
the OpenTelemetry counter is exported by
+the broker's OpenTelemetry pipeline.
+
+`AuthorizationService` records a result after each completed authorization
operation. If the provider returns `true`, the
+helper records a success. If the provider returns `false`, the helper records
a failure. If the provider future completes
+exceptionally, the helper records an error because authorization evaluation
failed before a boolean decision was returned.
+
+If `AuthorizationService` rejects a request before provider evaluation, such
as an invalid original-principal combination
+for proxied requests, it records a failure directly and returns a completed
`false` future. Existing
+authorization-disabled short-circuit behavior is preserved; operation methods
that already return early when
+authorization is disabled do not emit this metric on that path.
+
+The instrumentation applies to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+The metric dimensions are intentionally bounded. The resource type is selected
from a fixed set of constants in
+`AuthorizationMetrics`. The operation is `check` for superuser and
tenant-admin checks. For enum-backed operations, the
+operation is the lower-case enum name. If an existing authorization path does
not provide an operation value, the metric
+uses a fixed `unknown` operation value rather than failing the request path or
introducing dynamic labels.
+
+The metric does not include role names, topic names, tenant names, namespace
names, client addresses, provider names,
+exception classes, or error messages.
+
+## Public-facing Changes
+
+### Public API
+
+No public client, admin, REST, or `AuthorizationProvider` API changes.
+
+### Binary protocol
+
+No binary protocol changes.
+
+### Configuration
+
+No new configuration is required.
+
+### CLI
+
+No CLI changes.
+
+### Metrics
+
+Prometheus metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar_authorization_operations_total` |
+| Description | Pulsar authorization operations |
+| Type | Counter |
+| Labels | `resource_type`, `operation`, `result` |
+| Unit | operations |
+
+OpenTelemetry metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar.authorization.operation.count` |
+| Description | The number of authorization operations |
+| Type | `LongCounter` |
+| Attributes | `pulsar.authorization.resource.type`,
`pulsar.authorization.operation`, `pulsar.authorization.result` |
+| Unit | `{operation}` |
+
+Result values:
+
+| Value | Meaning |
+|---|---|
+| `success` | The authorization request was allowed. |
+| `failure` | The authorization request was denied or rejected by
authorization handling. |
Review Comment:
Since this metric label represents an authorization outcome, could we use
allowed/denied instead of success/failure? The explicit values will read more
clearly in dashboards and PromQL queries, and avoid ambiguity about what
success means here (policy evaluated successfully vs. access granted).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]