This is an automated email from the ASF dual-hosted git repository.

mattisonchao pushed a commit to branch mattison/pip-298-authorization-metrics
in repository https://gitbox.apache.org/repos/asf/pulsar.git

commit e98dd50fbc1ead8fbc96564a6995af9863068d4f
Author: mattisonchao <[email protected]>
AuthorDate: Mon Apr 13 18:27:09 2026 +0800

    Add PIP for authorization operation metrics
---
 pip/pip-298.md | 309 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 161 insertions(+), 148 deletions(-)

diff --git a/pip/pip-298.md b/pip/pip-298.md
index ec154f228c8..7b4123f178b 100644
--- a/pip/pip-298.md
+++ b/pip/pip-298.md
@@ -1,199 +1,212 @@
-# PIP-298: Support read transaction buffer snapshot segments from earliest
+# PIP-298: Authorization Operation Metrics
 
-# Background
+# Background knowledge
 
-In the implementation of the Pulsar Transaction, each topic is configured with 
a `Transaction Buffer` to prevent
-consumers from reading uncommitted messages, which are invisible until the 
transaction is committed. Transaction Buffer
-works with Position (maxReadPosition) and `TxnID` Set (aborts). The broker 
only dispatches messages, before the
-maxReadPosition, to the consumers. When the broker dispatches the messages 
before maxReadPosition to the consumer, the
-messages sent by aborted transactions will get filtered by the Transaction 
Buffer.
+Pulsar brokers perform authorization checks before allowing clients, proxies, 
and administrative callers to access
+topics, namespaces, tenants, brokers, and policy operations. These checks are 
enforced through the broker-side
+`AuthorizationService`, which delegates the decision to the configured 
`AuthorizationProvider`.
+
+Pulsar already exposes several security-related metrics, especially around 
authentication. These metrics are useful for
+detecting login failures, unhealthy client behavior, and changes in access 
patterns. However, Pulsar does not expose an
+equivalent broker-level metric stream for authorization outcomes. In practice, 
authorization failures are primarily
+visible through logs and request failures rather than through a dedicated 
metric.
+
+Pulsar also supports OpenTelemetry metrics. For operational consistency, new 
broker observability features should align
+Prometheus-style metrics with OpenTelemetry counters rather than introducing 
instrumentation in only one pipeline.
 
 # Motivation
 
-Currently, Pulsar transactions do not have configurable isolation levels. By 
introducing isolation level configuration
-for consumers, we can enhance the flexibility of Pulsar transactions.
-
-Let's consider an example:
-
-**System**: Financial Transaction System
-
-**Operations**: Large volume of deposit and withdrawal operations, a
-small number of transfer operations.
-
-**Roles**:
-
-- **Client A1**
-- **Client A2**
-- **User Account B1**
-- **User Account B2**
-- **Request Topic C**
-- **Real-time Monitoring System D**
-- **Business Processing System E**
-
-**Client Operations**:
-
-- **Withdrawal**: Client A1 decreases the deposit amount from User
-  Account B1 or B2.
-- **Deposit**: Client A1 increases the deposit amount in User Account B1 or B2.
-- **Transfer**: Client A2 decreases the deposit amount from User
-  Account B1 and increases it in User Account B2. Or vice versa.
-
-**Real-time Monitoring System D**: Obtains the latest data from
-Request Topic C as quickly as possible to monitor transaction data and
-changes in bank reserves in real-time. This is necessary for the
-timely detection of anomalies and real-time decision-making.
-
-**Business Processing System E**: Reads data from Request Topic C,
-then actually operates User Accounts B1, B2.
-
-**User Scenario**: Client A1 sends a large number of deposit and
-withdrawal requests to Request Topic C. Client A2 writes a small
-number of transfer requests to Request Topic C.
-
-In this case, Business Processing System E needs a read-committed
-isolation level to ensure operation consistency and Exactly Once
-semantics. The real-time monitoring system does not care if a small
-number of transfer requests are incomplete (dirty data). What it
-cannot tolerate is a situation where a large number of deposit and
-withdrawal requests cannot be presented in real time due to a small
-number of transfer requests (the current situation is that uncommitted
-transaction messages can block the reading of committed transaction
-messages).
-
-In this case, it is necessary to set different isolation levels for
-different consumers/subscriptions.
-The uncommitted transactions do not impact actual users' bank accounts.
-Business Processing System E only reads committed transactional
-messages and operates users' accounts. It needs Exactly-once semantic.
-Real-time Monitoring System D reads uncommitted transactional
-messages. It does not need Exactly-once semantic.
-
-They use different subscriptions and choose different isolation
-levels. One needs transaction, one does not.
-In general, multiple subscriptions of the same topic do not all
-require transaction guarantees.
-Some want low latency without the exact-once semantic guarantee, and
-some must require the exactly-once guarantee.
-We just provide a new option for different subscriptions.
-
-# Goal
+Operators need a low-cardinality, broker-native signal that shows whether 
authorization checks are succeeding or
+failing. This is needed for security alerting, baseline monitoring, and 
compliance reporting.
+
+Without a dedicated authorization metric, operators have to infer 
authorization denials from logs, HTTP status codes,
+or client-side errors. That is brittle and does not support standard 
monitoring patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Building dashboards that differentiate between authentication problems and 
authorization problems.
+- Exporting equivalent signals through both Prometheus and OpenTelemetry.
+
+The lack of a generic metric also encourages overly narrow designs such as a 
failure-only counter. That limits
+observability because operators often need both success and failure counts to 
understand whether a denial spike reflects
+an attack, a rollout problem, or a normal traffic shift.
+
+# Goals
 
 ## In Scope
 
-Implement Read Committed and Read Uncommitted isolation levels for Pulsar 
transactions. Allow consumers to configure
-isolation levels during the building process.
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record both successful and failed authorization decisions.
+- Expose the metric through Prometheus-compatible broker metrics.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so all broker 
authorization paths share the same metric model.
 
 ## Out of Scope
 
-None.
+- Per-role, per-topic, per-tenant, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+
 
 # High Level Design
 
-Add a configuration 'subscriptionIsolationLevel' in the consumer builder to 
allow users to choose different transaction
-isolation levels.
+Introduce a generic authorization operation counter that is incremented 
whenever the broker finishes an authorization
+decision.
+
+The metric is recorded centrally in `AuthorizationService`, which already 
serves as the broker-side entry point for
+authorization checks across topic, namespace, tenant, broker, cluster, and 
policy operations. Each authorization check
+will emit one result with a small, fixed label set:
+
+- what kind of resource was checked
+- what operation category was requested
+- whether the result was a success or failure
+
+This metric will be exported in two equivalent forms:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry counter for modern metrics pipelines
+
+Invalid original-principal combinations in proxied authorization flows will 
also be counted as authorization failures,
+because they represent rejected authorization attempts from the broker’s 
perspective.
 
 # Detailed Design
 
+## Design & Implementation Details
+
+This proposal introduces a broker authorization metrics helper that owns:
+
+- a Prometheus counter for broker metrics scraping
+- an OpenTelemetry `LongCounter` for broker metrics export
+
+The helper is instantiated by `AuthorizationService`. `AuthorizationService` 
records results after each completed
+authorization decision. If the provider returns `true`, the helper records a 
success. If the provider returns `false`,
+the helper records a failure. If `AuthorizationService` rejects a request 
before provider evaluation, such as an
+invalid original-principal combination for proxied requests, the helper 
records a failure directly.
+
+The instrumentation is attached to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+This proposal intentionally keeps the label space small. It does not include 
role names, topic names, tenant names,
+client addresses, provider names, or error strings.
+
 ## Public-facing Changes
 
-Update the PulsarConsumer builder process to include isolation level 
configurations for Read Committed and Read
-Uncommitted.
+### Public API
 
-### Before the Change
+No public API changes.
 
-The PulsarConsumer builder process currently does not include isolation level 
configurations. The consumer creation
-process might look like this:
+### Binary protocol
 
-```
-PulsarClient client = 
PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
+No binary protocol changes.
 
-Consumer<String> consumer = client.newConsumer(Schema.STRING)
-        .topic("persistent://my-tenant/my-namespace/my-topic")
-        .subscriptionName("my-subscription")
-        .subscriptionType(SubscriptionType.Shared)
-        .subscribe();
-```
+### Configuration
 
-### After the Change
+No new configuration is required.
 
-Update the PulsarConsumer builder process to include isolation level 
configurations for Read Committed and Read
-Uncommitted. Introduce a new method subscriptionIsolationLevel() in the 
consumer builder, which accepts an enumeration
-value representing the isolation level:
+### CLI
 
-```
-public enum SubscriptionIsolationLevel {
-    // Consumer can only consume all transactional messages which have been 
committed.
-    READ_COMMITTED,
+No CLI changes.
 
-    // Consumer can consume all messages, even transactional messages which 
have been aborted.
-    READ_UNCOMMITTED;
-}
-```
+### Metrics
 
-Then, modify the consumer creation process to include the new isolation level 
configuration:
+Prometheus metric:
 
-```
-PulsarClient client = 
PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
+- Full name: `pulsar_authorization_operations_total`
+- Description: Total number of broker authorization operations.
+- Attributes:
+  - `resource_type`
+  - `operation`
+  - `result`
+- Unit: operations
 
-Consumer<String> consumer = client.newConsumer(Schema.STRING)
-        .topic("persistent://my-tenant/my-namespace/my-topic")
-        .subscriptionName("my-subscription")
-        .subscriptionType(SubscriptionType.Shared)
-        .subscriptionIsolationLevel(SubscriptionIsolationLevel.READ_COMMITTED) 
// Adding the isolation level configuration
-        .subscribe();
-```
+OpenTelemetry metric:
 
-With this change, users can now choose between Read Committed and Read 
Uncommitted isolation levels when creating a new
-consumer. If the isolationLevel() method is not called during the builder 
process, the default isolation level will be
-Read Committed.
-Note that this is a subscription dimension configuration, and all consumers 
under the same subscription need to be
-configured with the same IsolationLevel.
+- Full name: `pulsar.authorization.operation.count`
+- Description: Total number of broker authorization operations.
+- Attributes:
+  - `pulsar.authorization.type`
+  - `pulsar.authorization.operation`
+  - `pulsar.authorization.result`
+- Unit: `{operation}`
 
-## Design & Implementation Details
+Attribute values:
 
-### Client Changes
+- `result`: `success` or `failure`
+- `resource_type`: fixed categories such as `topic`, `namespace`, `tenant`, 
`broker`, `cluster`, `superuser`,
+  `tenant_admin`, `topic_policy`, `namespace_policy`, and `cluster_policy`
+- `operation`: the normalized operation name for the authorization check
 
-Update the PulsarConsumer builder to accept isolation level configurations for 
Read Committed and Read Uncommitted levels.
 
-In order to achieve the above goals, the following modifications need to be 
made:
+# Monitoring
 
-- Added `IsolationLevel` related fields and methods in 
`ConsumerConfigurationData` and `ConsumerBuilderImpl` and `ConsumerImpl`
+Operators should monitor both absolute authorization failures and the 
relationship between failures and successes.
+Recommended patterns include:
 
-- Modify PulsarApi.CommandSubscribe, add field -- IsolationLevel
+- Alert on sustained increases in `result="failure"`.
+- Build dashboards that show `success` and `failure` together by 
`resource_type`.
+- Investigate rollout regressions by comparing the failure rate before and 
after authorization policy changes.
+- Distinguish authentication incidents from authorization incidents by 
correlating authorization failures with existing
+  authentication metrics.
 
-```
-message CommandSubscribe {
+This proposal intentionally enables ratio-based alerting, such as 
failure/success comparisons, by including both result
+types in the same metric family.
 
-    enum IsolationLevel {
-    READ_COMMITTED = 0;
-    READ_UNCOMMITTED = 1;
-    }
-    optional IsolationLevel isolation_level = 20 [default = READ_COMMITTED];
-}
-```
+# Security Considerations
 
-### Broker changes
+This proposal improves security observability but does not change 
authorization semantics.
 
-Modify the transaction buffer and dispatching mechanisms to handle messages 
based on the chosen isolation level.
+Because authorization decisions can be high volume and can involve sensitive 
identifiers, the metric must avoid
+high-cardinality or identity-bearing labels. This proposal therefore excludes 
role names, topic names, namespaces,
+tenants, and client network information from metric attributes. That preserves 
operational usefulness without turning
+the metric into a data-leak or cardinality risk.
 
-In order to achieve the above goals, the following modifications need to be 
made:
+Failed proxy original-principal validation is counted as an authorization 
failure because the broker rejects the
+request during authorization handling.
 
-- Determine in the `readMoreEntries` method of Dispatchers such as 
`PersistentDispatcherSingleActiveConsumer`
-  and `PersistentDispatcherMultipleConsumers`:
+# Backward & Forward Compatibility
 
-  - If Subscription.isolationLevel == ReadCommitted, then MaxReadPosition = 
topic.getMaxReadPosition(), that is,
-    transactionBuffer.getMaxReadPosition()
+## Upgrade
 
-  - If Subscription.isolationLevel == ReadUnCommitted, then MaxReadPosition = 
PositionImpl.LATEST
+No special upgrade action is required. The new metrics appear automatically 
after upgrading brokers that include this
+feature.
 
-- Add a new metrics `subscriptionIsolationLevel` in `SubscriptionStatsImpl`.
+## Downgrade / Rollback
 
-# Monitoring
+Downgrading removes the metrics. Monitoring systems should tolerate 
missing-series behavior during rollback.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+No geo-replication protocol or metadata compatibility changes are introduced.
+
+# Alternatives
+
+- Failure-only counter:
+  Rejected because operators often need both success and failure counts to 
interpret changes correctly and to build
+  ratio-based alerts.
+
+- Add detailed identity labels such as role or topic:
+  Rejected due to cardinality and privacy concerns.
+
+- Instrument each authorization call site independently:
+  Rejected because it would be error-prone and would likely produce 
inconsistent semantics across broker paths.
+
+# General Notes
 
-After this PIP, Users can query the subscription stats of a topic through the 
admin tool, and observe the `subscriptionIsolationLevel` in the subscription 
stats to determine the isolation level of the subscription.
+This proposal is intentionally limited to broker metrics. It does not attempt 
to replace audit logging or structured
+security events.
 
 # Links
 
-* Mailing List discussion thread: 
https://lists.apache.org/thread/8ny0qtp7m9qcdbvnfjdvpnkc4c5ssyld
-* Mailing List voting thread: 
https://lists.apache.org/thread/4q1hrv466h8w9ccpf4moxt6jv1jxp1mr
-* Document link: https://github.com/apache/pulsar-site/pull/712
+* Mailing List discussion thread:
+* Mailing List voting thread:

Reply via email to