This is an automated email from the ASF dual-hosted git repository.

mattisonchao pushed a commit to branch mattison/pip-298-authorization-metrics
in repository https://gitbox.apache.org/repos/asf/pulsar.git

commit e98dd50fbc1ead8fbc96564a6995af9863068d4f
Author: mattisonchao <[email protected]>
AuthorDate: Mon Apr 13 18:27:09 2026 +0800

    Add PIP for authorization operation metrics
---
 pip/pip-298.md | 309 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 161 insertions(+), 148 deletions(-)

diff --git a/pip/pip-298.md b/pip/pip-298.md
index ec154f228c8..7b4123f178b 100644
--- a/pip/pip-298.md
+++ b/pip/pip-298.md
@@ -1,199 +1,212 @@
-# PIP-298: Support read transaction buffer snapshot segments from earliest
+# PIP-298: Authorization Operation Metrics
-# Background
+# Background knowledge
-In the implementation of the Pulsar Transaction, each topic is configured with a `Transaction Buffer` to prevent
-consumers from reading uncommitted messages, which are invisible until the transaction is committed. Transaction Buffer
-works with Position (maxReadPosition) and `TxnID` Set (aborts). The broker only dispatches messages, before the
-maxReadPosition, to the consumers. When the broker dispatches the messages before maxReadPosition to the consumer, the
-messages sent by aborted transactions will get filtered by the Transaction Buffer.
+Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access
+topics, namespaces, tenants, brokers, and policy operations. These checks are enforced through the broker-side
+`AuthorizationService`, which delegates the decision to the configured `AuthorizationProvider`.
+
+Pulsar already exposes several security-related metrics, especially around authentication. These metrics are useful for
+detecting login failures, unhealthy client behavior, and changes in access patterns. However, Pulsar does not expose an
+equivalent broker-level metric stream for authorization outcomes. In practice, authorization failures are primarily
+visible through logs and request failures rather than through a dedicated metric.
+
+Pulsar also supports OpenTelemetry metrics. For operational consistency, new broker observability features should align
+Prometheus-style metrics with OpenTelemetry counters rather than introducing instrumentation in only one pipeline.
 # Motivation
-Currently, Pulsar transactions do not have configurable isolation levels. By introducing isolation level configuration
-for consumers, we can enhance the flexibility of Pulsar transactions.
-
-Let's consider an example:
-
-**System**: Financial Transaction System
-
-**Operations**: Large volume of deposit and withdrawal operations, a
-small number of transfer operations.
-
-**Roles**:
-
-- **Client A1**
-- **Client A2**
-- **User Account B1**
-- **User Account B2**
-- **Request Topic C**
-- **Real-time Monitoring System D**
-- **Business Processing System E**
-
-**Client Operations**:
-
-- **Withdrawal**: Client A1 decreases the deposit amount from User
-  Account B1 or B2.
-- **Deposit**: Client A1 increases the deposit amount in User Account B1 or B2.
-- **Transfer**: Client A2 decreases the deposit amount from User
-  Account B1 and increases it in User Account B2. Or vice versa.
-
-**Real-time Monitoring System D**: Obtains the latest data from
-Request Topic C as quickly as possible to monitor transaction data and
-changes in bank reserves in real-time. This is necessary for the
-timely detection of anomalies and real-time decision-making.
-
-**Business Processing System E**: Reads data from Request Topic C,
-then actually operates User Accounts B1, B2.
-
-**User Scenario**: Client A1 sends a large number of deposit and
-withdrawal requests to Request Topic C. Client A2 writes a small
-number of transfer requests to Request Topic C.
-
-In this case, Business Processing System E needs a read-committed
-isolation level to ensure operation consistency and Exactly Once
-semantics. The real-time monitoring system does not care if a small
-number of transfer requests are incomplete (dirty data). What it
-cannot tolerate is a situation where a large number of deposit and
-withdrawal requests cannot be presented in real time due to a small
-number of transfer requests (the current situation is that uncommitted
-transaction messages can block the reading of committed transaction
-messages).
-
-In this case, it is necessary to set different isolation levels for
-different consumers/subscriptions.
-The uncommitted transactions do not impact actual users' bank accounts.
-Business Processing System E only reads committed transactional
-messages and operates users' accounts. It needs Exactly-once semantic.
-Real-time Monitoring System D reads uncommitted transactional
-messages. It does not need Exactly-once semantic.
-
-They use different subscriptions and choose different isolation
-levels. One needs transaction, one does not.
-In general, multiple subscriptions of the same topic do not all
-require transaction guarantees.
-Some want low latency without the exact-once semantic guarantee, and
-some must require the exactly-once guarantee.
-We just provide a new option for different subscriptions.
-
-# Goal
+Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or
+failing. This is needed for security alerting, baseline monitoring, and compliance reporting.
+
+Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes,
+or client-side errors. That is brittle and does not support standard monitoring patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Building dashboards that differentiate between authentication problems and authorization problems.
+- Exporting equivalent signals through both Prometheus and OpenTelemetry.
+
+The lack of a generic metric also encourages overly narrow designs such as a failure-only counter. That limits
+observability because operators often need both success and failure counts to understand whether a denial spike reflects
+an attack, a rollout problem, or a normal traffic shift.
+
+# Goals
 ## In Scope
-Implement Read Committed and Read Uncommitted isolation levels for Pulsar transactions. Allow consumers to configure
-isolation levels during the building process.
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record both successful and failed authorization decisions.
+- Expose the metric through Prometheus-compatible broker metrics.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so all broker authorization paths share the same metric model.
 ## Out of Scope
-None.
+- Per-role, per-topic, per-tenant, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+
 # High Level Design
-Add a configuration 'subscriptionIsolationLevel' in the consumer builder to allow users to choose different transaction
-isolation levels.
+Introduce a generic authorization operation counter that is incremented whenever the broker finishes an authorization
+decision.
+
+The metric is recorded centrally in `AuthorizationService`, which already serves as the broker-side entry point for
+authorization checks across topic, namespace, tenant, broker, cluster, and policy operations. Each authorization check
+will emit one result with a small, fixed label set:
+
+- what kind of resource was checked
+- what operation category was requested
+- whether the result was a success or failure
+
+This metric will be exported in two equivalent forms:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry counter for modern metrics pipelines
+
+Invalid original-principal combinations in proxied authorization flows will also be counted as authorization failures,
+because they represent rejected authorization attempts from the broker’s perspective.
 # Detailed Design
+## Design & Implementation Details
+
+This proposal introduces a broker authorization metrics helper that owns:
+
+- a Prometheus counter for broker metrics scraping
+- an OpenTelemetry `LongCounter` for broker metrics export
+
+The helper is instantiated by `AuthorizationService`. `AuthorizationService` records results after each completed
+authorization decision. If the provider returns `true`, the helper records a success. If the provider returns `false`,
+the helper records a failure. If `AuthorizationService` rejects a request before provider evaluation, such as an
+invalid original-principal combination for proxied requests, the helper records a failure directly.
+
+The instrumentation is attached to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+This proposal intentionally keeps the label space small. It does not include role names, topic names, tenant names,
+client addresses, provider names, or error strings.
+
 ## Public-facing Changes
-Update the PulsarConsumer builder process to include isolation level configurations for Read Committed and Read
-Uncommitted.
+### Public API
-### Before the Change
+No public API changes.
-The PulsarConsumer builder process currently does not include isolation level configurations. The consumer creation
-process might look like this:
+### Binary protocol
-```
-PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
+No binary protocol changes.
-Consumer<String> consumer = client.newConsumer(Schema.STRING)
-        .topic("persistent://my-tenant/my-namespace/my-topic")
-        .subscriptionName("my-subscription")
-        .subscriptionType(SubscriptionType.Shared)
-        .subscribe();
-```
+### Configuration
-### After the Change
+No new configuration is required.
-Update the PulsarConsumer builder process to include isolation level configurations for Read Committed and Read
-Uncommitted. Introduce a new method subscriptionIsolationLevel() in the consumer builder, which accepts an enumeration
-value representing the isolation level:
+### CLI
-```
-public enum SubscriptionIsolationLevel {
-    // Consumer can only consume all transactional messages which have been committed.
-    READ_COMMITTED,
+No CLI changes.
-    // Consumer can consume all messages, even transactional messages which have been aborted.
-    READ_UNCOMMITTED;
-}
-```
+### Metrics
-Then, modify the consumer creation process to include the new isolation level configuration:
+Prometheus metric:
-```
-PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
+- Full name: `pulsar_authorization_operations_total`
+- Description: Total number of broker authorization operations.
+- Attributes:
+  - `resource_type`
+  - `operation`
+  - `result`
+- Unit: operations
-Consumer<String> consumer = client.newConsumer(Schema.STRING)
-        .topic("persistent://my-tenant/my-namespace/my-topic")
-        .subscriptionName("my-subscription")
-        .subscriptionType(SubscriptionType.Shared)
-        .subscriptionIsolationLevel(SubscriptionIsolationLevel.READ_COMMITTED) // Adding the isolation level configuration
-        .subscribe();
-```
+OpenTelemetry metric:
-With this change, users can now choose between Read Committed and Read Uncommitted isolation levels when creating a new
-consumer. If the isolationLevel() method is not called during the builder process, the default isolation level will be
-Read Committed.
-Note that this is a subscription dimension configuration, and all consumers under the same subscription need to be
-configured with the same IsolationLevel.
+- Full name: `pulsar.authorization.operation.count`
+- Description: Total number of broker authorization operations.
+- Attributes:
+  - `pulsar.authorization.type`
+  - `pulsar.authorization.operation`
+  - `pulsar.authorization.result`
+- Unit: `{operation}`
-## Design & Implementation Details
+Attribute values:
-### Client Changes
+- `result`: `success` or `failure`
+- `resource_type`: fixed categories such as `topic`, `namespace`, `tenant`, `broker`, `cluster`, `superuser`,
+  `tenant_admin`, `topic_policy`, `namespace_policy`, and `cluster_policy`
+- `operation`: the normalized operation name for the authorization check
-Update the PulsarConsumer builder to accept isolation level configurations for Read Committed and Read Uncommitted levels.
-In order to achieve the above goals, the following modifications need to be made:
+# Monitoring
-- Added `IsolationLevel` related fields and methods in `ConsumerConfigurationData` and `ConsumerBuilderImpl` and `ConsumerImpl`
+Operators should monitor both absolute authorization failures and the relationship between failures and successes.
+Recommended patterns include:
-- Modify PulsarApi.CommandSubscribe, add field -- IsolationLevel
+- Alert on sustained increases in `result="failure"`.
+- Build dashboards that show `success` and `failure` together by `resource_type`.
+- Investigate rollout regressions by comparing the failure rate before and after authorization policy changes.
+- Distinguish authentication incidents from authorization incidents by correlating authorization failures with existing
+  authentication metrics.
-```
-message CommandSubscribe {
+This proposal intentionally enables ratio-based alerting, such as failure/success comparisons, by including both result
+types in the same metric family.
-    enum IsolationLevel {
-        READ_COMMITTED = 0;
-        READ_UNCOMMITTED = 1;
-    }
-    optional IsolationLevel isolation_level = 20 [default = READ_COMMITTED];
-}
-```
+# Security Considerations
-### Broker changes
+This proposal improves security observability but does not change authorization semantics.
-Modify the transaction buffer and dispatching mechanisms to handle messages based on the chosen isolation level.
+Because authorization decisions can be high volume and can involve sensitive identifiers, the metric must avoid
+high-cardinality or identity-bearing labels. This proposal therefore excludes role names, topic names, namespaces,
+tenants, and client network information from metric attributes. That preserves operational usefulness without turning
+the metric into a data-leak or cardinality risk.
-In order to achieve the above goals, the following modifications need to be made:
+Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the
+request during authorization handling.
-- Determine in the `readMoreEntries` method of Dispatchers such as `PersistentDispatcherSingleActiveConsumer`
-  and `PersistentDispatcherMultipleConsumers`:
+# Backward & Forward Compatibility
-    - If Subscription.isolationLevel == ReadCommitted, then MaxReadPosition = topic.getMaxReadPosition(), that is,
-      transactionBuffer.getMaxReadPosition()
+## Upgrade
-    - If Subscription.isolationLevel == ReadUnCommitted, then MaxReadPosition = PositionImpl.LATEST
+No special upgrade action is required. The new metrics appear automatically after upgrading brokers that include this
+feature.
-- Add a new metrics `subscriptionIsolationLevel` in `SubscriptionStatsImpl`.
+## Downgrade / Rollback
-# Monitoring
+Downgrading removes the metrics. Monitoring systems should tolerate missing-series behavior during rollback.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+No geo-replication protocol or metadata compatibility changes are introduced.
+
+# Alternatives
+
+- Failure-only counter:
+  Rejected because operators often need both success and failure counts to interpret changes correctly and to build
+  ratio-based alerts.
+
+- Add detailed identity labels such as role or topic:
+  Rejected due to cardinality and privacy concerns.
+
+- Instrument each authorization call site independently:
+  Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.
+
+# General Notes
-After this PIP, Users can query the subscription stats of a topic through the admin tool, and observe the `subscriptionIsolationLevel` in the subscription stats to determine the isolation level of the subscription.
+This proposal is intentionally limited to broker metrics. It does not attempt to replace audit logging or structured
+security events.
 # Links
-* Mailing List discussion thread: https://lists.apache.org/thread/8ny0qtp7m9qcdbvnfjdvpnkc4c5ssyld
-* Mailing List voting thread: https://lists.apache.org/thread/4q1hrv466h8w9ccpf4moxt6jv1jxp1mr
-* Document link: https://github.com/apache/pulsar-site/pull/712
+* Mailing List discussion thread:
+* Mailing List voting thread:
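
Illustrative note (not part of the commit): the counting model that the new "Design & Implementation Details" text describes could be sketched roughly as follows. This is a minimal stand-in, assuming an in-memory map in place of the Prometheus counter and the OpenTelemetry `LongCounter` that the PIP names. The class `AuthorizationMetricsSketch`, its methods, and the operation value `produce` are hypothetical and are not taken from the Pulsar codebase.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the metrics helper described in the PIP; not Pulsar code. */
public class AuthorizationMetricsSketch {
    // One counter per (resource_type, operation, result) combination.
    // The label set is small and fixed, so the map stays low-cardinality.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private static String key(String resourceType, String operation, String result) {
        return resourceType + "|" + operation + "|" + result;
    }

    /** Records one completed authorization decision as a success or a failure. */
    public void record(String resourceType, String operation, boolean allowed) {
        String result = allowed ? "success" : "failure";
        counters.computeIfAbsent(key(resourceType, operation, result), k -> new LongAdder())
                .increment();
    }

    /** Returns the current count for one label combination, or 0 if never recorded. */
    public long count(String resourceType, String operation, String result) {
        LongAdder adder = counters.get(key(resourceType, operation, result));
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        AuthorizationMetricsSketch metrics = new AuthorizationMetricsSketch();
        // "produce" is an assumed operation name used only for illustration.
        metrics.record("topic", "produce", true);
        metrics.record("topic", "produce", true);
        metrics.record("topic", "produce", false);
        System.out.println("success=" + metrics.count("topic", "produce", "success"));
        System.out.println("failure=" + metrics.count("topic", "produce", "failure"));
    }
}
```

Keeping success and failure in the same keyed family is what makes the ratio-based comparisons mentioned in the Monitoring section possible. A failure recorded before provider evaluation, such as the proxied original-principal case, would go through the same `record(..., false)` path in this sketch.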
